State timeline visualisation

~v11 and Ubuntu 22, Prometheus scraping from a file generated by PowerShell scripts

I have a task that runs periodically and produces metrics for Prometheus, mainly in the form of a start time, an end time, and the process result (0, 1, 2, -1, etc.).

The start-time metric becomes available at the start of the run, and the process code is available as -1 (running).
At the end of the job, the result code changes to the result of the job: 0, 1, 2, etc.

I’m using a state timeline to display both the duration of the job and the result of the job itself.

Currently the result of the job is only shown in a different colour (as the result is no longer -1, it can be set to a different colour using an override) at the very end of the ‘block’ on the timeline.

I currently have one metric for state, whose value is updated at the end of the run, so for a one-hour task the value is -1 (running) for most of the period and then changes to 0 (success) for one minute at the end.

Is there any way to use a transformation, or any other approach/way of providing metrics, that would allow me, upon completion, to set the colour of the whole block, from start to finish, based on the final result of the task?

Alternatively, is there a much better way to go about achieving what I want, which is essentially a Gantt chart, over the period of a week or so, for a set of tasks that all have different start and end times and different process returns/statuses on completion?

Ideally, I’d have the end status displayed together with the duration of the task as one block.

Do you have a label that points to a specific job (like an id or pod name or something)? If yes, you could try to do it the following way; if not, I’m not sure whether it’s possible:

Disclaimer: I gave up on creating meaningful example metrics and took some that I already have, which is why there are pokemons; I hope you don’t mind :smile:

  1. First you need two queries: one for the timeline and one for the job status. In my case these were pokemon_caught_total * 0 (the times 0 doesn’t really matter, I think) and max(max_over_time(pokemon_caught_total[$__range:<resolution of your data>])) by (type) (type in my case is the label I can differentiate on; in your case it should be job_id, pod, or something like that, if you have it). The first is a range query, the second an instant query. Both queries are formatted as Table (Query options => Format = Table); both are sketched out after this list.

  2. In Transformations, I joined both tables (Join by field) with mode INNER by type (the common label). Then, with Organize fields by name, I cleaned up the labels I didn’t need (leaving type, time, and the Value of the second query). The final transformation, Prepare time series with format Multi-frame time series, allowed me to create time series out of my tables.

  3. In the State timeline visualization I turned off the Merge equal consecutive values toggle and set Disconnect values to Threshold with 15s, my scrape interval.
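
To make step 1 concrete, here is a rough sketch of the two queries with my example pokemon metric; <resolution of your data> is a placeholder (your scrape interval is a reasonable choice), and type is just the label I can differentiate on:

```
# Query A (range query, Format = Table): drives the timeline blocks.
# Multiplying by 0 just flattens the values; what matters is that samples exist.
pokemon_caught_total * 0

# Query B (instant query, Format = Table): the status value used for colouring.
# Swap <resolution of your data> for e.g. your scrape interval, and "type" for
# your own per-job label (job_id, pod, or similar).
max(max_over_time(pokemon_caught_total[$__range:<resolution of your data>])) by (type)
```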

The result (look at the fire types) after I started my application, deleted it, and started it again (like a job):

One thing you should watch out for: if your series has more than one value within the range, the returned value will be the max of the codes (like the fire pokemons in the result screenshot: even though the first part should have the value 1, as in the right panel, it’s 4, because that’s the max value over time). I don’t know how to do this better, or whether it’s even possible (maybe it is, but I don’t know the way, since Prometheus doesn’t easily let you overwrite values with a query). Hope that helps, or that you can build something on top of this!

^^ Thanks for that, it’s helping to point me in the right direction, although the above bit is definitely still a stumbling block. As pointed out, the colour of the block (done with simple value mappings) is given by the highest value of the process return over the given period of time. For example:

If one task runs once a day for 7 days, and 1 run fails but 6 complete okay, then the colour of all of the blocks is failure/2/whatever your process return code is.

Does anyone know of a way around this?
As good as the above recommendations are, the ‘generalised’ result being given, i.e. the highest value received over the given time, is kind of a deal breaker…

If one task runs once a day

Can you share what the metric looks like? Are those tasks in the same series?

Sure!


^^ Visualised based on the queries you provided for me, basically the exact same setup.
For the top row visible, the first task block resulted in a failure, so it is correctly red, but the second run only resulted in a warning. Failure process returns are 2 while warnings are 1, so the max_over_time query that gives the block colour is setting the colour/value of both blocks to 2/Fail.

I have metrics:
v_job_general_last_result{instance="myInstance.local:9182",job="Insights",scheduled="true"}, where jobName is my unique label per task, which is what your ‘type’ example represented. There are 10-15 different tasks that all provide the above metric, so while running, the metrics available would be:

  • v_job_general_last_result{instance="myInstance.local:9182",job="Insights",scheduled="true",jobName="TestJob1"} -1

  • v_job_general_last_result{instance="myInstance.local:9182",job="Insights",scheduled="true",jobName="TestJob2"} -1

  • v_job_general_last_result{instance="myInstance.local:9182",job="Insights",scheduled="true",jobName="TestJob3"} -1

Then, at a point in time when they have all finished running:

  • v_job_general_last_result{instance="myInstance.local:9182",job="Insights",scheduled="true",jobName="TestJob1"} 0
    ^^ Job success

  • v_job_general_last_result{instance="myInstance.local:9182",job="Insights",scheduled="true",jobName="TestJob2"} 1
    ^^ Job completed with a warning

  • v_job_general_last_result{instance="myInstance.local:9182",job="Insights",scheduled="true",jobName="TestJob3"} 2
    ^^ Job failed

One minute after the job has completed, the file containing the metric is removed, so the last value of the metric is the final status of the task.
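
For reference, my queries are essentially yours adapted to this metric; a rough sketch is below (I’m using a 1m subquery resolution as an example, and jobName plays the role of your type label). The instant query is where the problem comes from: it collapses every run within the dashboard range into a single value per jobName, so one failure dominates all of that job’s blocks for the whole range.

```
# Range query (Format = Table): one series per jobName, drives the timeline blocks
v_job_general_last_result * 0

# Instant query (Format = Table): the value used to colour the blocks.
# max_over_time across the whole $__range returns the highest code seen in the range,
# e.g. 2 (Fail) even if a later run of the same job only returned 1 (Warning).
max(max_over_time(v_job_general_last_result[$__range:1m])) by (jobName)
```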

The difficulty is that the tasks all run on different schedules: some run every 4 hours, some every 12, some every 24, etc. I want this visualisation to be able to show me the last week’s worth of results, using the State Timeline visualisation to give a rough indication of the job duration too (the amount of time the metric is available for roughly fills this purpose, and the blocks in the original screenshot are fine to indicate a rough duration, although that’s more of a nice-to-have than essential).


This is an alternative visualisation I made before, but the status of the task is only represented at the end of the block:


Which isn’t as clear as colouring the whole block of each run.


The metrics are created from PowerShell by me, so I can be reasonably flexible in what I provide and when. For example, I currently also provide start and end times in epoch seconds for each job, in a similar format: v_job_general_last_start{instance="myInstance.local:9182",job="Insights",scheduled="true",jobName="TestJob1"} 123456789 (obviously not a real epoch time, but you get the point).
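
For completeness, while a job is running the set of series for one job looks roughly like this (the values are illustrative, and the end-time metric name below is just a stand-in, since I only quoted the start-time one above):

```
# One job mid-run: result is -1 (running), times are epoch seconds (illustrative values)
v_job_general_last_result{instance="myInstance.local:9182",job="Insights",scheduled="true",jobName="TestJob1"} -1
v_job_general_last_start{instance="myInstance.local:9182",job="Insights",scheduled="true",jobName="TestJob1"} 123456789
# stand-in name: the end time is provided in the same format
v_job_general_last_end{instance="myInstance.local:9182",job="Insights",scheduled="true",jobName="TestJob1"} 123460389
```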

One more question: are the jobName labels reused over time? From the visualization you’ve shown (the screenshot), each job execution is linked to one time series; is that always the case?

The jobName labels correspond to a fixed list of tasks; there are ~12 tasks in total, so 12 different values for jobName, and every time a new metric value is provided the jobName value is re-used. That’s why in the first screenshot some jobs have two blocks, for two different runs of the job.
I hope I’m understanding your question correctly!