So, in that picture, there are 3 labels (grayed out, but they are c, n and s). At each time point on the x axis (corresponding to a Prometheus scrape interval), each label has a metric value of 1 (up, green) or 0 (down, red). In this example, there are no down values.
What I'd like to have: a graph/panel that shows 2 metrics for each label (so 6 in total for the 3 labels) and keeps updating based on the values in the graph I shared here. The metrics are: #1) percent of up (green) over a month, #2) percent of up (green) over a year.
So it’s like a table with 3 rows (the labels c/n/s) and 3 columns (1st column for the label, 2nd column for metric #1, 3rd column for metric #2).
@pdn
This is just a quick mockup, but the only way I see this working is to have 6 stat panels and a status history graph with 3 rows. Not sure how to do the % uptime calculation with Prometheus, but I would guess someone has done it already.
Thank you @grant2, that’s exactly what I want to have.
If anyone has done this already, please share how you would do it with Prometheus data. I already have data coming in from Prometheus, either 1 (up) or 0 (down), at the pre-defined scrape interval.
And from ChatGPT:
To calculate the percentage of uptime using a Prometheus query, you can use the up metric, which is a built-in metric in Prometheus that represents the health of a target (instance). The up metric has a value of 1 when the target is up and healthy, and 0 when the target is down. You can use the avg_over_time() function to get the average uptime over a certain time range.
Here’s a sample query to calculate the percentage of uptime for all instances over the last 1 hour: avg(avg_over_time(up[1h])) * 100
This query will give you the average uptime percentage for all instances being monitored by Prometheus during the last hour. If you want to calculate the percentage of uptime for a specific instance or a group of instances, you can use the instance label in the query: avg(avg_over_time(up{instance="your_instance_name"}[1h])) * 100
Replace your_instance_name with the actual name or IP of the instance you want to calculate the uptime percentage for.
Similarly, if you want to calculate the percentage of uptime for a specific job or service, you can use the job label in the query: avg(avg_over_time(up{job="your_job_name"}[1h])) * 100
Replace your_job_name with the actual name of the job or service you want to calculate the uptime percentage for.
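For the per-label month/year table asked about earlier in the thread, a minimal sketch along the same lines might look like the following. The metric name service_status and the label name service are hypothetical placeholders for whatever your 1/0 series is actually called. Dropping the outer avg() keeps one result per label instead of collapsing everything into a single number.

```promql
# Percent of "up" samples over the last 30 days, one result per label set
avg_over_time(service_status[30d]) * 100

# Same idea over the last year. This only works if Prometheus retains a year
# of data; the default retention is 15 days.
avg_over_time(service_status[1y]) * 100

# If the series carries extra labels, aggregate down to the label used as the row key
avg by (service) (avg_over_time(service_status[30d])) * 100
```

Each of these is a separate query. In Grafana they would typically be run as instant queries, either in a table panel (one query per column) or in individual stat panels.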
I did google it before, but I thought it might be quicker to post in the Grafana community, which is kind of cheating.
Thank you again @grant2! I will look into it more. But if anyone has already done this, I would love to see it!
I struggled to come up with the top 2x3 graph with % values.
I don’t know how to do it; I have already tried so many different things and different graph types. It would be great to see an example with a detailed how-to. Many thanks in advance!
@grant2, I have one more question. I want to exclude NaN (null) data from the avg() calculation. ChatGPT gave me this query, but I got the error below. I googled it but found no good hint. Any idea?
bad_data: 1:19: parse error: binary expression must contain only scalar and instant vector types
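The query that triggered this error is not shown here, so the following is only a guess at the cause. Prometheus reports this parse error when a comparison or arithmetic operator is applied directly to a range vector such as up[1h]; binary operators only accept instant vectors and scalars. One possible workaround is to filter the instant vector first and then take the range with a subquery. The metric name service_status, the 30d window and the 1m resolution are assumptions for illustration.

```promql
# Fails to parse: the range vector service_status[30d] is an operand of ">="
# avg_over_time(service_status[30d] >= 0) * 100

# Filter on the instant vector, then turn the result into a range with a subquery.
# NaN samples fail the ">= 0" comparison, so they are dropped before averaging.
avg_over_time((service_status >= 0)[30d:1m]) * 100
```

Note that avg_over_time() over a plain range selector only averages the samples that actually exist, so scrapes that are simply missing need no special handling there. The filter is only needed if the series contains real NaN sample values.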
@grant2, can you help me with the enhancement below? Which panel type should I use, and what would the PromQL look like?
I’ve got my % uptime graphs up, working and looking great. I defined an ‘interval’ variable, which the user can select, to see the % uptime values dynamically.
Now, as an enhancement to my graphs, I want a small panel on top that shows a count of the ‘down’ values for a specific compute interval (e.g., 10d, 30d) selected by the user.
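One way this could be sketched, assuming the series only ever takes the values 0 and 1: the number of down samples is the total number of samples minus the sum of the samples. The metric name service_status is a hypothetical placeholder, and $interval is assumed to be the dashboard variable described above, holding a duration such as 10d or 30d.

```promql
# Number of raw samples equal to 0 in the selected window, per label set.
# For a 0/1 series: total sample count minus the sum of the 1s.
count_over_time(service_status[$interval]) - sum_over_time(service_status[$interval])

# Total across all label sets, e.g. for a single stat panel
sum(count_over_time(service_status[$interval]) - sum_over_time(service_status[$interval]))
```

This counts down samples, not outages: with a 1-minute scrape interval, a single 10-minute outage would show up as roughly 10. A stat panel with an instant query seems like a reasonable fit for displaying it.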
If you are using this exporter, then yes, I would say it can collect (in a table or otherwise) all the timestamps when the server was reset. I do not use this exporter myself, but I am sure you can ask further questions on the author’s GitHub page.