Percent uptime graphs

Please refer to the attached picture.

The picture shows 3 labels (grayed out, but they are c, n and s). At each time point on the x-axis (one per Prometheus scrape interval), each label has a metric value of 1 (up, green) or 0 (down, red). In this example, there are no down values.

What I’d like to have: a panel with 2 metrics for each label (6 in total for the 3 labels) that keeps updating based on the values in the graph I shared here. The metrics are: #1) percent of up (green) over a month, #2) percent of up (green) over a year.

So it’s like a table with 3 rows (the labels c/n/s) and 3 columns (the 1st column for the label, the 2nd for metric #1, the 3rd for metric #2).

How can I do that?

Appreciate any help/tip you can offer!

[Image: grafana (1)]

Which visualization type are you using? Status history?

Yes, status history. For the new percentage graphs, the graph type doesn’t matter, as long as we can see the % uptime.

@pdn
This is just a quick mockup, but the only way I see this working is to have 6 stat panels and a status history graph with 3 rows. Not sure how to do the % uptime calculation with Prometheus, but I would guess someone has done it already.

Thank you @grant2, that’s exactly what I want to have.

If anyone has done this already, please share how you did it with Prometheus data. I already have data coming in from Prometheus, either 1 (up) or 0 (down), at the predefined scrape interval.

Random Google searches reveal a lot of hits:

https://www.reddit.com/r/PrometheusMonitoring/comments/ior5kd/how_to_get_the_uptime_of_a_service_and_to_present/

and ChatGPT:
To calculate the percentage of uptime using a Prometheus query, you can use the up metric, which is a built-in metric in Prometheus that represents the health of a target (instance). The up metric has a value of 1 when the target is up and healthy, and 0 when the target is down. You can use the avg_over_time() function to get the average uptime over a certain time range.

Here’s a sample query to calculate the percentage of uptime for all instances over the last 1 hour:
avg(avg_over_time(up[1h])) * 100

This query will give you the average uptime percentage for all instances being monitored by Prometheus during the last hour. If you want to calculate the percentage of uptime for a specific instance or a group of instances, you can use the instance label in the query:
avg(avg_over_time(up{instance="your_instance_name"}[1h])) * 100

Replace your_instance_name with the actual name or IP of the instance you want to calculate the uptime percentage for.

Similarly, if you want to calculate the percentage of uptime for a specific job or service, you can use the job label in the query:
avg(avg_over_time(up{job="your_job_name"}[1h])) * 100

Replace your_job_name with the actual name of the job or service you want to calculate the uptime percentage for.
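
For the per-label month and year figures asked about at the top of the thread, the same idea can be applied without the outer avg(), so each series keeps its own value. This is a minimal sketch, assuming the 0/1 series is the built-in up metric; your_job_name is a placeholder, so substitute whatever metric and label selector actually carry the c/n/s series:

```promql
# Percent up over the last 30 days, one result per series
avg_over_time(up{job="your_job_name"}[30d]) * 100

# Percent up over the last year (1y counts as 365d in PromQL durations)
avg_over_time(up{job="your_job_name"}[1y]) * 100
```

These only cover as much history as Prometheus still stores. The default retention is 15 days, so a yearly figure needs a longer retention setting or some long-term storage behind it.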

Wow, ChatGPT too? :slight_smile:

I did google it before, but I thought it might be quicker to post in the Grafana community, which is kind of cheating :slight_smile:
Thank you again @grant2! I will look into it more. But if anyone has already done this before, I would love to see it!

I struggled to come up with the top 2x3 graph with % values.

I don’t know how to do it; I have already tried so many different things and different graph types. It would be great to see an example with a detailed how-to. Many thanks in advance!

[Image]

Those are 6 separate Stat panels.

I think I got it, by manually creating each stat graph.

You had to manually enter the labels (“% uptime for C this month”, etc) for each panel right?

I was hoping for the graph to pick up the labels automatically, but that doesn’t seem to be the case.

You can create a dashboard variable (with values such as “C”, “N”, etc.) and reference it in the panel title with a $ prefix, so the title follows the selected value.
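
As a rough sketch of that, assuming a single-value dashboard variable named label (a made-up name) whose values match the job label of the series, both the panel title and the query could reference it:

```promql
# Panel title (typed into the panel's Title field, not the query editor):
#   % uptime for $label this month

# Panel query; Grafana substitutes $label before sending it to Prometheus
avg_over_time(up{job="$label"}[30d]) * 100
```

If the c/n/s values live in a different label than job, the selector would need to change accordingly.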

Thank you so much for your help and tips @grant2! I’ve learned a ton today. Have a great weekend!

@grant2, I have one more question. I want to exclude NaN (null) data points from the avg() calculation. ChatGPT gave me the query below, but it fails with the following error. I googled it but found no good hint. Any idea?

bad_data: 1:19: parse error: binary expression must contain only scalar and instant vector types

From chatGPT:

avg(avg_over_time(api_response_time{service="my-service"}[5m] unless api_response_time{service="my-service"} == NaN))*100

It turns out that avg_over_time() already skips missing data points: it averages only the samples that actually exist in the range. There is no need for ‘unless’ at all.
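
For anyone hitting the same parse error: unless is a binary operator that only works between instant vectors, while api_response_time{...}[5m] is a range vector, hence the message. Comparing with == NaN would not match anything either, because NaN is never equal to itself. A sketch of the same query with the unless clause simply dropped:

```promql
# Average over the samples that exist in each 5m window; gaps are not counted
avg(avg_over_time(api_response_time{service="my-service"}[5m])) * 100
```

This covers missing samples only. A sample whose stored value is literally NaN would still turn the average into NaN.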

Just to provide an update.

@grant2, can you help me with the enhancement below? Which panel type should I use, and what would the PromQL look like?

I’ve got my % uptime graphs up, working and looking great. I defined an ‘interval’ variable that the user can select to see the % uptime values for that interval dynamically.

Now, as an enhancement, I want a small panel on top that shows a count of the ‘down’ values over the interval (e.g., 10d, 30d) selected by the user.

Thank you!
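
One possible approach for the down count, as a sketch only: since the series is always 0 or 1, the number of samples minus the sum of the samples equals the number of 0 samples. This assumes the metric is up with a placeholder job selector, and that $interval is the dashboard variable described above; a Stat panel would be a natural fit.

```promql
# Total samples in the window minus the samples equal to 1 = samples equal to 0
count_over_time(up{job="your_job_name"}[$interval])
  - sum_over_time(up{job="your_job_name"}[$interval])
```

This counts down scrapes rather than distinct outages. Multiplying the result by the scrape interval gives an approximate total downtime.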

Is this available for Windows Exporter?

Can I get all the previous uptimes each time we restart the server?

Welcome @jsabat1

If you are using this exporter, then yes, I would say that can collect (in a table or otherwise) all the timestamps when the server was reset. I do not use this exporter, but am sure you can ask further questions on the author’s github page.