How to calculate the number of requests in a time period using PromQL

Hello, community!

I am building a dashboard in Grafana to monitor the latency and the number of requests made to a specific API. The metrics are being collected via Google Cloud Managed Service for Prometheus and accessed in Grafana using Google Cloud Monitoring as the data source.

However, I am facing an issue with how the data is being displayed. The metric I am using to calculate the total number of requests is returning the cumulative number of requests since the start of data collection, regardless of the time period selected in the dashboard.

For example, when I apply a filter to display data from the last 5 minutes, the panel still shows the total number of requests accumulated from the start of data collection, instead of displaying only the total for the selected 5-minute interval.

Below is an illustrative image:

What I need is to calculate the number of requests only within the selected time period in Grafana. Is there a way to adjust the PromQL query so it returns the correct data based on the selected time range? I’ve tried several approaches, but the query always returns the cumulative value of all requests.

Currently, I am using the following PromQL query to calculate the metric total. Does anyone know how I can modify it to return only the requests from the selected time range?

sum(coleta_online_request_count{product_id=~'$product_id', external_bases_name=~'$base_name'})

Here is the configuration for exporting my metrics. Every time a request comes in, I add the value 1 to the metric using the incBy function:

# HELP coleta_online_request_count gauge coleta online request
# TYPE coleta_online_request_count gauge
coleta_online_request_count{product_id="509",base_name="Mock 1",request_status="success"} 1
coleta_online_request_count{product_id="510",base_name="Mock 2",request_status="success"} 1
coleta_online_request_count{product_id="521",base_name="Mock 1",request_status="error"} 1
coleta_online_request_count{product_id="521",base_name="Mock 1",request_status="success"} 1

Any suggestions or ideas would be greatly appreciated. I have been trying for hours without finding a satisfactory result.

Thank you in advance for your help and support!

Best regards,
Reinan

It looks like you’re using a gauge metric to track the number of requests and only ever incrementing it, so the raw value is a running total no matter which time range the dashboard selects. For request counts you should use a counter metric instead. In the meantime, because your gauge only ever increases, it behaves like a counter, so you can still calculate the rate of change over a time period using the rate() function in PromQL.

Try modifying your query like so:

sum(rate(coleta_online_request_count{product_id=~'$product_id', external_bases_name=~'$base_name'}[5m]))

rate() calculates the average per-second rate of requests over the last 5 minutes, and sum() aggregates these rates across all matching time series. This should give you requests/second, which you can multiply by 300 (the number of seconds in 5 minutes) to get the total number of requests in the last 5 minutes (adjust the factor for any other time interval).

Hello, thank you very much for your response and detailed explanation! I will take a closer look at implementing the counter metric.

I conducted an experiment using rate() with the gauge metric I currently have, but for some reason I haven’t been able to identify yet, the values always return zero. However, when I perform the summation without rate(), I can see the values in the table as expected.

In any case, I really appreciate your help, and I’ll follow your recommendation to run experiments using a counter metric. Thank you so much for the guidance!

Hi,
I’d like to put in my two cents’ worth :smile:

1. Gauge vs counter

The main difference between gauges and counters is that a gauge can decrease, while a counter can only increase or be reset to 0. Using counters for the number of HTTP requests is highly recommended. Why not gauges? Imagine your data is being scraped every minute. A gauge might be set to many values between two scrapes that you will never see. A counter only grows, so it doesn’t matter whether the measurement is taken at the 30th or the 45th second, whereas a gauge could spike and drop many times in between. HOWEVER, here you’re using the gauge as a counter, so I would strongly recommend rewriting it as a counter.
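To make the scraping argument concrete, here is a minimal Python sketch (not Prometheus code). The traffic pattern, the 60-second scrape interval, and the names in_flight and total are all assumptions made up for illustration: a burst of 5 requests arrives and finishes between scrapes, so a gauge-style value is back to 0 at scrape time, while a counter-style value still reflects every request.

```python
SCRAPE_INTERVAL = 60  # seconds; assumed scrape interval for this illustration


def simulate(duration=180):
    """Step through time one second at a time and record what each scrape sees."""
    in_flight = 0  # gauge-style value: goes up and down
    total = 0      # counter-style value: only goes up
    gauge_scrapes, counter_scrapes = [], []
    for t in range(1, duration + 1):
        # Hypothetical traffic: 5 requests start at second 30 of every minute
        # and have all finished by second 40.
        if t % 60 == 30:
            in_flight += 5
            total += 5
        elif t % 60 == 40:
            in_flight -= 5
        if t % SCRAPE_INTERVAL == 0:
            gauge_scrapes.append(in_flight)
            counter_scrapes.append(total)
    return gauge_scrapes, counter_scrapes


gauge_scrapes, counter_scrapes = simulate()
print(gauge_scrapes)    # the bursts are invisible at scrape time
print(counter_scrapes)  # the counter still accounts for every request
```

The gauge samples never show the bursts, while the difference between consecutive counter samples recovers the 5 requests per minute.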

2. Rate function

As David pointed out, a counter’s raw value is cumulative, so what you actually want to look at is how much it increased. That’s where the rate function comes in handy: it calculates the average per-second increase of the counter over a given time frame (I’ve seen this called a lookbehind window somewhere and I took quite a liking to that name :smile:). So a query like this:

coleta_online_request_count{product_id=~'$product_id', external_bases_name=~'$base_name'}[5m]

would say “take the coleta_online_request_count series that carry the specified labels and give me all the points of those series in the last 5m” (see the screen below):

Now we can see that the rate (which is (last value - first value) / time between those two points) would be one, since (241 - 1) / 240 == 1 (the denominator is 240 seconds, i.e. four minutes, since that’s the time between the first and last point). If you want the number of requests rather than the rate, you can use the increase function, which acts like rate but without dividing by the time.
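The arithmetic above can be sketched in a few lines of Python. This is a deliberately simplified model, not Prometheus's implementation: the real rate() and increase() also extrapolate to the edges of the window and handle counter resets, which this sketch ignores. The function names and the sample values are assumptions chosen to match the (241 - 1) / 240 example.

```python
def simple_increase(samples):
    """Last value minus first value; samples are (timestamp_seconds, value) pairs."""
    if len(samples) < 2:
        return 0.0  # a single point carries no information about change
    return float(samples[-1][1] - samples[0][1])


def simple_rate(samples):
    """Increase divided by the time between the first and last sample."""
    if len(samples) < 2:
        return 0.0
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return 0.0
    return simple_increase(samples) / elapsed


# Five points, one per minute, growing by 60 each time (assumed values).
growing = [(0, 1), (60, 61), (120, 121), (180, 181), (240, 241)]
# A series that never changes within the window.
flat = [(0, 12), (60, 12), (120, 12), (180, 12), (240, 12)]

print(simple_rate(growing))      # per-second rate
print(simple_increase(growing))  # number of requests in the window
print(simple_rate(flat))         # no change means a rate of zero
```

A series whose value does not change inside the window yields zero for both functions, however large the value itself is.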

Both the rate and increase functions should only be used with counters (or, in your case, gauges behaving like counters), since they both assume that if a value is lower than the previous point, the counter was reset, and they “shift” the data points after the reset up. With real gauges that leads to some… shenanigans (I’ve seen that once and it wasn’t pretty :smile:). That is also why we use sum(rate(... and not rate(sum(...: the sum can go up and down (e.g. on pod restarts).
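Here is a small Python sketch of the reset handling and of why the order of sum and rate matters. It is a simplified model under stated assumptions: any drop in value is treated as a restart from zero, the two series pod_a and pod_b and their values are invented, and no extrapolation is done.

```python
def increase_with_resets(values):
    """Total increase of a counter, treating any drop as a reset to zero."""
    total = 0.0
    for prev, cur in zip(values, values[1:]):
        if cur < prev:
            # The counter restarted from 0 and has climbed to `cur` since then.
            total += cur
        else:
            total += cur - prev
    return total


pod_a = [100, 110, 120, 130]
pod_b = [50, 60, 5, 15]  # hypothetical pod restart between the 2nd and 3rd sample

# "sum(increase(...))": handle resets per series, then add the results.
per_series = increase_with_resets(pod_a) + increase_with_resets(pod_b)

# "increase(sum(...))": add the raw values first, then look for resets.
summed_values = [a + b for a, b in zip(pod_a, pod_b)]
summed_first = increase_with_resets(summed_values)

print(per_series)    # 30 from pod_a plus 25 from pod_b
print(summed_first)  # the drop in the sum is mistaken for a reset of the whole total
```

When the values are summed first, the restart of one pod looks like a reset of the combined counter, so the whole remaining total is counted as new increase.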

3. How to connect the query to Grafana’s time picker?

Sadly, I’m not sure. With a plain Prometheus data source, you can use the built-in $__range variable, which resolves to the duration of the time picker, so your query could look like this:

sum(rate(coleta_online_request_count{product_id=~'$product_id', external_bases_name=~'$base_name'}[$__range]))

I’m not sure if it will resolve in your datasource though (I think it should but…).

4. Why do you have 0 values in the result?

I see one reason for that: you didn’t have any requests in that time :smile:. Now that you know how the rate function works, you can look at the second screen you provided (with just the sum) and notice that all the values are 12. So (12 - 12) / 240 == 0, and that’s why. For testing, you could extend the time period to cover data where the values actually increased.

I hope this helps! A little disclaimer: I don’t know how much you already know, so I opted for the full explanation from the ground up. If you already know some or all of this, you can ignore those parts :smile:

Thank you very much for your help! I would also like to express my gratitude to @davidallen5. @dawiddebowski, your explanation was excellent, as you clearly detailed the points and helped me better understand what was happening.

It took me a while to respond because I was running some experiments based on your recommendations. After analyzing your observations and reviewing some literature on the topic, I realized that the best approach was to use counter metrics, as counting events like requests is one of the main purposes of this metric type.

I want to share a detail that might be helpful for others in similar situations. At the beginning of my implementation, I noticed that even though the system was receiving requests, the calculated values sometimes appeared as zero. Upon investigating, I identified that this happened because I was using a label that created a unique value for each request.

This prevented the metrics from being aggregated correctly. Once I removed this label, the calculations started working correctly, and the zero values stopped occurring. The changes I made to fix this issue were the key to the solution, so I believe that the label was the main problem.
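One plausible reading of why the unique label produced zeros can be sketched in Python. This is a toy model, not Prometheus itself: the label name request_id, the request times, and the scrape times are all hypothetical, and the increase is simply last sample minus first sample per series. With a per-request label, every request creates its own series that stays at 1 forever, so no series ever shows an increase.

```python
def scrape_series(requests, label_keys, scrape_times):
    """Return {label_values: {scrape_time: counter_value}} for a counter
    that is incremented once per request and grouped by `label_keys`."""
    series = {}
    for scrape_time in scrape_times:
        counts = {}
        for req in requests:
            if req["t"] <= scrape_time:
                key = tuple(req[k] for k in label_keys)
                counts[key] = counts.get(key, 0) + 1
        for key, value in counts.items():
            series.setdefault(key, {})[scrape_time] = value
    return series


def total_increase(series):
    """Sum of (last sample - first sample) over all series."""
    total = 0
    for samples in series.values():
        values = [samples[t] for t in sorted(samples)]
        if len(values) >= 2:
            total += values[-1] - values[0]
    return total


# Six hypothetical requests at t = 5, 15, ..., 55 seconds.
requests = [
    {"t": 10 * i + 5, "base_name": "Mock 1", "request_id": f"req-{i}"}
    for i in range(6)
]
scrape_times = [0, 15, 30, 45, 60]

shared = total_increase(scrape_series(requests, ["base_name"], scrape_times))
unique = total_increase(
    scrape_series(requests, ["base_name", "request_id"], scrape_times)
)

print(shared)  # one series that keeps growing
print(unique)  # six series that each stay at 1
```

In this model the shared series first appears with the value 2, and those first two requests are not counted as an increase either, because there is no earlier sample to compare against.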

To calculate the number of requests per second received by the system, I used this formula:

sum(rate(coleta_online_request_total{base_name=~'$base_name'}[${__interval}]))

To calculate the total number of requests, I used the following formula:

sum (increase(coleta_online_request_total{base_name=~'.*'}[${__interval}]))

I performed tests to validate this information by comparing the results with data from calls made through K6, and the metrics matched. An additional detail I noticed, which might be relevant for those using Prometheus on GCP, was the need to replace the [$__range] parameter with [${__interval}]. I am still analyzing the impact of this change, but when using the range parameter, my queries failed and returned errors.

I will continue investigating to see if I need to do anything additional because I am using GCP, since, depending on the value of the ${__interval} parameter, the results may differ slightly from the tests performed on K6.
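As a rough way to think about the difference between a per-interval window and a whole-range window, here is an idealized Python sketch. Everything in it is an assumption for illustration: the sample values, the 60-second interval, and the helper names. It ignores the fact that the real increase() only uses samples inside each window and extrapolates to the window edges, which is one possible reason why results can be fractional and differ slightly from an external count.

```python
def value_at(samples, t):
    """Latest counter value at or before time t; samples are sorted (timestamp, value) pairs."""
    latest = samples[0][1]
    for ts, value in samples:
        if ts <= t:
            latest = value
    return latest


def bucketed_increase(samples, start, end, interval):
    """Increase of the counter in each consecutive window of `interval` seconds."""
    buckets = []
    t = start + interval
    while t <= end:
        buckets.append(value_at(samples, t) - value_at(samples, t - interval))
        t += interval
    return buckets


# Hypothetical counter scraped every 15 s, growing by 2 requests per scrape.
samples = [(t, 2 * (t // 15)) for t in range(0, 301, 15)]

buckets = bucketed_increase(samples, start=0, end=300, interval=60)
whole_range = value_at(samples, 300) - value_at(samples, 0)

print(buckets)       # one value per 60-second window
print(sum(buckets))  # adds up to the increase over the full 5 minutes
print(whole_range)
```

In this idealized model the per-interval values add up exactly to the increase over the whole range, so a query windowed by the interval yields one point per step rather than a single total for the selected period.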

Working with metrics is new to me. Once again, thank you for your help and attention to my case. Your recommendations were extremely valuable.
