Grafana help needed with PromQL weirdness

  • What Grafana version and what operating system are you using?
    Grafana 10.2.0 on Rocky Linux 8.8, Prometheus 2.13.1

  • What are you trying to achieve?
    Show graph of successful keep-alive responses

  • How are you trying to achieve it?
    Using a Prometheus datasource to query the data. The query uses the increase() function with a 30s range to determine the change over 30s, as we should receive a keep-alive response every 30s.

  • What happened?
    The raw data, when viewed with Explore, shows a one-period (30s) dip where the counter does not increase. When the increase() function is applied, the result shows zero values for 6 minutes.

  • What did you expect to happen?
    I would expect the data to show a dip for one period (30s) and then return to the normal value.

  • Can you copy/paste the configuration(s) that you are having problems with?
    Raw data query: label_replace(CamelRoutePolicy_seconds_count{routeId=~"sendEchoTo.*",failed="false"}, "legend", "$1", "routeId", "sendEchoTo(.*)")
    Step is set to 30s

Result:


You can see there is a dip at 12:50 in the green series and at 12:50:30 in the yellow series where the value does not increase.

Applying an increase function:
Query:
label_replace(increase(CamelRoutePolicy_seconds_count{routeId=~"sendEchoTo.*",failed="false"}[30s]), "legend", "$1", "routeId", "sendEchoTo(.*)")
Step is set to 30s

Result:

The value in the graph dips for 6 minutes, which is obviously not true based on the previous graph.

  • Did you receive any errors in the Grafana UI or in related logs? If so, please tell us exactly what they were.
    Could not find any errors

I would really appreciate any pointers in the right direction, thanks in advance.

As far as I can see, your expectation does not match what is actually happening.

Prometheus scrapes your application to retrieve the data (or the data is sent with remote write).
But the CamelRoutePolicy_seconds_count metric can still be scraped during that period.
And even if Prometheus were not able to scrape, it would keep returning the last data point for 5 minutes.
Only after that would the series disappear.

But in your case I think your application did not perform the action: the Camel RoutePolicy counter simply did not increase in that time period.

If Prometheus had missed a scrape while the counter increased in the meantime, the next scrape would have registered the new counter value and you would have seen a double increase. That is not the case here.

So the problem is in your application, not in Grafana or Prometheus.

And with increase(), you also have a way to spot these moments where the counter did not increase.
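
For example, a sketch of such a query (reusing the selector from your panel, and assuming a window comfortably larger than the scrape interval, here 5m):

# keep the series only while no keep-alive was counted in the window
increase(CamelRoutePolicy_seconds_count{routeId=~"sendEchoTo.*",failed="false"}[5m]) == 0

This only returns a series while no keep-alive was counted during the last 5 minutes, which could be used as an alert expression for missing keep-alives.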

My problem is that the first graph shows all the data correctly, but the second graph, using the same data with the increase() function applied, indicates no increase for 4 minutes, which is not true according to the raw data in the first graph.

I found a link, New in Grafana 7.2: $__rate_interval for Prometheus rate queries that just work | Grafana Labs, which seems to indicate that the range used should be at least 4 times the scrape interval. According to my understanding, the increase() function needs a couple of scrapes within the range in order to operate.
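
In case it helps anyone else, here is a sketch of the panel query with $__rate_interval substituted for the fixed 30s window (assuming the scrape interval is configured on the Prometheus datasource so Grafana can resolve the variable):

# same query as before, but with a window Grafana sizes to span several scrapes
label_replace(increase(CamelRoutePolicy_seconds_count{routeId=~"sendEchoTo.*",failed="false"}[$__rate_interval]), "legend", "$1", "routeId", "sendEchoTo(.*)")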

So no, there is nothing wrong with the application, just with my understanding of how to use the Prometheus increase/rate/delta functions to get reasonable results.

@nickidw
Sorry, somehow I misunderstood your question; I focused on the dip at 12:50 and the explanation of the dip itself.

increase(v range-vector) calculates the increase in the time series in the range vector. Breaks in monotonicity (such as counter resets due to target restarts) are automatically adjusted for. The increase is extrapolated to cover the full time range as specified in the range vector selector, so that it is possible to get a non-integer result even if a counter increases only by integer increments.

According to the documentation of the increase() function, extrapolation is used as well.
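
As a rough illustration (hypothetical sample values, ignoring the exact boundary handling Prometheus applies): if a 30s window contains only two samples taken 25s apart, with counter values 100 and 101, the raw delta of 1 is scaled up to the full window, roughly 1 * 30/25 = 1.2. And if the window contains fewer than two samples, increase() returns nothing at all.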

Did you try to execute the same query directly in the Prometheus UI?
Then you can see whether the 30s step makes a difference or whether this is due to the way Prometheus calculates the results.

No problem. Unfortunately the server sits in a network where I cannot get to the Prometheus UI, but I'll see what I can do.

In the meantime I have adjusted the range to 5m, which has smoothed out the graph a lot and also got rid of the false alerts; plus, the alert did trigger last night when there was a real dip.
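
For reference, the adjusted query is essentially the same as before, just with the range widened from 30s to 5m (the rest of the query unchanged):

label_replace(increase(CamelRoutePolicy_seconds_count{routeId=~"sendEchoTo.*",failed="false"}[5m]), "legend", "$1", "routeId", "sendEchoTo(.*)")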

Thanks for your input.
