Rate query failing only on Grafana

aranshavit · April 1, 2024, 4:46pm

Hello there
I have Grafana deployed to my k8s cluster as part of the kube-prometheus-stack Helm chart.
It is connected to my Thanos querier as its main datasource (which is connected to various Thanos sidecars).

One of our performance engineers brought an issue to my attention that shows up specifically in Grafana, with the following query (note that it uses custom metrics from our apps):
sum (irate(starlord_http_requests_total{container="starlord-cyber-feed",namespace="app", cluster="qa-1"}[1m])) by (cluster)

The problem is:
On the local Prometheus UI, or on the Thanos Querier UI, this query works with no problems at all.
But in Grafana (in a dashboard, or in Explore), as soon as we increase the time range to more than 12h, the graph flattens down to 0…
Since the query works fine on both Prometheus and Thanos Querier, I am led to believe the issue must be with Grafana
(Thanos Querier is its datasource, so why would it give a different response?)

Some example screenshots:
Here is the query set to 1 hour, in both Grafana and Thanos querier, looks all good:

Now, here it is in both, set to 24 hours:

I’ve tried debugging this and haven’t found much. Here is what I tried:

Tried using “rate” instead of “irate”, same issue
Tried changing the datasource’s “scrape interval” to 30s (from the default 15s), same issue
Tried updating Prometheus, Grafana, and Thanos to the latest versions

The only lead I found is this log line, with HTTP status 400, matching my query:
logger=context userId=3 orgId=1 uname=<my-email> t=2024-04-01T16:07:47.268619112Z level=info msg="Request Completed" method=POST path=/api/ds/query status=400 remote_addr=10.2.1.129 time_ms=16 duration=16.278802ms size=13513 referer="https://<my-domain>/explore?orgId=1&panes=%7B%22r5m%22%3A%7B%22datasource%22%3A%22P5DCFC7561CCDE821%22%2C%22queries%22%3A%5B%7B%22refId%22%3A%22A%22%2C%22expr%22%3A%22sum+%28rate%28starlord_http_requests_total%7Bcontainer%3D%5C%22starlord-cyber-feed%5C%22%2Cnamespace%3D%5C%22app%5C%22%2C+cluster%3D%5C%22qa-1%5C%22%7D%5B1m%5D%29%29+by+%28cluster%29%22%2C%22range%22%3Atrue%2C%22instant%22%3Atrue%2C%22datasource%22%3A%7B%22type%22%3A%22prometheus%22%2C%22uid%22%3A%22P5DCFC7561CCDE821%22%7D%2C%22editorMode%22%3A%22code%22%2C%22legendFormat%22%3A%22__auto%22%7D%5D%2C%22range%22%3A%7B%22from%22%3A%22now-24h%22%2C%22to%22%3A%22now%22%7D%7D%7D&schemaVersion=1" handler=/api/ds/query status_source=downstream

So, perhaps Grafana is sending bad requests to Thanos?
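
For readers following along: the `panes` parameter in the referer of that log line is URL-encoded JSON, so it can be decoded to see what Explore was asked to run. Below is a minimal Python sketch using only the standard library. The variable names are illustrative, and the encoded string is copied from the log line above.

```python
import json
from urllib.parse import unquote_plus

# The value of the "panes" query parameter, copied from the referer in the log line.
ENCODED_PANES = (
    "%7B%22r5m%22%3A%7B%22datasource%22%3A%22P5DCFC7561CCDE821%22%2C"
    "%22queries%22%3A%5B%7B%22refId%22%3A%22A%22%2C"
    "%22expr%22%3A%22sum+%28rate%28starlord_http_requests_total"
    "%7Bcontainer%3D%5C%22starlord-cyber-feed%5C%22%2C"
    "namespace%3D%5C%22app%5C%22%2C+cluster%3D%5C%22qa-1%5C%22%7D"
    "%5B1m%5D%29%29+by+%28cluster%29%22%2C"
    "%22range%22%3Atrue%2C%22instant%22%3Atrue%2C"
    "%22datasource%22%3A%7B%22type%22%3A%22prometheus%22%2C"
    "%22uid%22%3A%22P5DCFC7561CCDE821%22%7D%2C"
    "%22editorMode%22%3A%22code%22%2C%22legendFormat%22%3A%22__auto%22%7D%5D%2C"
    "%22range%22%3A%7B%22from%22%3A%22now-24h%22%2C%22to%22%3A%22now%22%7D%7D%7D"
)

# unquote_plus turns "+" into spaces and %XX escapes into characters.
panes = json.loads(unquote_plus(ENCODED_PANES))
pane = panes["r5m"]
query = pane["queries"][0]

print("expr:   ", query["expr"])
print("range:  ", query["range"])
print("instant:", query["instant"])
print("from/to:", pane["range"]["from"], pane["range"]["to"])
```

Decoded, the pane carries the same expression with both `range` and `instant` enabled over `now-24h` to `now`. The log line alone does not show which of those requests produced the 400.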

jangaraj · April 1, 2024, 5:10pm

I think that Step: auto is the problem. Try hardcoding it to the value Thanos uses, i.e. 345s for a 24h range.
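
The 345s figure matches what you get if the graph UI aims for roughly 250 points across the selected range. The sketch below reproduces that arithmetic in Python. The divisor of 250 is an assumption based on how the Prometheus-style graph UI picks its default resolution, and the function name is hypothetical.

```python
def ui_step_seconds(range_seconds: int, points: int = 250) -> int:
    """Step (in seconds) that yields about `points` samples over the range.

    Assumption: the UI floors range/points and never goes below 1 second.
    """
    return max(range_seconds // points, 1)


for label, seconds in [("1h", 3600), ("12h", 12 * 3600), ("24h", 24 * 3600)]:
    print(f"{label}: step={ui_step_seconds(seconds)}s")
```

For a 24h range this gives 86400 // 250 = 345, which is the value suggested above.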

aranshavit · April 1, 2024, 5:44pm

Nice catch, thanks! Can’t say I understand it, but it does seem to produce a usable graph with this change.

Not sure that’s a solution though, as hardcoding it breaks the query for other ranges… Any suggestions on what I should do? Or is this indeed a Grafana bug that needs fixing?

jangaraj · April 1, 2024, 5:59pm

That auto value is “magic” that depends on the selected dashboard time range. Apparently it doesn’t play very well with Thanos and wider time ranges. (Maybe you can tweak Thanos to be friendlier to Grafana, I don’t know. Check why Thanos returns 400, so you will know the exact root cause.)
I guess Thanos also has its own, different “magic” for calculating the resolution.
I would use your own time variable (only in the dashboard, not in Explore), where you can customize how granular the time aggregation should be, e.g. 10, 100, 1000, … points per graph. Play with it until you get the desired results for any time range.
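
One way to check why Thanos returns 400 is to send the same range query straight to the querier's Prometheus-compatible `/api/v1/query_range` endpoint and read the error body. The sketch below is a rough Python example under stated assumptions: `THANOS_URL` is a hypothetical address, the helper names are made up for illustration, and the step is the value from earlier in the thread.

```python
import time
import urllib.error
import urllib.parse
import urllib.request

THANOS_URL = "http://thanos-querier:9090"  # hypothetical address, adjust to your setup

QUERY = (
    'sum by (cluster) (rate(starlord_http_requests_total'
    '{container="starlord-cyber-feed",namespace="app",cluster="qa-1"}[1m]))'
)


def build_query_range_url(base_url, query, start, end, step):
    """Build a query_range URL; start/end are Unix seconds, step is in seconds."""
    params = urllib.parse.urlencode(
        {"query": query, "start": start, "end": end, "step": step}
    )
    return f"{base_url}/api/v1/query_range?{params}"


def fetch(url, timeout=30):
    """Return (HTTP status, raw body) and keep the body of error responses."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read().decode()
    except urllib.error.HTTPError as err:
        # On a 4xx/5xx the body usually explains what was wrong with the request.
        return err.code, err.read().decode()


def diagnose(step_seconds=345, range_seconds=24 * 3600):
    """Run the thread's query over the last `range_seconds` and print the result."""
    end = int(time.time())
    url = build_query_range_url(
        THANOS_URL, QUERY, end - range_seconds, end, step_seconds
    )
    status, body = fetch(url)
    print(status)
    print(body[:500])
```

Calling `diagnose()` with different `step_seconds` values and comparing the responses should show whether the step is what triggers the error.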

aleksanderdushku · November 8, 2024, 11:00am

Having encountered a similar issue, I found that the auto step option works best with $__rate_interval.
All you need to do is adjust the scrape interval in the datasource settings so it matches the interval at which your data is actually scraped. By default, the rate interval will be at least 4 times the scrape interval.
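
To make the relationship concrete: Grafana documents `$__rate_interval` as the larger of `$__interval` plus the scrape interval and 4 times the scrape interval. The Python sketch below only mirrors that formula; the function name is hypothetical and the sample values are illustrative.

```python
def rate_interval_seconds(interval_s: float, scrape_interval_s: float) -> float:
    """Mirror of the documented formula:
    max($__interval + scrape interval, 4 * scrape interval)."""
    return max(interval_s + scrape_interval_s, 4 * scrape_interval_s)


# The query from this thread, with the fixed [1m] window replaced by the variable.
QUERY = (
    'sum by (cluster) (rate(starlord_http_requests_total'
    '{container="starlord-cyber-feed",namespace="app",cluster="qa-1"}'
    '[$__rate_interval]))'
)

for interval_s, scrape_s in [(15, 15), (60, 30), (345, 30)]:
    window = rate_interval_seconds(interval_s, scrape_s)
    print(f"$__interval={interval_s}s scrape={scrape_s}s -> window={window}s")
```

With a wide time range the step grows, and the rate window grows with it, so every step still has samples to compute a rate from.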
