Rate query failing only on Grafana

Hello there
I have Grafana deployed to my k8s cluster as part of the kube-prometheus-stack Helm chart.
It is connected to my Thanos querier as its main datasource (which is connected to various Thanos sidecars).

One of our performance engineers has brought to my attention an issue that shows up in Grafana specifically, with the following query (note this is using custom metrics from our apps):
sum (irate(starlord_http_requests_total{container="starlord-cyber-feed",namespace="app", cluster="qa-1"}[1m])) by (cluster)

The problem is:
In the local Prometheus UI, or in the Thanos Querier UI, running this query works with no problems at all.
But in Grafana (as part of a dashboard, or in Explore), as soon as we increase the time range to >12h, the graph flattens down to 0…
Now, since the query works just fine in both Prometheus and Thanos Querier, I am led to believe the issue here must be with Grafana
(as Thanos Querier is its datasource, so why would it provide a different response?)

Some example screenshots:
Here is the query set to 1 hour, in both Grafana and Thanos querier, looks all good:

Now, here it is in both, set to 24 hours:

I’ve tried debugging this and haven’t found much. What I did try:

  • Tried using “rate” instead of “irate”, same issue
  • Tried changing the datasource’s “scrape interval” to 30s (from the default 15s), same issue
  • Tried updating Prometheus+Grafana+Thanos to the latest versions

The only lead I did find is this log line, with HTTP status 400, matching my query:
logger=context userId=3 orgId=1 uname=<my-email> t=2024-04-01T16:07:47.268619112Z level=info msg=“Request Completed” method=POST path=/api/ds/query status=400 remote_addr= time_ms=16 duration=16.278802ms size=13513 referer=“https://<my-domain>/explore?orgId=1&panes=%7B%22r5m%22%3A%7B%22datasource%22%3A%22P5DCFC7561CCDE821%22%2C%22queries%22%3A%5B%7B%22refId%22%3A%22A%22%2C%22expr%22%3A%22sum+%28rate%28starlord_http_requests_total%7Bcontainer%3D%5C%22starlord-cyber-feed%5C%22%2Cnamespace%3D%5C%22app%5C%22%2C+cluster%3D%5C%22qa-1%5C%22%7D%5B1m%5D%29%29+by+%28cluster%29%22%2C%22range%22%3Atrue%2C%22instant%22%3Atrue%2C%22datasource%22%3A%7B%22type%22%3A%22prometheus%22%2C%22uid%22%3A%22P5DCFC7561CCDE821%22%7D%2C%22editorMode%22%3A%22code%22%2C%22legendFormat%22%3A%22__auto%22%7D%5D%2C%22range%22%3A%7B%22from%22%3A%22now-24h%22%2C%22to%22%3A%22now%22%7D%7D%7D&schemaVersion=1” handler=/api/ds/query status_source=downstream

So, perhaps Grafana is sending bad requests to Thanos?
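
One way to check this would be to send the same range query straight to the Thanos Querier's Prometheus-compatible HTTP API and read the error body of any 400. Below is a minimal sketch, assuming the querier is reachable over plain HTTP; the `THANOS_URL` address and the helper names are hypothetical, and the step values to try are whatever Grafana and the Thanos UI each end up using.

```python
import json
import urllib.error
import urllib.parse
import urllib.request

# Hypothetical address: replace with your own Thanos Querier endpoint.
THANOS_URL = "http://thanos-querier.example:9090"

QUERY = (
    'sum (rate(starlord_http_requests_total{container="starlord-cyber-feed",'
    'namespace="app", cluster="qa-1"}[1m])) by (cluster)'
)


def build_query_range_url(base_url, query, start, end, step):
    """Build a /api/v1/query_range URL (start/end in Unix seconds, step in seconds)."""
    params = urllib.parse.urlencode(
        {"query": query, "start": start, "end": end, "step": step}
    )
    return f"{base_url}/api/v1/query_range?{params}"


def run_query(url):
    """Return (HTTP status, decoded JSON body), also for error responses."""
    try:
        with urllib.request.urlopen(url, timeout=30) as resp:
            return resp.status, json.load(resp)
    except urllib.error.HTTPError as err:
        # Assumes the error body is JSON; on a 4xx it normally carries
        # "errorType" and "error" fields describing what was rejected.
        return err.code, json.loads(err.read().decode("utf-8"))


# Example usage (needs network access to the querier):
#   import time
#   end = int(time.time())
#   url = build_query_range_url(THANOS_URL, QUERY, end - 24 * 3600, end, 345)
#   status, body = run_query(url)
#   print(status, body.get("errorType"), body.get("error"))
```

Running it once with the step that works and once with the step Grafana sends should show whether the 400 comes from the query parameters themselves.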

I think that Step: auto is the problem. Try hardcoding it to the same value Thanos uses, so 345s for a 24h range.
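
For reference, the 345s figure is consistent with the UI dividing the time range into roughly 250 points. The sketch below assumes that rule (which, as far as I know, is what the Prometheus UI does and the Thanos UI inherits); the function name is made up for illustration.

```python
def ui_default_step_seconds(range_seconds):
    """Default step, assuming the UI targets about 250 points per graph.

    This mirrors the assumed rule max(floor(range / 250), 1); it is not
    taken from the Thanos documentation.
    """
    return max(range_seconds // 250, 1)


for hours in (1, 12, 24):
    step = ui_default_step_seconds(hours * 3600)
    print(f"{hours}h range -> step {step}s")
```

For a 24h range this gives 86400 // 250 = 345, which matches the value above.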


Nice catch, thanks! Can’t say I understand it, but it does indeed seem to produce a usable graph with this change.

Not sure if that’s a solution though, as hardcoding it breaks the query for other ranges… Any suggestions on what I should do? Or is this indeed a Grafana bug that needs fixing?

That auto value is “magic” that depends on the selected dashboard time range. But apparently it doesn’t play very well with Thanos and wider time ranges. (Maybe you can tweak Thanos to be more friendly to Grafana, dunno. Check why Thanos returns 400, so you will know the exact root cause.)
I guess Thanos also has its own, different “magic” for calculating the resolution.
I would use a custom time interval variable (but only in the dashboard, not in Explore), where you can customize how granular the time aggregation should be, e.g. 10, 100, 1000, … points per graph. Play with that until you get the desired results for any time range.
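
To get a feel for the numbers, here is a rough sketch of how a "points per graph" choice maps to a step, and how much of each step a fixed [1m] rate window actually looks at. Both helper names are hypothetical, and the coverage figure is only a back-of-the-envelope ratio, not something Grafana or Thanos reports.

```python
def step_for_points(range_seconds, points):
    """Step in seconds needed to draw roughly `points` samples over the range."""
    return max(range_seconds // points, 1)


def window_coverage(step_seconds, window_seconds=60):
    """Fraction of each step that a fixed rate window (default [1m]) covers."""
    return min(window_seconds / step_seconds, 1.0)


day = 24 * 3600
for points in (10, 100, 1000):
    step = step_for_points(day, points)
    print(f"{points} points over 24h -> step {step}s, "
          f"[1m] window covers {window_coverage(step):.0%} of each step")
```

With few points per graph the step becomes much larger than the 1m window, so each plotted value only reflects a small slice of the data between two points.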