Hello there
I have Grafana deployed to my k8s cluster as part of the kube-prometheus-stack Helm chart.
It is connected to my Thanos querier as its main datasource (which is connected to various Thanos sidecars).
One of our performance engineers brought to my attention an issue that shows up in Grafana specifically, with the following query (note this uses custom metrics from our apps):
sum (irate(starlord_http_requests_total{container="starlord-cyber-feed",namespace="app", cluster="qa-1"}[1m])) by (cluster)
The problem is:
On the local Prometheus UI, or on the Thanos Querier UI, running this query works with no problems at all.
But in Grafana (as part of a dashboard, or in Explore), as soon as we increase the time range to >12h, the graph flattens down to 0…
Now, since the query works just fine on both Prometheus and Thanos Querier, I'm left to believe the issue must be with Grafana
(Thanos Querier is its datasource, so why would it return a different response?).
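For reference, here is a minimal sketch of hitting the querier's query_range API directly over the same 24h window, to compare the raw response with what Grafana renders (the querier URL and the 60s step below are placeholders for my setup, not necessarily what Grafana actually sends):

```python
import json
import time
import urllib.parse
import urllib.request

# Query Thanos Querier's Prometheus-compatible range API directly,
# using the same expression and a 24h window like the Grafana panel.
# NOTE: the URL and step are assumptions -- adjust for your environment.
THANOS_URL = "http://thanos-querier:9090/api/v1/query_range"
QUERY = (
    'sum(irate(starlord_http_requests_total{container="starlord-cyber-feed",'
    'namespace="app",cluster="qa-1"}[1m])) by (cluster)'
)

end = int(time.time())
start = end - 24 * 3600  # the 24h window where the Grafana graph flattens to 0
params = urllib.parse.urlencode({
    "query": QUERY,
    "start": start,
    "end": end,
    "step": 60,  # try whatever step Grafana's query inspector reports
})

with urllib.request.urlopen(f"{THANOS_URL}?{params}") as resp:
    data = json.load(resp)

print(data["status"])
for series in data["data"]["result"]:
    print(series["metric"], len(series["values"]), "samples")
```

If the raw response for the full day looks fine here (as the Thanos UI suggests it does), that points at how Grafana builds its /api/ds/query request rather than at the data itself.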
Some example screenshots:
Here is the query set to 1 hour, in both Grafana and Thanos querier, looks all good:
Now, here it is in both, set to 24 hours:
I've tried debugging this and haven't found much. What I did try:
- Tried using "rate" instead of "irate", same issue
- Tried changing the datasource's "scrape interval" to 30s (from the default 15s), same issue
- Tried updating Prometheus + Grafana + Thanos to the latest versions
The only lead I did find is this log line, with HTTP status 400, matching my query:
logger=context userId=3 orgId=1 uname=<my-email> t=2024-04-01T16:07:47.268619112Z level=info msg="Request Completed" method=POST path=/api/ds/query status=400 remote_addr=10.2.1.129 time_ms=16 duration=16.278802ms size=13513 referer="https://<my-domain>/explore?orgId=1&panes=%7B%22r5m%22%3A%7B%22datasource%22%3A%22P5DCFC7561CCDE821%22%2C%22queries%22%3A%5B%7B%22refId%22%3A%22A%22%2C%22expr%22%3A%22sum+%28rate%28starlord_http_requests_total%7Bcontainer%3D%5C%22starlord-cyber-feed%5C%22%2Cnamespace%3D%5C%22app%5C%22%2C+cluster%3D%5C%22qa-1%5C%22%7D%5B1m%5D%29%29+by+%28cluster%29%22%2C%22range%22%3Atrue%2C%22instant%22%3Atrue%2C%22datasource%22%3A%7B%22type%22%3A%22prometheus%22%2C%22uid%22%3A%22P5DCFC7561CCDE821%22%7D%2C%22editorMode%22%3A%22code%22%2C%22legendFormat%22%3A%22__auto%22%7D%5D%2C%22range%22%3A%7B%22from%22%3A%22now-24h%22%2C%22to%22%3A%22now%22%7D%7D%7D&schemaVersion=1" handler=/api/ds/query status_source=downstream
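Decoding the referer from that log line shows exactly what query Grafana built for /api/ds/query. A quick sketch using only the Python standard library (the referer string is copied verbatim from the log above):

```python
import json
import urllib.parse

# Decode the "panes" parameter from the referer in the 400 log line,
# to see the query object Grafana sent to /api/ds/query.
referer = (
    "https://<my-domain>/explore?orgId=1&panes=%7B%22r5m%22%3A%7B%22datasource%22"
    "%3A%22P5DCFC7561CCDE821%22%2C%22queries%22%3A%5B%7B%22refId%22%3A%22A%22%2C"
    "%22expr%22%3A%22sum+%28rate%28starlord_http_requests_total%7Bcontainer%3D%5C"
    "%22starlord-cyber-feed%5C%22%2Cnamespace%3D%5C%22app%5C%22%2C+cluster%3D%5C"
    "%22qa-1%5C%22%7D%5B1m%5D%29%29+by+%28cluster%29%22%2C%22range%22%3Atrue%2C"
    "%22instant%22%3Atrue%2C%22datasource%22%3A%7B%22type%22%3A%22prometheus%22"
    "%2C%22uid%22%3A%22P5DCFC7561CCDE821%22%7D%2C%22editorMode%22%3A%22code%22"
    "%2C%22legendFormat%22%3A%22__auto%22%7D%5D%2C%22range%22%3A%7B%22from%22"
    "%3A%22now-24h%22%2C%22to%22%3A%22now%22%7D%7D%7D&schemaVersion=1"
)
panes = urllib.parse.parse_qs(urllib.parse.urlparse(referer).query)["panes"][0]
print(json.dumps(json.loads(panes), indent=2))
```

Decoded, it shows the Explore pane runs the expression as both a range and an instant query ("range": true, "instant": true) over a now-24h window, and the 400 (status_source=downstream) is returned for that request.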
So, perhaps Grafana is sending bad requests to Thanos?