Problem Description:
I have a complex Prometheus query that runs successfully and quickly in Thanos but consistently times out in Grafana on large time ranges such as 120 days. The query is as follows:
avg(sum_over_time(
(sum by(container)(kube_pod_container_status_running{namespace=~"keycloak|dev", container=~"keycloak|test-server|test-web"}) >= bool 1)[$__range:]
) / count_over_time(
(sum by(container)(kube_pod_container_status_running{namespace=~"keycloak|dev", container=~"keycloak|test-server|test-web"}))[$__range:]
) * 100)
With this query I want to get the global availability of my applications.
Observations:
- The query works perfectly in Thanos without timeouts.
- Grafana still times out despite increasing the timeout settings.
- The memory usage of thanos-storegateway is high, but it’s managed within the increased resource limits.
Why does the query time out in Grafana but not in Thanos?
Are there other settings in Grafana or Thanos that I should adjust?
Or is there a way to simplify the query?
I would be really thankful if anyone has any ideas to help me.
Hello @delaramhamraz welcome to the forum,
Did you try to increase the Grafana timeout in grafana.ini or directly in the data source settings?
Hello, thank you.
Yes, I did, but I have the same problem. And even if the timeout change did work, I would still have the problem of waiting too long for the data to load. Since I want to present this dashboard to other people, I can’t have them wait that long.
Yeah, that makes sense haha.
I don’t really have a solution right now but some ideas…
First of all, though I assume you have already checked this, make sure your servers (are you using multiple servers?) have sufficient resources (RAM or CPU cores, for example).
Your query looks very complex, so I’d say try to simplify it as much as possible.
I don’t know if that’s possible, but maybe you can try to make your Thanos service handle big queries in a different way, which would help.
Along the same lines, you could build multiple tables filled with the results of some intermediate queries, and then build a simplified query on top of them?
Those are only ideas. I’ll take a better look in two days, but I can’t promise a solution!
Best regards
I would optimise default query options, e.g.:
Thanks for the ideas,
I would very much like to simplify the query, but I don’t really know how. I just want to show the global availability of the applications, and it wasn’t me who wrote the query, so I’m not sure I really understand it.
Thanks for the response. I tried a min interval of 1 day but I still have the same problem. Just to try, I set it to 5d and it worked, but doesn’t that mean that I am losing a lot of data points? And is the percentage that I am showing on the Gauge panel untrustworthy?
That doesn’t sound very good: you are creating a time series with an expensive query when you just need a single number for a gauge panel.
I would say you can use an even longer period there. In theory, 120d.
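To make the cost argument concrete, here is a rough back-of-the-envelope sketch in Python. It assumes the subquery `[$__range:]` falls back to a default resolution of 1 minute (that default comes from the Prometheus/Thanos evaluation interval setting and may differ in your setup), and that Grafana uses the min interval directly as the query step. The helper names are made up for illustration, and the numbers ignore how many series each evaluation touches.

```python
from datetime import timedelta

def subquery_points(range_: timedelta, resolution: timedelta) -> int:
    """Roughly how many samples one evaluation of (...)[range:resolution] produces."""
    return int(range_ / resolution)

def outer_steps(range_: timedelta, step: timedelta) -> int:
    """Number of evaluation steps of a range query over range_, both ends included."""
    return int(range_ / step) + 1

dashboard_range = timedelta(days=120)
resolution = timedelta(minutes=1)  # ASSUMPTION: default subquery resolution of 1m

# Every outer step re-evaluates the whole 120d subquery.
per_eval = subquery_points(dashboard_range, resolution)
for label, step in [("1d", timedelta(days=1)), ("5d", timedelta(days=5))]:
    steps = outer_steps(dashboard_range, step)
    print(f"min interval {label}: {steps} steps x {per_eval} points = {steps * per_eval}")

# An instant query evaluates the subquery only once, at the end of the range.
print(f"instant query: 1 step x {per_eval} points = {per_eval}")
```

Under these assumptions, a larger min interval only reduces how often the full 120-day subquery is re-run; the resolution inside each evaluation stays the same. That would suggest the last value shown on the gauge is not less precise at 5d, but this is worth checking against your own evaluation interval settings.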
Do you think it’s necessary to divide the sum of my running pods by the number of times that Prometheus looked up the metric? I can’t quite understand the division here.
I don’t understand your query/metric logic, so I don’t know.
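For readers wondering about the same division, here is a minimal Python sketch of what the query appears to compute, based only on reading the expression in the first post. The sample values and the helper name are invented for illustration. Each list stands for the values of `sum by(container)(kube_pod_container_status_running{...})` at successive subquery points, i.e. the number of running pods per container at each point.

```python
def availability_percent(running_counts):
    """Mimic sum_over_time((x >= bool 1)[r:]) / count_over_time(x[r:]) * 100 for one container."""
    # `>= bool 1` turns every sample into 1 (at least one pod running) or 0.
    up = sum(1 for c in running_counts if c >= 1)
    # count_over_time: how many samples exist in the range.
    total = len(running_counts)
    return up / total * 100

# Invented sample data: number of running pods at four points in time.
samples = {
    "keycloak": [2, 2, 0, 2],      # down at one of four points
    "test-server": [1, 1, 1, 1],   # always up
}

per_container = {name: availability_percent(v) for name, v in samples.items()}
overall = sum(per_container.values()) / len(per_container)  # the outer avg(...)

print(per_container)
print(overall)
```

Read this way, the numerator is not the sum of running pods but the number of points at which at least one pod was running, and dividing by the number of points turns that count into a fraction of time. Without the division the result would just grow with the length of the range.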