Request canceled (Client.Timeout exceeded while awaiting headers)

delaramhamraz · July 16, 2024, 2:47pm

Problem Description:

I have a complex Prometheus query that runs successfully and quickly in Thanos but consistently times out in Grafana over a large time range such as 120 days. The query is as follows:

avg(sum_over_time(
  (sum by(container)(kube_pod_container_status_running{namespace=~"keycloak|dev", container=~"keycloak|test-server|test-web"}) >= bool 1)[$__range:]
) / count_over_time(
  (sum by(container)(kube_pod_container_status_running{namespace=~"keycloak|dev", container=~"keycloak|test-server|test-web"}))[$__range:]
) * 100)

With this query I want to get the global availability of my applications.

Observations:

The query works perfectly in Thanos without timeouts.
Grafana still times out despite increasing the timeout settings.
The memory usage of thanos-storegateway is high, but it’s managed within the increased resource limits.

Why does the query time out in Grafana but not in Thanos?
Are there other settings in Grafana or Thanos that I should adjust?
Or is there a way to simplify the query?

I would be really thankful if anyone has any ideas that could help me.

codi639 · July 16, 2024, 3:22pm

Hello @delaramhamraz welcome to the forum,

Did you try to increase the Grafana timeout in the grafana.ini or directly in the datasource settings?

delaramhamraz · July 16, 2024, 3:36pm

Hello, thank you.
Yes, I did, but I have the same problem. Even if the higher timeout did work, I would still have the problem of waiting too long for the data to load. Since I want to present this dashboard to other people, I can’t have them wait that long.

codi639 · July 16, 2024, 8:59pm

Yeah, that makes sense haha.

I don’t really have a solution right now but some ideas…
First of all, though I assume you have already checked this, make sure your servers (are you using multiple servers?) have sufficient resources (RAM or CPU cores, for example).

Your query looks very complex so, I’d say, try to simplify it as much as possible.

I don’t know if that’s possible, but maybe you can try to make your Thanos service handle big queries in a different way, which would help.

Along the same lines, you could try to build multiple tables, filled with the results of some intermediate queries, and then build a simplified query on top of them?

Those are only ideas. I’ll take a better look in two days, but I can’t promise a solution!

Best regards

jangaraj · July 16, 2024, 9:21pm

I would optimise default query options, e.g.:

delaramhamraz · July 17, 2024, 10:24am

Thanks for the ideas,

I would very much like to simplify the query, but I don’t really know how. I just want to show the global availability of the applications, and it wasn’t me who wrote the query, so I’m not sure I really understand it.

delaramhamraz · July 17, 2024, 10:26am

Thanks for the response. I tried a min interval of 1 day but I still have the same problem. Just to try, I set it to 5d and it worked, but doesn’t that mean I am losing a lot of data points, and that the percentage I am showing on the Gauge panel is untrustworthy?

jangaraj · July 17, 2024, 10:30am

That doesn’t sound very good: you are creating a time series with an expensive query when you just need a single number for a gauge panel.
I would say you can use an even longer period there. In theory 120d.
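To put rough numbers on the point above, here is a minimal Python sketch of how much work the query implies at different min intervals. It assumes the subquery `[$__range:]`, which gives no explicit resolution, falls back to the querier's default evaluation interval, and it assumes that default is 1 minute; check your own Thanos/Prometheus setting. The function names are hypothetical. Under these assumptions the min interval only changes how many times the whole expression is evaluated, not how many samples each evaluation covers, so a single evaluation at the end of the range should give the same gauge value.

```python
from datetime import timedelta

RANGE = timedelta(days=120)                 # dashboard range, i.e. $__range
ASSUMED_RESOLUTION = timedelta(minutes=1)   # assumption: querier default interval
SUBQUERIES_IN_EXPRESSION = 2                # sum_over_time(...) and count_over_time(...)


def outer_evaluations(dashboard_range, step):
    """Points in a range query: one per step, both ends included."""
    return dashboard_range // step + 1


def inner_steps(subquery_range, resolution):
    """Steps one subquery walks through for a single outer evaluation."""
    return subquery_range // resolution


def total_inner_steps(step):
    """Inner steps per container series; step=None models one instant query."""
    evaluations = 1 if step is None else outer_evaluations(RANGE, step)
    per_evaluation = SUBQUERIES_IN_EXPRESSION * inner_steps(RANGE, ASSUMED_RESOLUTION)
    return evaluations * per_evaluation


if __name__ == "__main__":
    for label, step in [("min interval 1d", timedelta(days=1)),
                        ("min interval 5d", timedelta(days=5)),
                        ("single instant evaluation", None)]:
        print(f"{label}: {total_inner_steps(step):,} inner steps per series")
```

If the panel only needs one number, running the query as an instant query (or with a step as long as the range) keeps the cost at the last line of that output.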

delaramhamraz · July 17, 2024, 10:43am

Do you think it’s necessary to divide the sum of my running pods by the number of times that Prometheus evaluated the metric? I can’t quite understand the division here.

jangaraj · July 17, 2024, 10:55am

I don’t understand your query/metric logic, so I don’t know.
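For anyone reading along, the division can be worked through on toy data. This is a minimal Python sketch with made-up sample values and hypothetical function names. It shows that `sum_over_time` of the `>= bool 1` result (a 0 or 1 at each step) divided by `count_over_time` of the same series is the fraction of steps with at least one running pod, in other words a mean of the 0/1 flags. If that reading is right, the expression could presumably be written with a single `avg_over_time` subquery, which is included below as an untested suggestion.

```python
# Made-up data: running pods per container at four evaluation steps.
SAMPLES = {
    "keycloak": [2, 2, 0, 1],
    "test-server": [1, 1, 1, 1],
}


def availability_percent(running_pods_per_step):
    """Mirror sum_over_time(x >= bool 1) / count_over_time(x) * 100."""
    up_flags = [1 if pods >= 1 else 0 for pods in running_pods_per_step]
    return sum(up_flags) / len(running_pods_per_step) * 100


def global_availability(samples_by_container):
    """Mirror the outer avg(...): mean of the per-container percentages."""
    per_container = [availability_percent(s) for s in samples_by_container.values()]
    return sum(per_container) / len(per_container)


# Untested suggestion: the same ratio expressed as one subquery instead of two.
SIMPLIFIED_QUERY = (
    'avg(avg_over_time((sum by(container)(kube_pod_container_status_running'
    '{namespace=~"keycloak|dev", container=~"keycloak|test-server|test-web"})'
    ' >= bool 1)[$__range:])) * 100'
)

if __name__ == "__main__":
    for name, series in SAMPLES.items():
        print(name, availability_percent(series))
    print("global", global_availability(SAMPLES))
```

So the division is what turns a raw count of "up" steps into a percentage; without it the value would grow with the length of the time range.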

saravahdatipour · April 2, 2025, 9:13am

@delaramhamraz
Hi,
did you end up figuring out what the issue was?
We’re having the exact same problem. Our Loki and Grafana are externally managed by our cloud provider, and the support team can’t find anything wrong with them, yet we’re experiencing these timeouts. Sometimes the queries don’t time out, but it’s just very inconsistent and unreliable.
