Error with alerting from Mimir datasource

Hi everyone,

I’m having some trouble with alerting on a Mimir datasource. Sometimes, seemingly at random, I get this error:

	[sse.dataQueryError] failed to execute query [A]: Post "http://mimir-nginx.observability.svc.cluster.local/prometheus/api/v1/query": net/http: request canceled (Client.Timeout exceeded while awaiting headers)

This triggers a notification, which then resolves itself after five minutes.

I didn’t notice any particular errors on the Mimir side. What could it be?

Can I post anything else that might help diagnose this?

Thanks

  • Grafana 12.1.1 on Kubernetes; the datasource is Grafana Mimir in the same cluster

  • Simple alerting on some metrics

  • Notification fires due to datasource error

  • No datasource errors; notifications should fire only when the status changes

  • The error is the same in the UI and the logs: [sse.dataQueryError] failed to execute query [A]: Post "http://mimir-nginx.observability.svc.cluster.local/prometheus/api/v1/query": net/http: request canceled (Client.Timeout exceeded while awaiting headers)

This looks network-related. You appear to be accessing Mimir through a Kubernetes service; if there are many pods behind that service, it could be forwarding traffic to an unavailable or unhealthy node. It could also be caused by high load on the Mimir cluster.

  1. Check the K8s events around the time of the error
  2. Experiment with these Grafana settings:
    1. [unified_alerting].evaluation_timeout, default 30s: increase it to 60s
    2. Under [dataproxy], set timeout = 60. This applies to backend HTTP requests
    3. [unified_alerting].max_attempts, default 3: try increasing it to 5
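
For reference, the settings from step 2 could look roughly like this in grafana.ini. This is only a sketch using the values suggested above; the defaults in the comments are the ones quoted in the list, and the right numbers for your cluster are something you would need to find by testing.

```ini
[unified_alerting]
# Maximum time a single rule evaluation may take (default quoted above: 30s)
evaluation_timeout = 60s
# How many times an evaluation is attempted before it is treated as failed
# (default quoted above: 3)
max_attempts = 5

[dataproxy]
# Timeout, in seconds, for backend HTTP requests to the datasource
timeout = 60
```

On Kubernetes, Grafana settings can usually also be passed as environment variables following the GF_<SECTION>_<KEY> pattern (for example GF_UNIFIED_ALERTING_EVALUATION_TIMEOUT), assuming your deployment does not already manage grafana.ini another way. Grafana needs a restart to pick up the change.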

You can also reduce these spurious notifications by setting the rule’s error handling to “Keep Last State”.