I’ve configured several alerts in Grafana that query data from Prometheus (both hosted by Grafana in the cloud). All of the alerts regularly trigger a “DatasourceError”. This happens at least once a day, sometimes more often, and of course leads to spam in the Slack channel where all firing alerts are posted.
The alert logs hint at Prometheus not being available, or at least taking too long to answer:
The built-in Grafana Cloud dashboards tell me that at 11:59:00 on that day all queries had a maximum latency of 700 ms, which I think is reasonable. A few minutes later, however, the latency went up to around 3 s.
I hesitate to set the alert to ignore this type of error, because I want to know when there are problems with the monitoring.
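If I did decide to ignore it, I believe the relevant setting is the rule’s error-handling state (“Alert state if execution error or timeout” in the UI, `execErrState` in the provisioning API). Here is a minimal sketch of how that could be flipped per rule via the Alerting provisioning API, assuming a hypothetical stack URL, service-account token, and rule UID supplied through the `GRAFANA_URL`, `GRAFANA_TOKEN`, and `RULE_UID` environment variables; please treat it as an illustration rather than a confirmed recipe:

```python
import os
import requests

# Hypothetical values -- replace with your own Grafana Cloud stack URL, a
# service-account token with alerting permissions, and the rule's UID.
GRAFANA_URL = os.environ.get("GRAFANA_URL", "https://your-stack.grafana.net")
TOKEN = os.environ["GRAFANA_TOKEN"]
RULE_UID = os.environ["RULE_UID"]

headers = {
    "Authorization": f"Bearer {TOKEN}",
    "Content-Type": "application/json",
    # Likely needed when the rule was created in the UI rather than via
    # provisioning, otherwise the provisioning API may refuse to modify it.
    "X-Disable-Provenance": "true",
}

# Fetch the current rule definition.
resp = requests.get(
    f"{GRAFANA_URL}/api/v1/provisioning/alert-rules/{RULE_UID}",
    headers=headers,
    timeout=30,
)
resp.raise_for_status()
rule = resp.json()
print("current execErrState:", rule.get("execErrState"))

# Treat execution errors/timeouts as "OK" instead of firing a DatasourceError
# notification. "Alerting" and "Error" are the other states I know of.
rule["execErrState"] = "OK"

resp = requests.put(
    f"{GRAFANA_URL}/api/v1/provisioning/alert-rules/{RULE_UID}",
    headers=headers,
    json=rule,
    timeout=30,
)
resp.raise_for_status()
print("updated execErrState:", resp.json().get("execErrState"))
```

But as said, I’d rather not silence these errors entirely.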
Grafana is configured to run in the zone “EU Germany - aws eu-central-1 - prod-eu-west-2 - prod-eu-west-2”. Prometheus runs on “EU Germany - aws eu-central-1 - prod-eu-west-2 - mimir-prod-24”. Currently I’m on the free plan.
Hi @michaelschwarz! If you haven’t already done so, please open a support ticket so the team can take a closer look. It’s possible there is a problem with the alert rule configuration; however, the metrics instance is a managed data source, so I’d like to make sure there are no unknown issues there.
Is there a solution to this problem? We also receive a DatasourceError at irregular intervals with the message `last connection error: connection error: desc = "transport: Error while dialing: dial tcp 10.9.x.x:10000: connect: connection refused"`. We have 3 alert rules for a Prometheus data source in Grafana Cloud.