Alerts fire a DatasourceError regularly

Hello,

I’ve configured several alerts in Grafana that query data from Prometheus (both hosted in Grafana Cloud). All of these alerts regularly trigger a “DatasourceError”, at least once a day and sometimes more often. This of course spams the Slack channel that all firing alerts are posted to.

The alert logs hint at Prometheus not being available, or at least taking too long to answer:

2024-01-25 11:59:00.000	 {"schemaVersion":1,"previous":"Normal","current":"Error","error":"last connection error: connection error: desc = \"transport: Error while dialing: dial tcp 10.9.**.***:10000: i/o timeout\"","values":{},"condition":"triggers","dashboardUID":"","panelID":0,"fingerprint":"*****","ruleTitle":"App State","ruleID":1,"ruleUID":"*****","labels":{"alertname":"App State","grafana_folder":"App"}}

The built-in Grafana Cloud dashboards tell me that at 11:59:00 on that day all queries had a maximum latency of 700 ms, which I think is reasonable. A few minutes later, however, the latency rose to around 3 s.

I hesitate to set the alert to ignore this type of error, because I want to know when there are problems with the monitoring.

Grafana is configured to run in the zone “EU Germany - aws eu-central-1 - prod-eu-west-2 - prod-eu-west-2”. Prometheus runs on “EU Germany - aws eu-central-1 - prod-eu-west-2 - mimir-prod-24”. Currently I’m on the free plan.

Is there any way to fix this?

Greetings,
Michael

Hi @michaelschwarz! If you haven’t already done so, please open a support ticket so the team can take a closer look. It’s possible there is a problem with the alert rule configuration; however, the metrics instance is a managed data source, so I’d like to make sure there are no unknown issues there.
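
In the meantime, if you want to double-check the rule configuration side, the setting that turns evaluation failures into these DatasourceError notifications is the rule’s error-state handling, which the Alerting provisioning API exposes as execErrState. Here’s a rough sketch of how you could inspect it; the stack URL, service account token, and rule UID are placeholders you’d fill in yourself, and the endpoint and field names are from the provisioning API, so please adjust them for your Grafana version:

```python
import requests

# Placeholders -- use your own stack URL, a service account token with
# Alerting permissions, and the rule UID from the alert log above.
GRAFANA_URL = "https://<your-stack>.grafana.net"
TOKEN = "<service-account-token>"
RULE_UID = "<rule-uid>"

headers = {"Authorization": f"Bearer {TOKEN}"}

# Read the rule definition from the Alerting provisioning API.
resp = requests.get(
    f"{GRAFANA_URL}/api/v1/provisioning/alert-rules/{RULE_UID}",
    headers=headers,
    timeout=30,
)
resp.raise_for_status()
rule = resp.json()

# execErrState controls what happens when a rule evaluation errors out:
# "Error" is what produces DatasourceError notifications, "Alerting" fires
# the rule itself instead, "OK" suppresses the error (and hides real
# outages), and newer Grafana versions also offer "KeepLast".
print("execErrState:", rule.get("execErrState"))
print("noDataState: ", rule.get("noDataState"))
```

Keeping the setting on “Error” is reasonable if you want to know about monitoring problems; another option that doesn’t hide them is a notification policy that routes alerts with the label alertname=DatasourceError to a separate contact point, so they stay out of your main Slack channel.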

Is there a solution to this problem? We also receive a DatasourceError at irregular intervals, with the message “last connection error: connection error: desc = "transport: Error while dialing: dial tcp 10.9.x.x:10000: connect: connection refused"”. We have 3 alert rules for a Prometheus data source in Grafana Cloud.

Greetings,
Titus

There is no solution yet. I’ve opened a ticket and will post an update here once there is one :slight_smile:

We had to open a ticket last week too. Grafana support “added the case to their escalation”.

I have this too. It stopped for a week or so, but now it’s back… just randomly during the night, for roughly 20 minutes at a time.

We had a call with Grafana Labs last week; there are issues with DNS, which they are trying to fix.

Just want to add my case. The same behavior started 2 days ago. I use AWS Managed Prometheus.

Hello, I just recently started facing this issue. I’m on Grafana 11.1.0.
Has it not been resolved yet?

Along similar lines, I am not able to add an alerting rule. The data sources are working and are configured via Alloy. (This is Grafana Cloud.)