I’ve configured several alerts in Grafana that query data from Prometheus (both hosted by Grafana in the cloud). All of the alerts regularly trigger a “DatasourceError”. This happens at least once a day, sometimes more often, and of course leads to spam in the Slack channel where all firing alerts are posted.
The alert logs hint at Prometheus not being available, or at least taking too long to answer:
The built-in Grafana Cloud dashboards tell me that at 11:59:00 on that day all queries had a maximum latency of 700 ms, which I think is reasonable. A few minutes later, however, the latency went up to around 3 s.
I hesitate to set the alert to ignore this type of error, because I want to know when there are problems with the monitoring.
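If I did decide to ignore it, I believe the relevant setting is the rule’s error-handling state (“Alert state if execution error or timeout” in the UI, `execErrState` in the provisioning API). Here is a minimal sketch of how that could be flipped per rule via the Alerting provisioning API, assuming a hypothetical stack URL, service-account token, and rule UID supplied through the `GRAFANA_URL`, `GRAFANA_TOKEN`, and `RULE_UID` environment variables; please treat it as an illustration rather than a confirmed recipe:

```python
import os
import requests

# Hypothetical values -- replace with your own Grafana Cloud stack URL, a
# service-account token with alerting permissions, and the rule's UID.
GRAFANA_URL = os.environ.get("GRAFANA_URL", "https://your-stack.grafana.net")
TOKEN = os.environ["GRAFANA_TOKEN"]
RULE_UID = os.environ["RULE_UID"]

headers = {
    "Authorization": f"Bearer {TOKEN}",
    "Content-Type": "application/json",
    # Likely needed when the rule was created in the UI rather than via
    # provisioning, otherwise the provisioning API may refuse to modify it.
    "X-Disable-Provenance": "true",
}

# Fetch the current rule definition.
resp = requests.get(
    f"{GRAFANA_URL}/api/v1/provisioning/alert-rules/{RULE_UID}",
    headers=headers,
    timeout=30,
)
resp.raise_for_status()
rule = resp.json()
print("current execErrState:", rule.get("execErrState"))

# Treat execution errors/timeouts as "OK" instead of firing a DatasourceError
# notification. "Alerting" and "Error" are the other states I know of.
rule["execErrState"] = "OK"

resp = requests.put(
    f"{GRAFANA_URL}/api/v1/provisioning/alert-rules/{RULE_UID}",
    headers=headers,
    json=rule,
    timeout=30,
)
resp.raise_for_status()
print("updated execErrState:", resp.json().get("execErrState"))
```

But as said, I’d rather not silence these errors entirely.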
Grafana is configured to run in the zone “EU Germany - aws eu-central-1 - prod-eu-west-2 - prod-eu-west-2”. Prometheus runs on “EU Germany - aws eu-central-1 - prod-eu-west-2 - mimir-prod-24”. Currently I’m on the free plan.
Hi @michaelschwarz! If you haven’t already done so, please open a support ticket so the team can take a closer look. It’s possible there is a problem with the alert rule configuration; however, the metrics instance is a managed data source, so I’d like to make sure there are no unknown issues there.
Is there a solution to this problem? We also receive a DatasourceError at irregular intervals with the message `last connection error: connection error: desc = "transport: Error while dialing: dial tcp 10.9.x.x:10000: connect: connection refused"`. We have 3 alert rules for a Prometheus data source in Grafana Cloud.