We are getting a DatasourceError roughly once a day but are failing to find the root cause.
Our thinking is that this is due to a short network outage or a timeout.
The problem is that we don’t find any trace of it in the logs, even when we run Grafana with debug-level logging.
Is there any other way to zoom in on the root cause? Do we also need to configure Prometheus in a particular way, or increase its logging?
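For reference, this is roughly how we have debug logging enabled today on the Grafana side (a sketch of the relevant grafana.ini section); on the Prometheus side, the closest equivalent we know of would be starting the server with --log.level=debug:

```ini
; grafana.ini - what we currently run with
[log]
; Global log level; valid values include debug, info, warn, error (default is info).
level = debug
```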
You might also find Grafana’s own metrics helpful: they show datasource response times and HTTP errors (if any). Also, if you don’t want to receive DatasourceError alerts, you might want to check the other options under the Configure no data and error handling toggle on the alert creation screen. By default it is set to No Data / Error, so a single no-data or error result fires the alert. You might want to change it to OK / Alerting / Keep Last State.
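If it helps, Grafana exposes its own internal metrics at the /metrics endpoint by default, so one way to get at those numbers is to scrape it with Prometheus and graph the datasource-related series from there. A minimal sketch, assuming Grafana is reachable at grafana:3000 (adjust the target to your deployment):

```yaml
# prometheus.yml - scrape job for Grafana's built-in metrics endpoint
scrape_configs:
  - job_name: grafana
    metrics_path: /metrics          # Grafana's default path for internal metrics
    static_configs:
      - targets: ['grafana:3000']   # placeholder host:port
```

The exact names of the datasource request series vary between Grafana versions, so it is worth browsing the raw /metrics output of your instance to see what is available.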
| HTTP errors (if any)
I saw that you can get a nice visualisation on the alerting timeline, but does this show any more information than it can get from the logs? I suspect not, right? Loki ingests these logs, so it doesn’t add information AFAIK.
| Also, if you don’t want to receive DatasourceError alerts, you might want to check the other options under the Configure no data and error handling toggle on the alert creation screen
We are doing that at the moment, but this masks connection problems, so it is only a temporary solution.
I don’t know about the visualization; I don’t have Loki installed, so I haven’t seen it. Grafana’s metrics can be helpful, as they show query times etc., but I’m not sure they appear in the timeline. We use those metrics to monitor our datasource (you can also check whether your datasource exposes metrics of its own).
As for the connection problems being masked: everyone I’ve ever worked with (my professors in college and colleagues at work) has said that connectivity issues are a certainty in distributed systems. If they are really frequent (I think I missed whether they are or not), I’d turn to the metrics exposed by Grafana and by your datasource to confirm or rule out the network as the cause. Also, how many queries are there? Maybe your datasource is not powerful enough (more CPU and memory)?
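To give a rough idea of what I mean by checking the datasource’s own metrics and load, here is a sketch of the kind of queries we use. It assumes the datasource is Prometheus and that it is scraped under job="prometheus"; the handler and job label values are assumptions that depend on your setup:

```promql
# Query request rate seen by Prometheus itself, split by HTTP status code
sum by (code) (rate(prometheus_http_requests_total{handler="/api/v1/query"}[5m]))

# Resource usage of the Prometheus process, to spot CPU or memory pressure
rate(process_cpu_seconds_total{job="prometheus"}[5m])
process_resident_memory_bytes{job="prometheus"}
```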
| Grafana’s metrics can be helpful, as they show query times etc.
That is what I gather from the link you sent: it has a nice visualisation but doesn’t provide extra information about the root cause.
| As for the connection problems being masked: everyone I’ve ever worked with
True, you have to prepare for failure. We can do this with retries or increased timeouts, but what I am asking here is how to enable logging around the issue so we know what is causing it.
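For concreteness, the timeout knob we have in mind is the data proxy timeout in grafana.ini; a sketch, assuming the datasource is queried through Grafana’s proxy (the value shown is the documented default):

```ini
[dataproxy]
; How long Grafana waits for a datasource response before timing out, in seconds.
; Raising it only helps if the outage is shorter than the timeout itself.
timeout = 30
```

But that is a mitigation, not a diagnosis, which is why we are after better logging.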
| If they are really frequent
It happens about once a day. The problem is that when it occurs at night, the on-call engineers are woken up…
We found the log line that reports the network failure:
logger=ngalert.scheduler rule_uid=fdwvfo9oxnvgge org_id=1 t=2024-10-03T09:02:24.03889848Z level=error msg="Failed to evaluate rule" version=10 fingerprint=0e0f94072e47ac8e attempt=1 now=2024-10-03T09:02:20Z error="the result-set has errors that can be retried: [sse.dataQueryError] failed to execute query [A]: ReadObject: expect { or , or } or n, but found S, error found in #1 byte of …|Service Una|…, bigger context …|Service Unavailable|…"
The network is down for about 15-20 seconds.
The max_attempts setting is 1 by default, and I have bumped it to 40 to survive these blips.
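For reference, the change amounts to this in grafana.ini (a sketch; with unified alerting the setting lives in the [unified_alerting] section):

```ini
[unified_alerting]
; Maximum number of times a rule evaluation is attempted before it is reported as failed.
; Bumped from 1 (the default in our version) to 40 so that 15-20 second network blips
; are retried through instead of firing a DatasourceError alert.
max_attempts = 40
```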
One thing to note is that a retry happens every second, whereas I would have expected an (exponential) backoff strategy.