Intermittent alertname="DatasourceError"

Hello, we’re on Grafana ver 10.0.3, Prometheus ver 2.2.71

We’re experiencing intermittent alert message failures. I have an email alert configured to send a two-part message: a static portion stating that a threshold has been exceeded and what that threshold is, and a dynamic portion reporting the instant value of the metric in question. This is a sample good notification:

Slurm DBD Queue Size has exceeded threshold of 100: Value is: 5218

However, often we get a message like this:
Slurm DBD Queue Size has exceeded threshold of 100: Value is: [no value]

Thereafter a resolved message is issued. At the next firing, a message with either a value or [no value] is sent. [no value] seems to occur at random. Here’s a Grafana log entry associated with a [no value] message (my bold):

logger=alertmanager org=3 t=2023-09-15T13:19:40.018363511-04:00 level=debug component=alertmanager orgID=1 component=dispatcher aggrGroup=“{}/{DBD_QUEUE="100"}:{DBD_QUEUE="100", alert_rule_uid="UmxQn1vVk", alertname="DatasourceError", datasource_uid="8A6i03eMk", grafana_folder="Slurm", ref_id="E", rulename="Slurm_dbd_queue_size"}” msg=flushing **alerts=[DatasourceError[**5593591][active]]

What would be the reason that Grafana recognizes an alert condition and issues the alert, but only sometimes includes the alert value? Is there a timeout condition occurring, and can I extend the timeout so that the alert includes the value? I have two alerts behaving this way.

Thanks for any help.

Hi! :wave: A DatasourceError alert occurs when Grafana cannot query the datasource successfully for a given evaluation. There should be an error message in your logs explaining exactly why this happened. For example, Prometheus could have taken > 30 seconds to answer the query causing a timeout, the server could have been down or restarted, there was a network issue etc.

Your alert notification says [no value] because when Grafana doesn’t get an answer from the datasource a value cannot be computed.

Thanks for coming back so quickly, George. I went back to the log and found this near the DatasourceError:

level=warn msg=“Tick dropped because alert rule evaluation is too slow” rule_uid=UmxQn1vVk org_id=1 time=2023-09-15T13:19:30-04:00

Tick?

Also

now=2023-09-15T13:19:10-04:00 rule_uid=UmxQn1vVk org_id=1 t=2023-09-15T13:19:40.001543221-04:00 level=error msg=“Failed to evaluate rule” error=“failed to execute query E:
Post "http://<prometheus_server_fqdn>:9090/api/v1/query_range": net/http: timeout awaiting response headers (Client.Timeout exceeded while awaiting headers)” duration=30.000839859s

Not sure what to do with either.

This means that it’s taking too long to evaluate the rule compared to the evaluation interval chosen when creating the rule. For example, if you want the rule to be evaluated every 10 seconds, but the query takes 30 seconds and then times out, it’s impossible for Grafana to do 30 seconds of work every 10 seconds, as the work would accumulate forever. In this case, Grafana drops the later evaluations; these are referred to as dropped ticks.
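To make the arithmetic concrete, here is a minimal sketch of the 10-second interval / 30-second query example. It is a simplified model, not Grafana's actual scheduler: the function name `simulate_ticks` and the rule "drop a tick if the previous evaluation is still running" are assumptions made for illustration.

```python
def simulate_ticks(interval_s, eval_duration_s, total_s):
    """Return (evaluated, dropped) tick times for a single rule.

    Simplified model (an assumption, not Grafana's real scheduler):
    a tick fires every interval_s seconds, and a tick is dropped if
    the previous evaluation has not finished yet.
    """
    evaluated, dropped = [], []
    busy_until = 0  # time at which the running evaluation finishes
    for tick in range(0, total_s, interval_s):
        if tick < busy_until:
            dropped.append(tick)
        else:
            evaluated.append(tick)
            busy_until = tick + eval_duration_s
    return evaluated, dropped


# Evaluate every 10 s, each query hangs for 30 s, observe one minute.
evaluated, dropped = simulate_ticks(interval_s=10, eval_duration_s=30, total_s=60)
print("evaluated:", evaluated)  # evaluated: [0, 30]
print("dropped:", dropped)      # dropped: [10, 20, 40, 50]
```

In this model only one tick in three is actually evaluated, which is why the warning appears right next to the timeout error.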

This means the Prometheus query is too slow. Probably there is too much data being queried for Prometheus to answer in 30 seconds. Can you reduce the amount of data being queried?

Too much data queried by the Alerts, or too much data being queried by Prometheus overall? I would not know how to limit the query made by the Alert, as the example given is already for only one metric from only three nodes. The other Alert that fails queries one node for one metric. That’s already not much querying. If you’re referring to queries by Prom overall: should we cut back on our whole monitoring scheme so that two alerts provide data consistently?

And that begs the question: why does the Alert notification work correctly sometimes, but not always? Shouldn’t it consistently fail if “too much” data were being queried? Can we increase the timeout?

Too much data queried by the Alerts or too much data being queried by Prometheus overall?

Since it sounds like the timeout only occurs some of the time, I suspect the problem was too much load on the Prometheus server at the time the error occurred. That can be due to a number of reasons. There might have been a higher number of queries than normal being executed on the Prometheus server, increasing load, or there might have been an increase in series that meant the query returned lots of series. I’m afraid it’s impossible to tell from here. The only way to really find out for sure is to make sure you have monitoring on your Prometheus server and then use it to figure out what is happening around the times these errors occur.
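As one possible starting point, the sketch below builds a `query_range` URL that looks at Prometheus's own query-latency metric in a window around the error timestamp from the log. It assumes Prometheus scrapes itself so that `prometheus_engine_query_duration_seconds` is available; the host `prometheus.example`, the helper name `query_range_url`, the 10-minute window, and the 15s step are placeholders chosen for illustration.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode


def query_range_url(base_url, promql, center, window_minutes=10, step="15s"):
    """Build a Prometheus /api/v1/query_range URL centered on `center`."""
    start = center - timedelta(minutes=window_minutes)
    end = center + timedelta(minutes=window_minutes)
    params = urlencode({
        "query": promql,
        "start": start.timestamp(),  # Unix timestamps are accepted
        "end": end.timestamp(),
        "step": step,
    })
    return f"{base_url}/api/v1/query_range?{params}"


# Time of the failed evaluation in the log above (UTC-04:00).
error_time = datetime(2023, 9, 15, 13, 19, 40,
                      tzinfo=timezone(timedelta(hours=-4)))

# 99th percentile of query evaluation time as reported by Prometheus itself.
promql = 'prometheus_engine_query_duration_seconds{slice="inner_eval",quantile="0.99"}'

url = query_range_url("http://prometheus.example:9090", promql, error_time)
print(url)
```

Opening the resulting URL (or running the same expression in a Grafana panel) shows whether query latency on the server spiked around the time of the DatasourceError.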

Can we increase the timeout?

Yes, sure. You’ll want to increase the timeout in the Grafana configuration file https://github.com/grafana/grafana/blob/main/conf/defaults.ini#L1104 and also in the datasource configuration in the user interface.

I would suggest setting the error handling to Alerting and then configuring the For period, so that at least 2 consecutive errors must happen before an alert is generated.
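A rough sketch of why a For period filters out a single failed evaluation. This is a simplified model of a pending period, not Grafana's state manager; the function name `first_alert_index` and the 60-second values are assumptions for illustration.

```python
def first_alert_index(results, interval_s, for_s):
    """Return the index of the evaluation at which the alert would fire.

    results: list of booleans, True when that evaluation errored.
    Simplified model (an assumption): the rule goes pending on the first
    error, resets on any successful evaluation, and fires once it has
    been pending for at least for_s seconds.
    """
    pending_since = None
    for i, errored in enumerate(results):
        if not errored:
            pending_since = None  # a good evaluation clears the pending state
            continue
        if pending_since is None:
            pending_since = i
        if (i - pending_since) * interval_s >= for_s:
            return i
    return None


# Evaluate every 60 s with For = 60 s.
isolated = [False, True, False, True, False]  # errors never repeat back to back
repeated = [False, True, True, False]         # two consecutive errors

print(first_alert_index(isolated, interval_s=60, for_s=60))  # None
print(first_alert_index(repeated, interval_s=60, for_s=60))  # 2
```

In this model, setting For equal to the evaluation interval means an isolated timeout never notifies, while two errors in a row do.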

@jangaraj

I’m getting the same alert even after setting nodata=ok.
How do you define these 2 consecutive errors in alert rules?