Lots of DatasourceErrors for Cloud Loki and Prometheus

Hi,

I’m planning to migrate our legacy monitoring system to something more modern and want to try out a few stacks.

So I started with the free cloud plan and configured Promtail and Prometheus to push some logs and metrics to the cloud Loki and Prometheus instances. I also set up a few alerts for a certain message appearing in the logs and some alerts for node exporter metrics, so a pretty basic setup for now.

Within the last day, I got three “DatasourceError” notifications (“failed to execute query: … context deadline exceeded”) at different times for the cloud Prometheus/Loki.

Getting notifications about the monitoring system itself being degraded is something I wouldn’t expect that regularly from a stable cloud offering. Is this normal behaviour/reliability that one would expect from Grafana Cloud? Is there an ongoing issue (the status dashboard has no reported incidents)? Or is it just a perk of the free plan?

The error you are receiving is not a database timeout or degradation error, but rather an alert evaluation query timeout: the alert evaluations run too frequently and then time out. The default evaluation timeout in Grafana Cloud is 30 seconds.

You can avoid hitting this false error state by changing the alert state for errors and timeouts to Alerting. A failed evaluation then has to persist for the whole pending period before the alert fires, which gives the alert a greater threshold before it counts as in breach.

Now, when the alert evaluation fails, the alert will go into a Pending state, and if the next evaluation is successful, it gets back to Normal and you do not get notified.
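To make the behaviour described above concrete, here is a minimal sketch of the Normal / Pending / Alerting transitions. It is a simplified model for illustration only, not Grafana's actual implementation: the function name `simulate`, the result labels, and the 1-minute interval and 5-minute pending period are all assumptions chosen for the example.

```python
from datetime import timedelta


def simulate(results, interval, pending):
    """Return the alert state after each evaluation.

    results  -- one entry per evaluation: "ok", "breach", or "error"
    interval -- time between evaluations (hypothetical value)
    pending  -- how long a breach must last before firing (the "For" time)
    """
    states = []
    breach_started = None  # elapsed time at which the current breach run began
    for i, result in enumerate(results):
        now = interval * i
        # Assumption: errors and timeouts are mapped to Alerting,
        # so they are treated like a normal threshold breach.
        breaching = result in ("breach", "error")
        if not breaching:
            breach_started = None
            states.append("Normal")
            continue
        if breach_started is None:
            breach_started = now
        if now - breach_started >= pending:
            states.append("Alerting")  # a notification would be sent here
        else:
            states.append("Pending")  # no notification yet
    return states


if __name__ == "__main__":
    every = timedelta(minutes=1)
    wait = timedelta(minutes=5)
    # A single timed-out evaluation never leaves Pending.
    print(simulate(["ok", "error", "ok"], every, wait))
    # A failure that lasts the whole pending period does fire.
    print(simulate(["error"] * 7, every, wait))
```

In this model a one-off timeout only produces a Pending state and then returns to Normal, while a failure that persists for the full pending period still ends up in Alerting.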

You can read more documentation about the different states here.

If the issue persists, you can open a Support ticket with us and we can change the default timeout period to 60s to see if that helps further.


@ximenaaliaguilla Not sure whether this is outdated, but the official docs do NOT state that setting “Alert state if no data or all values are null” to Alerting causes the rule to be reevaluated during the pending time. Instead, they say the alert is triggered right after the For time (under “All rules in the selected group are evaluated every 5m”):

The alert rule waits until the time set in the For field has finished before firing.

Therefore I would expect the “No Data” alert after 5 minutes, if the next evaluation also returns “No Data”.

Where did you get the information that “Alerting” will also reevaluate during the pending time?