Seeking advice about data source and alerts dependencies

Since switching to Grafana alerting, we experienced a flood of notifications when one heavily used data source failed. Every notification was for the DatasourceError that each alert raised.

I’m seeking advice about how to manage this situation because the flood of notifications actually makes it hard to understand what is going on (although in this case it was pretty clear after the mild panic attack that the whole infrastructure had disappeared).

Would it be good practice to have an alert with a very simple query against a data source and then somehow define that alert as a dependency for all alerts that use that data source? I couldn’t find much information on whether Grafana supports dependencies. Are evaluation groups useful in this case?

It’d be awesome if we had a per-datasource toggle that said something like “stop all alerts that depend on this datasource and send a high-priority alert”, or some place to define that the data source has a sort of canonical/major alert linked to it that should suppress everything else when that alert fires.

Any thoughts?

In the case of a data source failure, alerting creates a special alert with the label alertname=DatasourceError and another label, datasource_uid=<uid>, that contains the UID of the data source that caused the failure. This behavior is controlled per rule by the setting “Alert state if execution error or timeout”. When it is set to Error, alerting behaves as described above. The options Alerting and OK instead switch all current states of the rule to Alerting or OK, respectively.

Therefore, if you use the option ‘Error’ for that setting, you will see the DatasourceError alert. To avoid a storm of notifications, you can create a notification policy with the matcher alertname=DatasourceError that groups by the label alertname. If you want separate notifications per data source, add the label ‘datasource_uid’ to the group_by. This way, you will get only one notification per incident.
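
To illustrate why grouping collapses the flood, here is a minimal Python sketch of the bucketing idea. It is only a toy model, not Grafana's implementation: the function group_alerts, the "rule" label, and the UIDs "prom-main" and "loki-logs" are made up for the example.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts by the values of the group_by labels.

    Toy model: each bucket stands for one notification.
    """
    groups = defaultdict(list)
    for labels in alerts:
        # Alerts that share the same values for every group_by label
        # end up in the same bucket.
        key = tuple((name, labels.get(name, "")) for name in group_by)
        groups[key].append(labels)
    return dict(groups)


# Hypothetical incident: five rules fail against one data source,
# two rules fail against another.
alerts = [
    {"alertname": "DatasourceError", "datasource_uid": "prom-main", "rule": f"rule-{i}"}
    for i in range(5)
] + [
    {"alertname": "DatasourceError", "datasource_uid": "loki-logs", "rule": f"rule-{i}"}
    for i in range(5, 7)
]

by_name = group_alerts(alerts, ["alertname"])
by_name_and_ds = group_alerts(alerts, ["alertname", "datasource_uid"])

print(len(by_name))         # 1 bucket: a single notification for the whole incident
print(len(by_name_and_ds))  # 2 buckets: one notification per failed data source
```

So seven failing rules become one notification when grouped by alertname alone, or two when datasource_uid is part of the group_by.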

Alertmanager (the service that sends notifications) has inhibition rules, which basically do what you describe: while a specific alert is firing, notifications for the alerts that the rule names as targets are automatically muted. This happens at the notification layer, though, which means that the rules are still evaluated. Unfortunately, this functionality is currently not available in Grafana alerting, but we plan to enable it.
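
For a rough idea of what an inhibition rule decides, here is a simplified Python sketch. It assumes plain equality matchers and is not Alertmanager's actual code. The helper is_inhibited and the health-check alert "DatasourceDown" are hypothetical names for the kind of canonical per-datasource alert described in the question.

```python
def is_inhibited(target, firing, source_match, target_match, equal):
    """Return True if some firing source alert mutes the target alert.

    Simplified model: matchers are exact label equality only.
    """
    def matches(labels, matchers):
        return all(labels.get(k) == v for k, v in matchers.items())

    if not matches(target, target_match):
        return False
    for source in firing:
        if source is target:
            continue  # an alert does not inhibit itself
        same_scope = all(source.get(l) == target.get(l) for l in equal)
        if matches(source, source_match) and same_scope:
            return True
    return False


# Hypothetical health-check alert that fires when the data source is down.
firing = [{"alertname": "DatasourceDown", "datasource_uid": "prom-main"}]

rule = dict(
    source_match={"alertname": "DatasourceDown"},
    target_match={"alertname": "DatasourceError"},
    equal=["datasource_uid"],  # only mute alerts for the same data source
)

same_ds = {"alertname": "DatasourceError", "datasource_uid": "prom-main"}
other_ds = {"alertname": "DatasourceError", "datasource_uid": "loki-logs"}

print(is_inhibited(same_ds, firing, **rule))   # True: notification is muted
print(is_inhibited(other_ds, firing, **rule))  # False: different data source
```

Note that nothing here stops rule evaluation. The muted alert still exists; only its notification is suppressed.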
