For the new unified alerting, what is the ideal way of catching errors that are due to the datasource?
For instance, most (if not all) of our alerts have the “execErrState” field set to “Alerting”, which means that if a datasource is erroring, every alert that uses it fires. One way around this is to create and manage a dedicated alert per datasource to catch the error, and set every other alert’s “execErrState” to “OK”, so we get a single alert for a downed datasource instead of however many alerts happen to use it.
I’ve also seen in the docs that we can set the field to “Error”. Although the alert would then fire as a DatasourceError (according to the docs), wouldn’t you still get flooded by each individual alert if you have many alerts set up against that datasource?
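To make my workaround concrete, this is roughly what I mean, as a minimal sketch in the file-provisioning format (I’m assuming Grafana 9+ unified alerting file provisioning; the UIDs, folder, titles and datasource UID below are placeholders and the query models are elided):

```yaml
apiVersion: 1
groups:
  - orgId: 1
    name: datasource-health        # placeholder group name
    folder: alerts                 # placeholder folder
    interval: 1m
    rules:
      # Dedicated "catch" rule for the datasource: only this one reports errors.
      - uid: prom-health           # placeholder UID
        title: Prometheus datasource health
        condition: A
        data:
          - refId: A
            relativeTimeRange:
              from: 600
              to: 0
            datasourceUid: PROM_UID   # placeholder datasource UID
            model: {}                 # query model elided in this sketch
        execErrState: Error           # fires as DatasourceError when the query errors
        noDataState: OK
        for: 1m

      # A regular alert on the same datasource: stays quiet on datasource errors.
      - uid: high-latency           # placeholder UID
        title: High request latency
        condition: A
        data:
          - refId: A
            relativeTimeRange:
              from: 600
              to: 0
            datasourceUid: PROM_UID
            model: {}                 # query model elided in this sketch
        execErrState: OK              # don't fire this rule when the datasource errors
        noDataState: OK
        for: 5m
```

With that split, a datasource outage would (if I understand it right) produce one notification from the health rule instead of one per rule.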
Unless, of course, I’ve misunderstood the above. The main issue is receiving tons of alerts for what is really a single datasource issue. Thanks in advance.
There are two issues here. On the one hand, there is the spam sent by alerts that use a datasource which errors out or is no longer available.
On the other hand, only some metrics from that datasource might be unavailable, because the actual sources behind it can be very different (Prometheus, to give a very common example, scrapes many different targets). So it’s really hard to split this up logically or practically.
One thing that improves matters slightly is setting the alerts’ no data/error handling to ‘Alerting’. The advantage is that the alerts don’t get triggered immediately (after about 1 minute in my case, with Discord as the contact point) but only after whatever the pending period is for that alert. So that’s useful.
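In the same file-provisioning format as in your example above, the relevant part is just these fields (a sketch; everything else is placeholder and the queries are elided):

```yaml
rules:
  - uid: high-latency        # placeholder UID
    title: High request latency
    condition: A
    data: []                 # queries elided in this sketch
    noDataState: Alerting    # treat "no data" like a normal pending alert
    execErrState: Alerting   # treat datasource errors like a normal pending alert
    for: 5m                  # the alert only fires after 5 minutes in pending
```

Because the error state goes through the same pending period, a brief datasource hiccup shorter than the `for` duration shouldn’t notify at all.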
Unfortunately this doesn’t solve the spam issue: if a datasource errors out or becomes completely unreachable for longer than that, all alerts bound to it still start going off and spam the destination channel. So I’m not sure how you can reasonably satisfy both of these needs.