I have the latest kube-prometheus-stack chart installed on my cluster, with Grafana alerts configured for some important metrics (CPU usage and pod status, for example).
My issue is that every time the Prometheus pod restarts (whether I restart it or it restarts on its own), my alerts go into the DataSourceError state and immediately send notifications to my Discord channel. It looks like a flood of errors, even though they are all essentially false positives.
Is there a way to set a global timeout for this Error state to a larger value? For example 5 minutes, so Prometheus can restart and replay its WAL without alerts flooding my channel?
If there is a way to set this value can you point me there?
Hi, I don’t know of such a timeout value, but when I had datasource problems we changed the rule’s "Alert state if execution error or timeout" setting from Error to Alerting / Keep Last State.
With Error, the alert fires as soon as the first error is recorded, and I’m not aware of a way to configure a delay for that. With Alerting, an execution error is treated like a single breach of the threshold. I’d recommend Keep Last State, since that option has been brought back in recent Grafana versions.
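If you manage your alert rules through file provisioning, this is roughly where that setting lives. A minimal sketch, assuming file-provisioned rules and a Grafana version that supports Keep Last State (older versions only offer OK, Alerting, and Error); the group name, folder, uid, title, query, and datasource uid are placeholders for your own rule:

```yaml
# grafana/provisioning/alerting/cpu-alerts.yaml (sketch)
apiVersion: 1
groups:
  - orgId: 1
    name: cpu-alerts              # placeholder group name
    folder: kubernetes            # placeholder folder
    interval: 1m
    rules:
      - uid: cpu-usage-high       # placeholder uid
        title: CPU usage high
        condition: B
        data:
          # A: the Prometheus query (this is what errors while the pod restarts)
          - refId: A
            relativeTimeRange:
              from: 600
              to: 0
            datasourceUid: prometheus   # placeholder: your Prometheus datasource uid
            model:
              refId: A
              instant: true
              expr: sum(rate(container_cpu_usage_seconds_total[5m]))
          # B: threshold expression that turns A into the alert condition
          - refId: B
            datasourceUid: "__expr__"
            model:
              refId: B
              type: threshold
              expression: A
              conditions:
                - evaluator:
                    type: gt
                    params: [2]
        for: 5m
        noDataState: NoData
        # What to do on execution errors (e.g. Prometheus restarting):
        # Error fires a DataSourceError alert immediately, Alerting treats the
        # error like a threshold breach, OK / KeepLast avoid the error notification.
        execErrState: KeepLast
```

If you edit the rule in the UI instead, the same setting is under "Configure no data and error handling" → "Alert state if execution error or timeout".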
Besides changing the NoData/Error state handling, you could also add a notification policy that matches the DataSourceError alerts (for one alert rule or for all of them) and decides how those alerts are grouped and when they are delivered.
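With file-provisioned notification policies that could look roughly like this. A minimal sketch, assuming a contact point named discord already exists and that your Grafana version labels these generated alerts with alertname=DatasourceError (check the labels on an actual notification to confirm):

```yaml
# grafana/provisioning/alerting/notification-policies.yaml (sketch)
apiVersion: 1
policies:
  - orgId: 1
    receiver: discord                 # placeholder: your default contact point
    group_by: ['grafana_folder', 'alertname']
    routes:
      # Route the auto-generated error alerts separately, so a Prometheus
      # restart has time to finish before anything is delivered.
      - receiver: discord
        object_matchers:
          - ['alertname', '=', 'DatasourceError']
        group_by: ['alertname']
        group_wait: 5m                # wait before sending the first notification
        group_interval: 10m
        repeat_interval: 4h
```

The idea is that if Prometheus comes back within the group_wait window, the error alerts should resolve before the first notification for that group is ever sent.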