This is similar to many issues posted here in the past; I'm not sure whether it's just a restating of those or a novel problem.
Dead-man alerts: a node is considered down when data stops arriving for a series. Once the query's look-back window moves past the last data point that fired the alert, the labels on the alert disappear, because those labels were attached to the data that is now missing.
In the past, I’ve used Kapacitor to alert from InfluxDB. Kapacitor keeps an in-memory record of the measurement and labels for each alert rule, so the alert continues to fire, with the appropriate labels attached, long after the data has stopped flowing, until Kapacitor is restarted and loses its state records.
In Grafana, there don't appear to be any such state records.
For example, if the CPU-percent-used series for a host stops flowing for 5 minutes, with a 10-minute look-back, a new alert is generated. That alert carries the hostname and all the other labels associated with the host. But once the host has been down for more than 10 minutes, the labels all disappear, and our ticketing service sees what looks like a new, blank alert. “hostnameABC, East, PROD, database server is down” becomes “blank, blank, blank, blank is down”.
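To make the scenario concrete, here's roughly the kind of query behind such a rule (a Flux sketch assuming an InfluxDB data source; the bucket, measurement, field, and tag names are placeholders, not our exact rule):

```
// Count points per host over the look-back window
from(bucket: "telegraf")
  |> range(start: -10m)                      // the 10-minute look-back window
  |> filter(fn: (r) => r._measurement == "cpu" and r._field == "usage_percent")
  |> group(columns: ["host", "region", "env", "role"])
  |> count()
// Alert condition: fire when a host's count drops below 1 (no points in the window).
// Once the host has been gone longer than the window, the query returns no rows
// for it at all, so there is nothing left to attach host/region/env/role labels to.
```

That empty result set is exactly where the labels vanish from the alert.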
Where is state stored in Grafana alerting, or is it not stored at all? How do I scale this to thousands of hosts if there's no state data, or do I just need a different alerting engine?