Howdy, I’m rebuilding an alerting system that alerts on a load of stats across a load of servers, and also alerts on the absence of data, both over a time period.
CPU too high for 5 mins → alert!
No CPU stats for 5 mins → alert!
And these are all stateless, implicit queries… “Give me any host whose most recent CPU load stat is over 90%” rather than a hardcoded list of hosts to look for. So we never know whether we’re missing data for a particular host, or the host just never existed in the first place.
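To make that concrete, the evaluation is effectively doing something like this (a made-up Python sketch, not a real data source query; the function name, metric shape and the 90% figure are just illustrative):

```python
CPU_THRESHOLD = 90.0

def hosts_over_threshold(samples: dict[str, float]) -> set[str]:
    """Return every host whose most recent CPU reading exceeds the threshold.

    `samples` maps host -> most recent CPU load (percent) from the data source.
    A host that stopped reporting simply isn't in `samples` at all, so it
    can't show up in the result - there's no "unknown" outcome here.
    """
    return {host for host, load in samples.items() if load > CPU_THRESHOLD}

# A dead host just vanishes from the input, and therefore from the output:
print(hosts_over_threshold({"web-1": 97.0, "web-2": 12.0}))  # {'web-1'}
print(hosts_over_threshold({"web-2": 12.0}))                 # set() - web-1 gone, not "unknown"
```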
Here, though, I’m looking at the subtler logic for what should happen if, for example, the CPU goes nuts and kills the box. Potentially we get both alerts, but once the data stops arriving the CPU load becomes unknown, so what can / will / should happen to the threshold alert? What even IS the ideal scenario, and how would it be achieved? I guess ideal means the least noise for the same resolution. There are two takes on that.
- IF the CPU load alerts, it shouldn’t resolve until the data positively says it has resolved.
Here, I don’t know what Grafana does with an alert if its host stops appearing in the list of problematic hosts. Does a lack of data = resolved? I thought the NoData state would be handy, but here it isn’t, is it? NoData would mean ZERO results from an implicit query like that, which isn’t a useful signal. I feel like I’d rather any threshold alert just stayed in its current state until proven otherwise (see the rough sketch after this list), but I don’t think that’s a thing in Grafana, unless my query goes far enough back in time to pick up the last received value, even if it’s hours and hours old. That puts much more load on my data sources, and also makes calculating a few thousand 5-minute averages much harder, since we have no downsampling.
- The missing-data alerts should fire before the threshold alerts? That way, if high CPU kills the box, the threshold alert probably won’t have had time to fire before the missing-data one does, and the high-CPU condition disappears harmlessly into thin air. Unless it took a long time to finally kill the box, and, well, maybe there are some cases you can’t always account for.
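To pin down what I mean by the first option, here’s a rough Python sketch of the “hold the last known state until fresh data proves otherwise” logic I have in mind. It’s just the decision logic with made-up names and timings, not something I believe Grafana exposes as-is:

```python
from dataclasses import dataclass

CPU_THRESHOLD = 90.0
FRESHNESS_SECONDS = 5 * 60  # samples older than this count as "missing"

@dataclass
class HostState:
    alerting: bool = False       # last confirmed threshold state
    last_sample_ts: float = 0.0  # when we last heard from this host
    last_value: float = 0.0      # the value we last heard

def evaluate(state: HostState, now: float) -> dict:
    """Decide both alerts for one host, based on its last received sample."""
    fresh = (now - state.last_sample_ts) <= FRESHNESS_SECONDS
    if fresh:
        # Fresh data: the threshold alert is allowed to change state.
        state.alerting = state.last_value > CPU_THRESHOLD
        return {"missing_data": False, "cpu_high": state.alerting}
    # Stale or absent data: fire the missing-data alert, but neither resolve
    # nor newly fire the threshold alert - hold its last known state.
    return {"missing_data": True, "cpu_high": state.alerting}

# Host went quiet 10 minutes ago while alerting on high CPU:
s = HostState(alerting=True, last_sample_ts=0.0, last_value=95.0)
print(evaluate(s, now=600.0))  # {'missing_data': True, 'cpu_high': True}
```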
One reason we care so much about which alerts get created is that they’re going to create full-on incident tickets automatically, not just a red light. One ticket good, 5 tickets bad.
And I’m unsure whether it’s best to return ALL results to Grafana and get 2 Alerting and 1500 Normal, or to make the query only give back the Alerting ones in the first place.
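For what it’s worth, the thing nagging me about the filtered option is that a host which recovered and a host which died look identical from the alerting side, which is the same ambiguity as above (another made-up illustration):

```python
# Full result set: a recovered host is still present, just below threshold,
# so "recovered" and "missing" are distinguishable.
full_before = {"web-1": 97.0, "web-2": 12.0}
full_after  = {"web-2": 12.0}               # web-1 stopped reporting entirely

# Filtered result set: only hosts currently over the threshold come back.
filtered_before          = {"web-1": 97.0}
filtered_after_recovered = {}  # web-1 dropped to 40%...
filtered_after_dead      = {}  # ...or web-1 died. Same result either way.

print(filtered_after_recovered == filtered_after_dead)  # True - indistinguishable
```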
Any thoughts on how to tweak Grafana for the best strategy here would be really appreciated.