Alerting on consistent error states, ignoring spikes and dips

I’d like an email if my site is throwing lots of errors for an extended period of time (using InfluxDB, if that matters).

  • if(average(errorsPerMinute, 5mins) > 10) doesn’t work, because a single large spike can throw off the moving average
  • if(max(errorsPerMinute, 5mins) > 10) doesn’t work, because a single spike sets it off immediately
  • if(min(errorsPerMinute, 5mins) > 10) doesn’t work, because a single data point without errors will stop the alert from firing even if all the other data points are in error
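To make those three failure modes concrete, here is a small sketch in plain Python (not InfluxDB or Grafana syntax). The two sample windows are made-up values, chosen only to illustrate a lone spike and a lone dip.

```python
from statistics import mean

THRESHOLD = 10  # errors per minute, as in the question

# Hypothetical 5-minute windows of errorsPerMinute samples (illustrative values).
healthy_with_spike = [0, 0, 200, 0, 0]   # site is fine apart from one spike
failing_with_dip = [40, 35, 0, 50, 45]   # site is failing apart from one clean sample

# average: the single spike drags the mean up to 40, so the alert fires wrongly.
spike_mean_fires = mean(healthy_with_spike) > THRESHOLD

# max: the single spike trips the alert immediately.
spike_max_fires = max(healthy_with_spike) > THRESHOLD

# min: the single clean sample suppresses the alert during a real outage.
dip_min_fires = min(failing_with_dip) > THRESHOLD
```

The first two flags come out true (false alarms) and the last comes out false (a missed outage).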

Any ideas how else to do it?

I think I would like something like “send me an email if 50% of samples in the past 5 minutes are above 10” - that way it doesn’t matter if the site is slightly in error or hugely in error, and it doesn’t matter if the errors contain spikes or dips, I only get an email if most of our recent samples are above the threshold.

I’ve tried to make this happen by creating a query of isInError = errorsPerMinute > 10 ? 1 : 0 to get a time series of 1s and 0s, then alerting on average(isInError, 5mins) > 0.5 to mean “I am in error more than half the time”, but I can’t get the syntax for that to work with InfluxDB :frowning:
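For reference, the logic being described looks roughly like this in plain Python. It is only a sketch of the intended rule, not InfluxDB query syntax; the function names and the default `min_fraction` are assumptions made for illustration.

```python
def fraction_in_error(samples, threshold=10):
    """Share of samples in the window that are above the threshold."""
    if not samples:
        return 0.0  # assumption: an empty window counts as "not in error"
    # Same idea as isInError: 1 when the sample is above the threshold, else 0.
    flags = [1 if s > threshold else 0 for s in samples]
    return sum(flags) / len(flags)


def should_alert(samples, threshold=10, min_fraction=0.5):
    """True when more than min_fraction of the window is above the threshold."""
    return fraction_in_error(samples, threshold) > min_fraction
```

With this rule a window such as `[0, 0, 200, 0, 0]` does not alert (only 20% of samples are in error), while `[40, 35, 0, 50, 45]` does (80% are in error), regardless of how large the individual values are.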

Perhaps you could add a new Influx query that did a percentile or median on your data, to normalize it? You can then alert on this query with the Grafana alerting engine.

Oh yeah, I guess now that I think about it, “p20 is above threshold” and “80% of samples are above threshold” are equivalent (if I’m understanding the maths right?), so for my “50% of samples” case the median should do it. I shall try that :smiley:
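A quick way to sanity-check the percentile idea is to compare a median-based rule against a direct count of samples above the threshold. The sketch below is plain Python and uses `statistics.median_low`, which always returns an actual sample rather than interpolating. How a given InfluxDB median or percentile function treats even-sized windows is not assumed here, so results at the boundary may differ slightly.

```python
from statistics import median_low


def majority_above(samples, threshold):
    """True when strictly more than half of the samples exceed the threshold."""
    above = sum(1 for s in samples if s > threshold)
    return above * 2 > len(samples)


def median_above(samples, threshold):
    """True when the (low) median of the window exceeds the threshold."""
    return median_low(samples) > threshold
```

For any non-empty window the two functions agree: the low median sits above the threshold exactly when more than half of the samples do, so a single spike or dip cannot flip the result on its own.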