How to Alert based on Slope of Metric Graph?

I have an SQS Dead Letter Queue I am trying to alert on.
Grafana is getting the data about this queue from Cloudwatch via the Cloudwatch connector.

The queue’s behavior over the past week looks like this:

I am trying to configure an alert which fires whenever the slope of this queue is not negative.
How would I do that?

I saw this post before, which uses the diff() operator in the expression, but that approach no longer seems possible in Grafana 9: I can’t make the query a range query, I can only select the series “A”.

I use InfluxDB and have a similar need and found this to be perfect.

What if I don’t use InfluxDB? Is it possible to do something like rate based alerting using Math expressions?

I do not think the built-in math functions in Grafana allow one to alert on the rate of change (slope) of a time series. It is best to use the tools offered by your datasource.

Ah ok. Is it possible for that to be added to Grafana in the future? Rate of change based alerting seems like a pretty basic ask.

Feature requests can be started here:

You can use a time offset to get the change in queue size over a time window. Many of the standard Grafana functions will assume the DLQ count metric is a counter, rather than a gauge, and so will assume a reduction in queue size is simply a reset of the counter. Counters theoretically only go up but in practice get reset (e.g. by software restarts), so applying those functions to a gauge can give extremely inaccurate results.
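To make the counter-versus-gauge problem concrete, here is a minimal Python sketch. It is a simplification and not the exact algorithm of any particular datasource: `counter_increase` treats every drop as a reset to zero, which is roughly how counter-style functions behave, while `gauge_delta` just takes last minus first. The function names and sample values are made up for illustration.

```python
def counter_increase(values):
    """Increase as a counter-style function would see it (simplified).

    Any drop is treated as a reset to zero, so the value after the drop
    is counted as new growth.
    """
    total = 0.0
    for prev, cur in zip(values, values[1:]):
        total += cur - prev if cur >= prev else cur
    return total


def gauge_delta(values):
    """Plain change between the first and last sample of a gauge."""
    return values[-1] - values[0]


# Hypothetical DLQ depth samples while the queue is draining.
draining = [100, 80, 60, 40]
print(counter_increase(draining))  # 180.0
print(gauge_delta(draining))       # -60
```

The queue shrank by 60 messages, yet the counter-style calculation reports an increase of 180, which is why a plain offset subtraction is the safer choice for a gauge.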

For an alert that fires when the slope is positive over a 10m window, you would write an expression such as:

(dlq-count - dlq-count offset 10m) > 0

You can make the alert percentage-based by dividing by the current queue size. This example would alert if the queue size increases by more than 1% over a 10min time window.

((dlq-count - dlq-count offset 10m) / dlq-count) > 0.01
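If it helps to check the arithmetic outside Grafana, the two expressions above can be sketched in Python roughly as follows. This assumes the metric is available as a sorted list of (timestamp in seconds, value) pairs; the helper names and the sample data are hypothetical.

```python
from bisect import bisect_right


def value_at(samples, ts):
    """Return the latest sample value at or before ts, or None if none exists.

    samples is a list of (timestamp_seconds, value) sorted by timestamp.
    """
    idx = bisect_right([t for t, _ in samples], ts)
    return samples[idx - 1][1] if idx > 0 else None


def window_delta(samples, window_s):
    """Change in value between the newest sample and the one window_s earlier."""
    now_ts, now_val = samples[-1]
    then_val = value_at(samples, now_ts - window_s)
    return None if then_val is None else now_val - then_val


def relative_growth(samples, window_s):
    """Window delta divided by the current value (None if undefined)."""
    delta = window_delta(samples, window_s)
    current = samples[-1][1]
    if delta is None or current == 0:
        return None
    return delta / current


# Hypothetical DLQ depth, one sample every 5 minutes.
dlq = [(0, 200), (300, 200), (600, 204), (900, 208)]
delta = window_delta(dlq, 600)        # 208 - 200 = 8
growth = relative_growth(dlq, 600)    # 8 / 208, roughly 0.038
print(delta > 0)        # True: slope is positive over 10 minutes
print(growth > 0.01)    # True: grew by more than 1%
```

Note that the percentage form is undefined when the queue is empty, so the sketch returns None in that case rather than dividing by zero.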

If you have an acceptable failure rate (e.g. 0.01%), this could be extended to become an error rate alert by calculating the current error rate. This requires knowing how many messages were processed successfully as well. This can be useful for setting multiple alerts with different priorities (e.g. P1, P2, etc.) based on the error rate, so you know how fast you should intervene, if at all.

((dlq-count - dlq-count offset 10m) / (processed-msg-count + dlq-count)) > 0.0001
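As a rough illustration of the tiered idea, the error rate can be computed and mapped to a priority like this in Python. The thresholds, priority labels, and input numbers are hypothetical; only the 0.0001 (0.01%) floor comes from the example above.

```python
def error_rate(dlq_now, dlq_before, processed):
    """Share of messages that landed in the DLQ over the window.

    Returns 0.0 when there is no traffic at all, to avoid dividing by zero.
    """
    total = processed + dlq_now
    return (dlq_now - dlq_before) / total if total else 0.0


# Hypothetical tiers, highest first: (minimum error rate, priority label).
TIERS = [(0.01, "P1"), (0.001, "P2"), (0.0001, "P3")]


def priority(rate):
    """Return the first tier whose threshold the rate exceeds, else None."""
    for threshold, label in TIERS:
        if rate > threshold:
            return label
    return None


rate = error_rate(dlq_now=130, dlq_before=100, processed=9870)
print(rate)            # 30 / 10000 = 0.003
print(priority(rate))  # P2
```

In Grafana itself this would translate to one alert rule per tier, each with its own threshold on the same expression.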