I am trying to configure an alert which fires whenever the slope of this queue is not negative. How would I do that?
I previously saw a post that uses the diff() operator in the expression, but that approach doesn’t seem possible anymore in Grafana 9: I can’t make the query a range, and I can only select the series “A”.
I do not think the built-in math functions in Grafana allow one to alert on the rate of change (slope) of a time series. It is best to use the tools offered by your datasource.
You can use a time offset to get the change in queue size over a time window. Many of the standard Grafana functions will assume the DLQ count metric is a counter, rather than a gauge, and so will assume a reduction in queue size is simply a reset of the counter. Counters theoretically only go up, but in practice they get reset (e.g. on software restarts), so applying these functions to a gauge can give extremely inaccurate results.
For an alert that fires when the queue has a positive slope over a 10m window, you would write an expression such as:
(dlq-count - dlq-count offset 10m) > 0
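If the datasource is Prometheus (an assumption; the offset syntax above suggests it), a gauge-safe alternative is PromQL's deriv() function, which estimates the per-second slope of a gauge with a linear regression over the window. This is only a sketch: the metric name dlq_count is hypothetical and written with an underscore because an unquoted PromQL metric name cannot contain a hyphen.

```promql
# Per-second slope of the queue size, fitted over the last 10 minutes.
# ">= 0" matches the "not negative" wording of the question;
# use "> 0" to fire only while the queue is actually growing.
deriv(dlq_count[10m]) >= 0
```

Because the regression uses every sample in the window, it is less sensitive to a single noisy sample than comparing just the two endpoints.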
You can make the alert percentage-based by dividing the change by the current queue size. For example, a threshold of 0.01 on that ratio would alert if the queue size increases by more than 1% over a 10-minute window.
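A percentage-based version might look like the following. It is a sketch that assumes a Prometheus datasource and the hypothetical metric name dlq_count (underscore rather than hyphen, since unquoted PromQL metric names cannot contain hyphens).

```promql
# Growth of the queue over the last 10 minutes, as a fraction of its current size.
# Fires when the queue has grown by more than 1% (0.01).
(dlq_count - dlq_count offset 10m) / dlq_count > 0.01
```

When the queue is empty the division yields NaN or -Inf, neither of which passes the comparison, so an empty queue does not trigger the alert.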
If you have an acceptable failure rate (e.g. 0.01%), this could be extended to become an error rate alert by calculating the current error rate. This requires knowing how many messages were processed successfully as well. It can be useful to set multiple alerts with different priorities (e.g. P1, P2, etc.) based on the error rate, so you know how fast you should intervene, if at all.
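One way to sketch such an error rate alert in PromQL is shown here. Everything named is an assumption: dlq_count is a hypothetical gauge for the queue size, messages_processed_total is a hypothetical counter of successfully processed messages, and both series are assumed to carry the same labels so the arithmetic can match them.

```promql
# Failures in the window: growth of the DLQ, floored at 0 so that
# draining the queue is not counted as "negative failures".
# Error rate = failures / (failures + successes) over the last 10 minutes.
  clamp_min(dlq_count - dlq_count offset 10m, 0)
/
  (
      clamp_min(dlq_count - dlq_count offset 10m, 0)
    + increase(messages_processed_total[10m])
  )
> 0.0001  # 0.01% acceptable failure rate
```

For tiered priorities, the same expression can be reused in separate alert rules with different thresholds, for instance a higher threshold for P1 and a lower one for P2. Note that messages removed from the DLQ during the window make this estimate undercount failures.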