Avg-Alerts trigger too late and unnecessarily?

loba · April 27, 2018, 8:47am

Hi all,
we use Prometheus (v1.7.1) as our data source and have configured a Grafana (v4.4.3) graph panel to plot the “up” state for some of our targets (value “1” meaning the target is up, value “0” meaning it’s down).

We then added an alert to this graph to be triggered whenever a target becomes unavailable. However, Prometheus sometimes fails to gather metrics for one of the targets (possibly because of timeouts) and flags the up state of that target as down (the scrape interval is 60 sec). We tried building an avg() alert so that we are only alerted when the target goes away for more than one minute.

Our understanding is as follows: Alert when the average value of any target in query A from 5 minutes ago until now drops below 0.5. Given the 60 sec scrape interval in our Prometheus setup, we want to be alerted when a target is unresponsive for more than 2 minutes.

As you can see, one target “went down” at 04:35 and came back up at 04:36. Ideally, we don’t want to be alerted in that case. However, the above alert triggers shortly after 04:40 and goes back to the OK state after 04:41. To us it seems that the average over the past five minutes (at the time of alerting) was about 0.8.
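The arithmetic behind that 0.8 estimate can be sketched in a few lines of Python. This is only an illustration under one assumption: the 5-minute window contains exactly five samples, one per 60 sec scrape. Grafana may query Prometheus with a different step, so the real number of points per window can differ. The names `window_average` and `should_alert` are made up for this sketch and are not Grafana or Prometheus functions.

```python
def window_average(samples):
    """Mean of the 0/1 'up' samples that fall inside the alert window."""
    return sum(samples) / len(samples)


def should_alert(samples, threshold=0.5):
    """Mimics an 'avg() IS BELOW threshold' condition on one window."""
    return window_average(samples) < threshold


# Assumption: five samples per 5-minute window (60 sec scrape interval).
one_missed_scrape = [1, 1, 0, 1, 1]     # the 04:35 blip
two_missed_scrapes = [1, 0, 0, 1, 1]
three_missed_scrapes = [0, 0, 0, 1, 1]  # down for more than 2 minutes

for name, window in [
    ("one missed scrape", one_missed_scrape),
    ("two missed scrapes", two_missed_scrapes),
    ("three missed scrapes", three_missed_scrapes),
]:
    print(name, window_average(window), should_alert(window))
```

Under that assumption a single missed scrape gives an average of 0.8 and should not trigger the alert. Only three or more missed scrapes in the window push the average below 0.5.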

Why does the alert trigger at all and why so late? How do we fix that?

Thanks in advance,
Lorenz

loba · May 3, 2018, 10:49am

Sadly, we still haven’t figured out why Grafana alerts work the way they do.

However, we found a workaround for our specific situation: We use avg_over_time(up{...}[5m]) so that Prometheus calculates the average in the graphed metric itself, and we switched our alert to trigger when the last() value of that query from 1m ago until now IS BELOW 0.5. This does the trick, although it’s not our preferred solution.
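For anyone wanting to see why this workaround behaves as intended, here is a rough Python simulation of it. It is a sketch, not how Prometheus or Grafana are implemented: it assumes one sample per minute and a trailing window of five samples, and the helper names `trailing_average` and `last_is_below` are hypothetical stand-ins for avg_over_time(...[5m]) and the last() IS BELOW condition.

```python
def trailing_average(samples, window=5):
    """For each sample, the mean of it and up to window-1 samples before it."""
    smoothed = []
    for i in range(len(samples)):
        chunk = samples[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


def last_is_below(series, threshold=0.5):
    """Stand-in for a 'last() IS BELOW threshold' alert condition."""
    return series[-1] < threshold


# Assumption: one 0/1 'up' sample per minute.
blip = [1, 1, 1, 0, 1, 1, 1]    # a single failed scrape
outage = [1, 1, 1, 1, 0, 0, 0]  # three failed scrapes in a row

print(last_is_below(trailing_average(blip)))    # False: last window averages 0.8
print(last_is_below(trailing_average(outage)))  # True: last window averages 0.4
```

Because the averaging happens in the query, the alert only has to look at the most recent point, so it no longer depends on how Grafana reduces a whole window of raw 0/1 values.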
