Grafana Alerts getting sent multiple times

We use Grafana to manage our logs, and we use alerts to be notified of every ERROR log.

We use this expression:
sum by (line) (count_over_time({swarm_service="api"} |= "ERROR" | pattern "<line>" [2m])) > 0

with an “Alert evaluation behavior” of 1 min.
The “Rule group evaluation interval” is 1 min.
This leads to three alerts being sent for a single error line. We tried tweaking the alert duration and the count_over_time duration, but this didn’t change anything.
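
For reference, this is roughly how those two settings fit together when the rule is written out as a Loki-ruler-style rule group (a simplified sketch, not our exact config; the group name is a placeholder):

groups:
  - name: api-errors                # placeholder group name
    interval: 1m                    # “Rule group evaluation interval”
    rules:
      - alert: API Error
        expr: >-
          sum by (line) (count_over_time({swarm_service="api"} |= "ERROR"
          | pattern "<line>" [2m])) > 0
        for: 1m                     # how long the condition must hold before the alert fires
        annotations:
          description: '{{ $labels.line }}'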

We are also not able to assess the impact of changing the duration values; for example, setting “Alert evaluation behavior” to a shorter or longer value seems to have no measurable effect. We also couldn’t find any documentation that helps here.

We use Grafana Cloud and this used to work, but updates seem to have broken it.

We believe this is a bug in Grafana Cloud, but we would greatly appreciate help debugging it.

Thanks.

Hi, I’m having the same issue with Amazon Managed Grafana. Every alert that I created is firing 3 times, sending 3 messages.
Did you manage to fix this issue?

No, sadly not. I tried to find some pattern or mistake, but wasn’t able to.

Still hopeful some community member has an idea :slight_smile:

Do you see three alerts in Grafana UI, or just three notifications?

It’s a single alert with three evaluations and three notifications sent out. I’ve created an issue with screenshots over here: Alerting: Alerts evaluated 3 times and sending 3 notifications · Issue #68652 · grafana/grafana · GitHub

I also see one alert in Grafana UI, but three notifications.

My most reliable way of reproducing it is to have two instances of an alert overlap in time, so that alert instance 1 is still firing while alert instance 2 moves from pending to firing.

This is an example of two log lines, each of which triggered one alert instance. One was printed at 2023-05-17 17:59:33,066, the other at 2023-05-17 17:59:53,719 (20 seconds apart).

These are the notifications I received (both over email and slack integration):

[FIRING:1]  (API Error 2023-05-17 17:59:33,066 - ERROR - Trigger test error)
[FIRING:1]  (API Error 2023-05-17 17:59:33,066 - ERROR - Trigger test error)
[FIRING:1]  (API Error 2023-05-17 17:59:33,066 - ERROR - Trigger test error)
[FIRING:1]  (API Error 2023-05-17 17:59:33,066 - ERROR - Trigger test error)
[FIRING:1]  (API Error 2023-05-17 17:59:53,719 - ERROR - Trigger test error)
[FIRING:1]  (API Error 2023-05-17 17:59:33,066 - ERROR - Trigger test error)
[FIRING:1]  (API Error 2023-05-17 17:59:53,719 - ERROR - Trigger test error)
[FIRING:1]  (API Error 2023-05-17 17:59:53,719 - ERROR - Trigger test error)

One was received 5 times, the other 3 times.

I have also had this happen when just one instance was triggering, e.g. sending 2 notifications.

Thanks! Looking at the screenshots, is it possible you are using $values in a custom label? That would explain the behaviour you are seeing! :slightly_smiling_face:

You should also avoid using the value of the query in labels because it’s likely that every evaluation of the alert will return a different value, causing Grafana to create tens or even hundreds of alerts when you really only want one.
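
For example (a made-up snippet, not your rule), labels like this would create a new alert instance on almost every evaluation, because the templated value keeps changing:

labels:
  severity: critical
  # $values.A changes on every evaluation, so each evaluation
  # effectively produces a different alert instance
  current_value: '{{ $values.A }}'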

I don’t think I am using it in labels. We have it in annotations though. This is the alert rule’s YAML:

alert: API Error
for: 59s
annotations:
  description: '{{ $labels.line }}'
  summary: Error in API procedure
  '': ''
labels:
  '': ''
expr: >-
  sum by (line) (count_over_time({swarm_service="main_api_main_api"} |= "ERROR"
  | pattern "<line>" [2m])) > 0

Thanks for the documentation link and help in general!

Are you running Grafana in HA mode? If so, alerts will be evaluated once per replica (which would explain seeing each alert 3 times in the screenshot), but only one notification should be sent. If 3 notifications are being sent for the same alert, then I think Grafana has been misconfigured. I see you are using Amazon Managed Grafana, and I don’t know whether they use HA or not.
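
For anyone self-hosting who finds this thread: notification deduplication only works if the replicas can gossip with each other, which is configured roughly like this (a hypothetical docker-compose sketch with placeholder host names; managed offerings configure this themselves):

services:
  grafana:
    image: grafana/grafana
    environment:
      # each replica must know its peers so that alert notifications are deduplicated
      GF_UNIFIED_ALERTING_HA_PEERS: "grafana-1:9094,grafana-2:9094,grafana-3:9094"
      GF_UNIFIED_ALERTING_HA_LISTEN_ADDRESS: "0.0.0.0:9094"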

Yeah, thank you, I think you may be right. I’m pretty sure they are using some kind of HA. Hope they can fix it over there :slight_smile:

We’re using Grafana Cloud (unlike atze2341), and I’m not sure whether I have access to this information.

Hi @grillpfanne! :wave:

Can you share a screenshot of the firing alerts in Grafana UI, and also the notifications? I would like to see the labels to understand if these are different alerts (from the same rule) or duplicated notifications for the same alert.

Thanks!

I fired two test errors and received two notifications for every error.

Errors in Grafana UI:

First mail received with the alerts (I included the received time in the top right, 10:36)

Second mail received with the same errors again, 2 minutes later:

Based on what you wrote earlier, I do see the “line=…” label now, but since I posted the complete YAML above, I don’t know where it comes from.

Thank you!

Hi! :wave: Thanks for the screenshots! I think I understand where the confusion is here.

I think this is working as intended. You are asking Grafana to create an alert for each ERROR log. If I look at the screenshots Grafana is doing just that. You have two alerts: the first alert is for an error log at time 2023-05-19 08:34:45,879 and the second alert is for a different error log at time 2023-05-19 08:35:28,697.

I think the question is then: why does each email contain both alerts? The answer is that this is how grouping is configured in your Alertmanager configuration. If you want one alert per email, you’ll need to disable grouping by changing it to Disable (...).
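
In Alertmanager terms, the grouping in your notification policy corresponds to something like this (a rough sketch, not your actual config; the special value '...' means “group by all labels”, so each distinct alert gets its own notification):

route:
  receiver: email          # placeholder receiver name
  group_by: ['...']        # '...' = group by all labels, i.e. one notification per alert
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h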

Hi George,
I’ve disabled grouping, and I still receive multiple notifications per alert:

I again triggered two error logs.

Two instances are firing:

I get 4 emails:

Mail 1: alert 1, sent 12:59:

Mail 2 alert 2, sent 12:59:

Mail 3: alert 1, sent 13:01:

Mail 4: alert 2, sent 13:01:

It’s completely possible that our config is wrong somewhere, but from my current understanding we shouldn’t get multiple notifications.

Thanks!

This is our notification policy:

(could only upload 5 images per post)

Hi! :wave: I understand you are using Grafana Cloud.

  1. Do you know if you are using Grafana Managed Alerts or Mimir alerts? The first screenshot looks like Grafana Managed Alerts to me, but I just wanted to check.

  2. If you are using Grafana Managed Alerts, are you using the Grafana Cloud Alertmanager? The emails look like you are, but again I just wanted to check.

  3. If both 1 and 2 are correct, did you select a preference in “Sends alert to”, and if so which one did you choose? You can find this in the Admin page under Alerting.

  1. We have them configured under “Mimir / Cortex / Loki”.
    We still have one GrafanaCloud alert configured, but it is for a different service, and its state history shows no state changes for the last 6 months.
  2. We configure everything using the browser UI: Alerting > Alert rules (domain.grafana.net/alerting/list). We have a Loki data source sending us logs, and the alerts are configured on that source.
  3. I don’t think we are using this. We haven’t selected a preference there.

Thanks!

I have the same issue. No special configuration.

AWS Managed Grafana (provisioned Jan 2024).

SiteWise IoT Data source → Grafana Alerts → SNS → Email

It fires 3 times for me as well, both when the threshold is breached and when it returns to normal.

We ended up changing the group wait from 1s to 30s and the group interval from 1s to 5m (the default values). This seems to have fixed it (it has been running for a few months now, without duplicate alerts).
I’m not really sure what these settings even do, since we have grouping disabled. We spoke a bit with customer support, and the conclusion was “increase the numbers”, which seems to have helped in our case.
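
In Alertmanager terms, the two timing options we changed look roughly like this (a sketch, not our full policy):

group_wait: 30s      # how long to wait before sending the first notification for a new group
group_interval: 5m   # minimum time between notifications when new alerts join an existing group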
