We use Grafana to manage our logs, and we use alerts to be notified of every ERROR log.
We use this expression: sum by (line) (count_over_time({swarm_service="api"} |= "ERROR" | pattern "<line>" [2m])) > 0
with an “Alert evaluation behavior” of 1 min.
The “Rule group evaluation interval” is 1 min.
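In Loki ruler YAML terms, the rule is roughly equivalent to the sketch below (group and alert names are placeholders; we actually configure everything through the browser UI):

```yaml
groups:
  - name: api-error-alerts          # placeholder group name
    interval: 1m                    # the “Rule group evaluation interval”
    rules:
      - alert: ApiErrorLine         # placeholder alert name
        expr: |
          sum by (line) (
            count_over_time({swarm_service="api"} |= "ERROR" | pattern "<line>" [2m])
          ) > 0
        for: 1m                     # the 1 min “Alert evaluation behavior”
        labels:
          severity: warning         # placeholder label
        annotations:
          summary: "ERROR line logged by the api service"
```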
This leads to three alerts being sent for a single error line. We tried tweaking the alert duration and the count_over_time range, but that didn't change anything.
We are also unable to assess the impact of changing any of the duration values; for example, setting “Alert evaluation behavior” to a shorter or longer value has no measurable effect. We also couldn't find any documentation that helps here.
We use Grafana Cloud and this used to work, but updates seem to have broken it.
We believe this is a bug in Grafana Cloud, but greatly appreciate help to try to debug this.
Hi, I'm having the same issue with Amazon Managed Grafana. Every alert I create fires 3 times and sends 3 messages.
Did you manage to fix this issue?
I also see one alert in Grafana UI, but three notifications.
My most reliable way to reproduce it is to have two instances of an alert overlap in time, so that alert instance 1 is still firing while alert instance 2 moves from pending to firing.
Here is an example of two log lines, each of which triggered one alert instance. One was printed at 2023-05-17 17:59:33,066, the other at 2023-05-17 17:59:53,719 (about 20 seconds apart).
These are the notifications I received (over both the email and Slack integrations):
Thanks! Looking at the screenshots, is it possible you are using $values in a custom label? That would explain the behaviour you are seeing!
You should also avoid using the value of the query in labels because it’s likely that every evaluation of the alert will return a different value, causing Grafana to create tens or even hundreds of alerts when you really only want one.
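To illustrate, here is a schematic snippet (the `A` ref ID and the label/annotation names are made-up examples, not taken from your rule):

```yaml
# Anti-pattern: the query value in a custom label. Labels define an alert's
# identity, so a value that changes between evaluations creates a "new"
# alert instance each time.
labels:
  severity: warning
  error_count: "{{ $values.A }}"    # avoid this
---
# Safer: keep labels stable and surface the value in an annotation instead.
labels:
  severity: warning
annotations:
  description: "Error count is {{ $values.A }}"
```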
Are you running Grafana in HA mode? If so, alerts will be evaluated once per replica (which would explain seeing each alert 3 times in the screenshot), but just one notification should be sent. If 3 notifications are being sent for the same alert then I think Grafana has been misconfigured. I see you are using Amazon Managed Grafana, and I don’t know if they use HA or not.
Can you share a screenshot of the firing alerts in Grafana UI, and also the notifications? I would like to see the labels to understand if these are different alerts (from the same rule) or duplicated notifications for the same alert.
Hi! Thanks for the screenshots! I think I understand where the confusion is here.
I think this is working as intended. You are asking Grafana to create an alert for each ERROR log. If I look at the screenshots Grafana is doing just that. You have two alerts: the first alert is for an error log at time 2023-05-19 08:34:45,879 and the second alert is for a different error log at time 2023-05-19 08:35:28,697.
I think the question is then: why does each email contain both alerts? The answer is that this is how grouping is configured in your Alertmanager configuration. If you want one alert per email, you'll need to disable grouping by changing it to Disable (...).
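To make the grouping part concrete, here is a sketch of the relevant piece of an Alertmanager route (the receiver name and grouping labels are examples, not your actual configuration):

```yaml
route:
  receiver: email-and-slack         # example receiver name
  group_by: [alertname]             # alerts sharing these labels are batched
                                    # into a single notification
  group_wait: 30s
  group_interval: 5m

# “Disabling” grouping means using the special value '...', which groups by
# all labels, so every distinct alert instance gets its own notification:
#   group_by: ['...']
```

In the Grafana UI this corresponds to the “Group by” field on the notification policy, if I remember correctly.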
1. Do you know if you are using Grafana Managed Alerts or Mimir alerts? The first screenshot looks like Grafana Managed Alerts to me, but I just wanted to check.
2. If you are using Grafana Managed Alerts, are you using the Grafana Cloud Alertmanager? The emails look like you are, but again I just wanted to check.
3. If both 1 and 2 are correct, did you select a preference in “Sends alert to”, and if so which one did you choose? You can find this in the Admin page under Alerting.
We have them configured under “Mimir / Cortex / Loki”.
We still have one Grafana Cloud alert configured, but it is for a different service, and its state history shows no state changes for the last 6 months.
We configure everything through the browser UI: Alerting > Alert rules (domain.grafana.net/alerting/list). We have a Loki data source sending us logs, and the alerts are configured on that data source.
I don’t think we are using this. We haven’t selected a preference there.
We ended up changing the group wait from 1s to 30s and the group interval from 1s to 5m (the default values). This seems to have fixed it; it has been running for a few months now without duplicate alerts.
I'm not really sure what these settings even do, since we have grouping disabled. We spoke a bit with customer support, but the conclusion was “increase the numbers”, which seems to have helped in our case.
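For anyone else who lands here, this is our understanding of what the two settings mean, in Alertmanager route terms (sketch; the receiver name is a placeholder):

```yaml
route:
  receiver: our-contact-point       # placeholder
  group_wait: 30s     # how long to wait after the first alert of a group
                      # arrives before sending the initial notification
  group_interval: 5m  # how long to wait before sending a notification about
                      # changes to a group that has already been notified
```

Our working theory (not confirmed by support) is that with both values at 1s, every evaluation that updated an alert could flush a notification almost immediately, so overlapping alert instances looked like duplicate notifications; the default values batch those updates together.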