Alerts not getting fired because of Alert Grouping

An alert rule goes into the Firing state when one or more of the alert instances it contains go into the Alerting state. Consider, for instance, that an alert rule went into the Firing state because three of the alert instances contained inside it went into the Alerting state. If you go to Alerting → Groups you will find the group with 3 instances inside it. Something like this


An alert is also triggered for the same, with the 3 alerts combined together

Now, consider that after some time another alert instance goes into the Alerting state, while the earlier three alert instances are still in the Alerting state. In this scenario, I’ve observed that no new alert is triggered even though the number of alert instances went from three to four. The alert rule stays in the Firing state, but no new alert is fired.

How can I trigger a new alert whenever there’s a change in the group, so that it reflects the latest status of the group? Also, I don’t intend to remove alert grouping.

This is what my default notification policy looks like

Also, I’m using a notification template which looks something like this

{{ define "myalert" }}
    {{ if eq (.CommonAnnotations.Metric) "CPU Utilization" }}
      CPU Utilization Alert. There is/are {{len .Alerts.Firing}} instance/s that has/have exceeded 75% CPU Utilization:
    {{end}}

    {{range .Alerts.Firing}}
      {{template "alertinfo" .}}
    {{end}}
{{end}}

{{define "alertinfo"}}
  {{.Annotations.Message}}
{{end}}

Hi! :wave:

Now, consider that after some time another alert instance goes into the Alerting state, while the earlier three alert instances are still in the Alerting state. In this scenario, I’ve observed that no new alert is triggered even though the number of alert instances went from three to four. The alert rule stays in the Firing state, but no new alert is fired.

I think you might be confusing alerts and notifications. There is a new alert as the number of alerts went from 3 to 4 (as you mentioned).

Your configured Group Interval (15 seconds) means that when a new alert fires, or an existing alert resolves, the Grafana Alertmanager waits 15 seconds and then sends a notification. In your case, this is sent to Pagerduty.
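
For reference, the grouping and timing options from your screenshot map roughly onto a file-provisioned notification policy like the sketch below. This is only a rough illustration: the contact point name pagerduty, the group_wait, and the repeat_interval are assumed values, not taken from your screenshot.

apiVersion: 1
policies:
  - orgId: 1
    receiver: pagerduty          # assumed name of the Pagerduty contact point
    group_by:
      - grafana_folder
      - alertname
    group_wait: 15s              # wait before the first notification for a new group (assumed)
    group_interval: 15s          # wait before notifying about changes to an existing group
    repeat_interval: 4h          # assumed; resend while the group is still firing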

However, if these alerts belong to the same group, then I understand that Pagerduty treats them as the same incident. That means Pagerduty updates your existing incident, instead of creating a new one.

If you want a separate incident per alert you should disable grouping by replacing grafana_folder and alertname with the special ... value, as shown in the screenshot below:
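
In the same hypothetical provisioning sketch as above, that change would look roughly like this:

policies:
  - orgId: 1
    receiver: pagerduty    # assumed contact point name
    group_by:
      - '...'              # special value: group by all labels, i.e. one group (and one notification) per alert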

Hi @georgerobinson. Thanks for replying

If the number of alerts in the group went from 3 to 4, then the PagerDuty incident should have been updated, or a new incident should have been created, but neither is happening. All I can see is the older incident.

Let me illustrate it with an example, for better clarity.
We are grouping alerts on the basis of grafana folder and alertname. Say 2 alerts (call them A and B) are firing because they breached the 75% CPU utilization threshold, so in the incident, with the help of the notification template, I display the instance IDs and CPU utilization. A single incident is created in PagerDuty, since the alerts are grouped, containing info about A and B. But say after some time another new alert fires (call it C) because another instance breached the threshold, and of the earlier 2 alerts that were firing, 1 gets resolved (say B gets resolved). In that case the incident still remains the same in PagerDuty and doesn’t get updated, which means I still see info about A and B in the PagerDuty incident. I also can’t see any change in the incident creation time, and the contents of the incident haven’t changed. But if I go to Alerting → Groups in Grafana, I can see under the CPU Utilization group that A and C are firing and B went back to Normal. So it’s clear that from Grafana’s side things are fine with respect to alert rule evaluation, but then why isn’t the latest status being reflected in the notification?

If the PagerDuty incident had been updated as you mentioned, then there would have been no problem at all. And as mentioned earlier, I don’t intend to disable grouping, because then a separate incident would be lodged for every alert instance, which is a hassle.

I also had another approach in mind for this. I was thinking of adding the creation time as a dynamic label on the alert instances, and then simply grouping on the basis of alertname and creation time. In this case A and B fired together, so one incident would be lodged for them, and C fired later, so a separate new incident would be created for it.
Unfortunately, I came across .StartsAt for this scenario, but also found that it doesn’t work with labels and annotations.

I would run Grafana with debug logs just to make sure updated notifications are being sent to Pagerduty.

Hi @georgerobinson, thanks for replying back again.

Can you please clarify your point a bit more? Which debug logs are you referring to here, and how does running Grafana with those logs make sure that the notifications are updated in PagerDuty?

@georgerobinson, Awaiting your response

You’ll want to look for Notify success log lines for the alerts in question, and check that they are sent to Pagerduty when the group changes (either due to a new alert firing or a firing alert resolving).

@georgerobinson I think there’s some confusion on your end. I’m not analysing log lines in my alert rule here; for CPU Utilization, my datasource is AWS CloudWatch Metrics, so Notify success log lines make no sense to me. I had already communicated the functionality I intend to achieve here.

Let me illustrate it with an example, for better clarity.
We are grouping alerts on the basis of grafana folder and alertname. Say 2 alerts (call them A and B) are firing because they breached the 75% CPU utilization threshold, so in the incident, with the help of the notification template, I display the instance IDs and CPU utilization. A single incident is created in PagerDuty, since the alerts are grouped, containing info about A and B. But say after some time another new alert fires (call it C) because another instance breached the threshold, and of the earlier 2 alerts that were firing, 1 gets resolved (say B gets resolved). In that case the incident still remains the same in PagerDuty and doesn’t get updated, which means I still see info about A and B in the PagerDuty incident. I also can’t see any change in the incident creation time, and the contents of the incident haven’t changed. But if I go to Alerting → Groups in Grafana, I can see under the CPU Utilization group that A and C are firing and B went back to Normal. So it’s clear that from Grafana’s side things are fine with respect to alert rule evaluation, but then why isn’t the latest status being reflected in the notification?

I either want a fresh new notification to be triggered on PagerDuty if there is any change in the group due to new alert instances firing or existing ones resolving, or at least the notification should get updated in PagerDuty to reflect the latest status.

You’ll want to check for Notify success log lines in Grafana to see if the Pagerduty incident is getting updated or not. It should get updated, but you’re saying it’s not working.

I either want a fresh new notification to be triggered on PagerDuty if there is any change in the group due to new alert instances firing or existing ones resolving

I thought you didn’t want this, as you said in your previous message:

And as mentioned earlier, I don’t intend to disable grouping, because then a separate incident would be lodged for every alert instance, which is a hassle.

?

Hi @georgerobinson, thanks for your time and response.
First things first, I’m using Amazon Managed Grafana. So in AMG, can you tell me where we can check the log lines? I’m completely clueless about it.

Second thing: I’m okay with having a new notification when there’s a change in the alert instances in the group. But I don’t intend to disable grouping, because if I do that and say 50 instances are down, then I will get 50 different incidents in PagerDuty. I wish to have a setup that gives me just one notification for those 50 instances going down, and this can be achieved through grouping. But say half an hour later another 10 go down; then I want to be notified about that too. That part is not happening. So could you help me with it?

I’m not sure, you’ll need to ask Amazon customer support I’m afraid.

I’m okay with having a new notification when there’s a change in the alert instances in the group. But I don’t intend to disable grouping, because if I do that and say 50 instances are down, then I will get 50 different incidents in PagerDuty.

OK, so from Grafana’s perspective it’s working as intended. It updates the Pagerduty incident when the group changes. If you want a new notification from Pagerduty when an incident is updated then this is something you’ll need to configure in Pagerduty. It’s not something you can control from Grafana.

Hi @georgerobinson, thanks again for your time and response.

Are you sure that whenever there’s a change in the group (say 2 were alerting initially in that group and after some time it got to 4 in total), the existing PagerDuty incident is updated from Grafana’s side, as you stated?

If you can assure me of this, then I will raise a ticket with PagerDuty, because I didn’t find any configuration on PagerDuty’s side which I can enable/disable to tackle this issue.

Grafana uses the Events v2 API. Here is an example where I created an incident from an alert, and then Grafana updated the incident because another alert fired. You can see here it changes from FIRING:1 to FIRING:2:
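
For anyone following along, an Events v2 trigger event is shaped roughly like the sketch below (all values here are illustrative, not taken from this incident). As far as I can tell, Grafana reuses the same dedup_key for every notification of a group, which is why the second notification updates the open incident instead of creating a new one:

routing_key: "<integration key of the Pagerduty service>"    # illustrative placeholder
event_action: trigger
dedup_key: "<stable key for this alert group, reused on every update>"    # illustrative placeholder
payload:
  summary: "[FIRING:2] ..."                                   # illustrative; the incident title you see
  source: Grafana
  severity: critical
  custom_details:
    firing: "rendered notification template output for the firing alerts"   # illustrative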

@georgerobinson, thanks again for your time and response.
I’m still not sure if it has actually been updated. Can you share the whole timeline of this incident as a screenshot? Secondly, what you are seeing as [FIRING:1] is the description, i.e. the content of the PagerDuty incident, and to it an alert named [FIRING:2] has been attached. So just one alert is attached here. Don’t get confused by [FIRING:1] and [FIRING:2]. If it had actually been updated, then you should have another timestamp, say at 6:30 AM, saying that

Alert xyz was automatically added to this incident

Have a look at my incident; this might clear things up more

If you look at it, it’s the same as yours, and no update has taken place.

So the incident was created from the first notification, which contained one firing alert for foo=bar with the title [FIRING:1]. Then a second alert, bar=baz, fires in Grafana, and it updates the incident with a second notification [FIRING:2] containing both alerts: foo=bar and bar=baz.

@georgerobinson, thanks for your response. I got your point in the above answer. But can you show me a live example, so that I can also be convinced? That’s why I asked you to share the complete screenshot of the timeline section, because the screenshot you shared earlier doesn’t show the updated alerts. How is it possible that the incident was triggered at 6:22 AM by, say, foo=bar and then also got updated at 6:22 AM itself by bar=baz?

Also, in the screenshot which you shared, can you tell me why there isn’t anything mentioned like

Alert [FIRING:1] xyz was automatically added to this incident

There’s no entry about the foo=bar alert. All I can see is that the bar=baz alert was added

Alert [FIRING:2] xyz was automatically added to this incident

Also, in the screenshot which you shared, can you tell me why there isn’t anything mentioned like

Alert [FIRING:1] xyz was automatically added to this incident

It seems that the original notification that triggers the incident is not recorded as “automatically added to this incident”. Only subsequent notifications seem to have this message. That said, the original notification is still there when you click “View Message”.

Here is the foo=bar alert in the original message:

It’s then repeated in the second notification:

My recommendation would be to keep testing this further if you’re still not convinced, but from what I can see in my testing Grafana does send updates for existing Pagerduty incidents.

@georgerobinson thanks again for your response and time.
You are correct on that point. But could you do me a favour and share the timeline of this incident, so that things become clearer to me? I have been asking for this in my previous responses.
By the timeline of the incident, I mean this:

But please share the whole timeline, with all the timestamps and not just a single timestamp; it will be very helpful for me.

Yes, that is the full timeline. I resolved the incident at 6:26 AM in the user interface, but all the events from Grafana occurred at 6:22 AM, as I had a 30 second group_wait and group_interval.

I would like to thank @georgerobinson for his time and constant support. I would like to close this topic with the following conclusions:
In PagerDuty, the incident summary is what doesn’t get updated, but the alerts contained inside it are updated in real time and will give you the current status whenever you open an incident in PagerDuty. Consider that at 1000 hours a PagerDuty incident contained 2 alerts, and later, at 1600 hours, it contained 5 alerts. To check the current status you can open that specific PagerDuty incident, scroll down and go to Alerts.

Now, if you have enabled alert grouping in Grafana and you wish to be notified whenever there is any change in the group, then the only option is to disable alert grouping on the Grafana instance, so that you get a fresh new PagerDuty incident for every new alert triggered and are notified every time. You can do that by going to Alerting → Notification Policies in Grafana.

Also, disable grouping in PagerDuty’s service settings so that PagerDuty doesn’t group similar incidents, and you get notified for every alert/incident lodged.

Please note: disable alert grouping only if your system doesn’t trigger a large number of alerts, because otherwise you will receive hundreds of alerts when something major is down, and in that case alert grouping is the perfect solution to reduce noise.