Grafana on Kubernetes - Notification duplicate in a HA setup

Reposting here since I didn’t get any answers on Stack.

I’ve set up Grafana by deploying the official Helm chart with ArgoCD, and I have 3 Grafana pods running. To achieve HA and avoid duplicate notifications, I configured the unified_alerting section of grafana.ini like so:

unified_alerting:
    enabled: true
    ha_peers: "grafana-headless.grafana.svc.cluster.local:9094"
    ha_peer_timeout: "30s"
    ha_listen_address: "${POD_IP}:9094"
    ha_advertise_address: "${POD_IP}:9094"
    ha_reconnect_timeout: 2m
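
For `${POD_IP}` to expand, the pod spec must actually expose the pod's IP as an environment variable. A minimal sketch using the Kubernetes downward API — how you wire this through the chart's values (e.g. an `extraEnv`-style hook) depends on the chart version, so treat this as the raw pod-spec form rather than chart-specific syntax:

```yaml
# Downward-API injection: populates POD_IP with the pod's cluster IP
# so grafana.ini can interpolate ${POD_IP} at startup.
env:
  - name: POD_IP
    valueFrom:
      fieldRef:
        fieldPath: status.podIP
```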

I see no errors in the logs, and the Alertmanager metrics show the following:

alertmanager_cluster_members 3
alertmanager_cluster_failed_peers 0
alertmanager_cluster_health_score 0

So I can tell that the configuration works. However, when an alert fires, I receive 2 notifications and I can’t figure out why. This happens across multiple notification policies (both email and Teams notifications). Sometimes the duplicate arrives a few seconds after the first one (in fact, the time set for ha_peer_timeout), and sometimes there is no duplicate at all. I also find it odd to get exactly 2 notifications; I would expect a broken HA setup to produce 3 (one per pod).

I’d like to receive only one notification, because the duplicates flood my notification channels since I have a lot of alerts.

What about other metrics?

Here are the metrics for all members of the cluster:

grafana-0 :
alertmanager_cluster_alive_messages_total{peer="XXXX"} 8047
alertmanager_cluster_alive_messages_total{peer="YYYY"} 8047
alertmanager_cluster_alive_messages_total{peer="ZZZZ"} 8044
alertmanager_cluster_failed_peers 0
alertmanager_cluster_health_score 0
alertmanager_cluster_members 3
alertmanager_cluster_messages_pruned_total 0
alertmanager_cluster_messages_queued 0
alertmanager_cluster_messages_received_size_total{msg_type="full_state"} 3.160593013e+09
alertmanager_cluster_messages_received_size_total{msg_type="update"} 322237
alertmanager_cluster_messages_received_total{msg_type="full_state"} 8047
alertmanager_cluster_messages_received_total{msg_type="update"} 290
alertmanager_cluster_messages_sent_size_total{msg_type="full_state"} 3.160078558e+09
alertmanager_cluster_messages_sent_size_total{msg_type="update"} 61305
alertmanager_cluster_messages_sent_total{msg_type="full_state"} 8047
alertmanager_cluster_messages_sent_total{msg_type="update"} 150
alertmanager_cluster_peer_info{peer="ZZZZ"} 1
alertmanager_cluster_peers_joined_total 3
alertmanager_cluster_peers_left_total 0
alertmanager_cluster_peers_update_total 0
alertmanager_cluster_reconnections_failed_total 0
alertmanager_cluster_reconnections_total 0
alertmanager_cluster_refresh_join_failed_total 0
alertmanager_cluster_refresh_join_total 0
alertmanager_peer_position 2

grafana-1:
alertmanager_cluster_alive_messages_total{peer="AAAA"} 2
alertmanager_cluster_alive_messages_total{peer="XXXX"} 8119
alertmanager_cluster_alive_messages_total{peer="YYYY"} 8118
alertmanager_cluster_alive_messages_total{peer="ZZZZ"} 8117
alertmanager_cluster_failed_peers 0
alertmanager_cluster_health_score 0
alertmanager_cluster_members 3
alertmanager_cluster_messages_pruned_total 0
alertmanager_cluster_messages_queued 0
alertmanager_cluster_messages_received_size_total{msg_type="full_state"} 3.179963594e+09
alertmanager_cluster_messages_received_size_total{msg_type="update"} 309877
alertmanager_cluster_messages_received_total{msg_type="full_state"} 8119
alertmanager_cluster_messages_received_total{msg_type="update"} 279
alertmanager_cluster_messages_sent_size_total{msg_type="full_state"} 3.178919265e+09
alertmanager_cluster_messages_sent_size_total{msg_type="update"} 146583
alertmanager_cluster_messages_sent_total{msg_type="full_state"} 8120
alertmanager_cluster_messages_sent_total{msg_type="update"} 360
alertmanager_cluster_peer_info{peer="YYYY"} 1
alertmanager_cluster_peers_joined_total 5
alertmanager_cluster_peers_left_total 2
alertmanager_cluster_peers_update_total 0
alertmanager_cluster_reconnections_failed_total 28
alertmanager_cluster_reconnections_total 1
alertmanager_cluster_refresh_join_failed_total 0
alertmanager_cluster_refresh_join_total 1
alertmanager_peer_position 1

grafana-2:
alertmanager_cluster_alive_messages_total{peer="BBBB"} 2
alertmanager_cluster_alive_messages_total{peer="AAAA"} 3
alertmanager_cluster_alive_messages_total{peer="XXXX"} 8100
alertmanager_cluster_alive_messages_total{peer="YYYY"} 8099
alertmanager_cluster_alive_messages_total{peer="ZZZZ"} 8097
alertmanager_cluster_failed_peers 0
alertmanager_cluster_health_score 0
alertmanager_cluster_members 3
alertmanager_cluster_messages_pruned_total 0
alertmanager_cluster_messages_queued 0
alertmanager_cluster_messages_received_size_total{msg_type="full_state"} 3.167123413e+09
alertmanager_cluster_messages_received_size_total{msg_type="update"} 82074
alertmanager_cluster_messages_received_total{msg_type="full_state"} 8100
alertmanager_cluster_messages_received_total{msg_type="update"} 73
alertmanager_cluster_messages_sent_size_total{msg_type="full_state"} 3.166957212e+09
alertmanager_cluster_messages_sent_size_total{msg_type="update"} 210930
alertmanager_cluster_messages_sent_total{msg_type="full_state"} 8102
alertmanager_cluster_messages_sent_total{msg_type="update"} 510
alertmanager_cluster_peer_info{peer="XXXX"} 1
alertmanager_cluster_peers_joined_total 5
alertmanager_cluster_peers_left_total 2
alertmanager_cluster_peers_update_total 0
alertmanager_cluster_reconnections_failed_total 29
alertmanager_cluster_reconnections_total 0
alertmanager_cluster_refresh_join_failed_total 0
alertmanager_cluster_refresh_join_total 0
alertmanager_peer_position 0

Why are there peers AAAA and BBBB? Is it possible that they are outside of the cluster and sending alerts as well? Check the number of pods directly on the cluster.

This is because of ArgoCD. When I tested different configurations, the pods were redeployed one by one, so the first pod to be redeployed could still see pod AAAA, which was later replaced by XXXX. That is why grafana-2 sees AAAA and BBBB, grafana-1 sees only AAAA, and grafana-0 sees none of the old pods. I can confirm, however, that pods AAAA and BBBB are no longer up and are not sending alerts to the cluster.

OK, then I think that’s by design:

While the alert generator evaluates all alert rules on all instances, the alert receiver makes a best-effort attempt to avoid duplicate notifications. The alertmanagers use a gossip protocol to share information between them to prevent sending duplicated notifications.

Alertmanager chooses availability over consistency, which may result in occasional duplicated or out-of-order notifications. It takes the opinion that duplicate or out-of-order notifications are better than no notifications.

It could be that edge case: 2 pods send notifications, but there is not enough time for the gossip to propagate the state, since deduplication is only best-effort. The fact that you receive the duplicate a few seconds after the first notification, and only sometimes, points in that direction.
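
For context, the embedded Alertmanager staggers notification sending by peer position: the peer at position n waits roughly n × peer timeout before flushing, so that the notification log of lower-positioned peers can gossip over first and the later peers can skip the send. A minimal sketch of that wait rule (my own illustration of the mechanism, not Grafana's actual code), using the `alertmanager_peer_position` values from the metrics above and the poster's ha_peer_timeout of 30s:

```python
# Illustration only: the HA "wait" stage, where each peer delays flushing
# notifications by (peer position * peer timeout). A peer skips sending if
# a gossiped notification-log entry for the group arrives before its turn.
from datetime import timedelta

def ha_wait(peer_position: int,
            peer_timeout: timedelta = timedelta(seconds=30)) -> timedelta:
    """Delay before the peer at this position sends notifications."""
    return peer_position * peer_timeout

# Peer positions from the metrics: grafana-2 at 0, grafana-1 at 1, grafana-0 at 2.
print(ha_wait(0))  # 0:00:00 -> grafana-2 notifies immediately
print(ha_wait(1))  # 0:00:30 -> grafana-1 waits 30s, then sends only if no gossip arrived
print(ha_wait(2))  # 0:01:00 -> grafana-0 waits 60s
```

If the duplicate consistently arrives about one ha_peer_timeout after the first notification, that suggests the second peer waited its turn but never received the "already notified" gossip entry, which could point at gossip traffic (TCP and UDP on port 9094) not actually flowing between the pods despite the healthy membership metrics.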

I thought so too, but in my case pretty much every notification sent has a duplicate. 30s is a long timeout period, and it is exceeded so often that I think there might be an issue somewhere else.
I agree that I’m in the best-effort deduplication case, but it should not happen this often.
