Alerts/notifications are not deduplicated when using HA unified alerting

I’m using:

  • Grafana v8.3.2
  • Helm Chart v6.20.5
  • Helm v3
  • PostgreSQL v14.1

Here are the related configurations in my values.yaml:

…
headlessService: true
…
autoscaling:
  enabled: true
  minReplicas: 3
  maxReplicas: 6
  metrics:
  - type: Resource
    resource:
      name: cpu
      targetAverageUtilization: 60
  - type: Resource
    resource:
      name: memory
      targetAverageUtilization: 60
…
grafana.ini:
  paths:
    data: /var/lib/grafana/
    logs: /var/log/grafana
    plugins: /var/lib/grafana/plugins
    provisioning: /etc/grafana/provisioning
  analytics:
    check_for_updates: true
  log:
    mode: console
    level: info
  grafana_net:
    url: https://grafana.net
  alerting:
    enabled: false
  unified_alerting:
    enabled: true
    ha_peers: grafana-infrastructure-headless:9094
  database:
    type: postgres
    host: x.x.x.x:5432
    name: grafana
    user: xxxxxxxx
    password: xxxxxxxxxxxxxxxx
…

With this configuration I expected alerts/notifications to be deduplicated by the Alertmanager, but that is not happening; I am receiving a duplicate alert/notification from each pod.

Looking at the k8s logs of the 3 pods that initially start, I see lines like the following:

POD1:  t=2022-01-12T22:16:58+0000 lvl=info msg="component=cluster level=debug msg=\"resolved peers to following addresses\" peers=10.42.25.181:9094,10.42.23.154:9094,10.42.7.24:9094" logger=ngalert.multiorg.alertmanager
POD2:  t=2022-01-12T22:17:19+0000 lvl=info msg="component=cluster level=debug msg=\"resolved peers to following addresses\" peers=10.42.7.24:9094,10.42.25.186:9094,10.42.23.154:9094" logger=ngalert.multiorg.alertmanager
POD3:  t=2022-01-12T22:17:10+0000 lvl=info msg="component=cluster level=debug msg=\"resolved peers to following addresses\" peers=10.42.25.186:9094,10.42.23.154:9094,10.42.7.24:9094" logger=ngalert.multiorg.alertmanager

From this it appears the grafana.ini setting ha_peers = grafana-infrastructure-headless:9094 shown above is working: the peer list is set to the IPs of the peer pods that exist at the moment grafana-infrastructure-headless:9094 is resolved, which makes sense.
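
For context, my understanding is that this resolution works because the chart's headless Service (clusterIP: None) returns one A record per ready pod. A rough sketch of such a Service is below; the name, labels, and port name are illustrative assumptions on my part, not the chart's actual rendered manifest:

apiVersion: v1
kind: Service
metadata:
  name: grafana-infrastructure-headless
spec:
  clusterIP: None                       # headless: DNS returns one A record per ready Grafana pod
  selector:
    app.kubernetes.io/name: grafana     # assumed labels; the chart's real selector may differ
    app.kubernetes.io/instance: grafana-infrastructure
  ports:
    - name: gossip-tcp                  # 9094 is the Alertmanager gossip port referenced by ha_peers
      port: 9094
      targetPort: 9094
      protocol: TCP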

Question: when k8s pods are added/moved/deleted, is grafana-infrastructure-headless:9094 re-resolved so the cluster's peer IPs stay up to date?

After this, I start seeing the following “Failed to join … i/o timeout” log entries repeated over and over in each k8s pod:

…
t=2022-01-12T22:21:50+0000 lvl=info msg="component=cluster level=debug memberlist=\"2022/01/12 22:21:50 [DEBUG] memberlist: Failed to join 10.42.7.24: dial tcp 10.42.7.24:9094: i/o timeout\\n\"" logger=ngalert.multiorg.alertmanager
t=2022-01-12T22:21:50+0000 lvl=info msg="component=cluster level=debug msg=reconnect result=failure peer=01FS848WMQ30TNGPEAN2Y3QSD2 addr=10.42.7.24:9094 err=\"1 error occurred:\\n\\t* Failed to join 10.42.7.24: dial tcp 10.42.7.24:9094: i/o timeout\\n\\n\"" logger=ngalert.multiorg.alertmanager
t=2022-01-12T22:21:52+0000 lvl=info msg="component=cluster level=debug memberlist=\"2022/01/12 22:21:52 [DEBUG] memberlist: Stream connection from=10.42.23.155:58076\\n\"" logger=ngalert.multiorg.alertmanager
t=2022-01-12T22:22:00+0000 lvl=info msg="component=cluster level=debug memberlist=\"2022/01/12 22:22:00 [DEBUG] memberlist: Failed to join 10.42.23.154: dial tcp 10.42.23.154:9094: i/o timeout\\n\"" logger=ngalert.multiorg.alertmanager
t=2022-01-12T22:22:00+0000 lvl=info msg="component=cluster level=debug msg=reconnect result=failure peer=01FS848X4AXW63ZN0PSXZZ25D4 addr=10.42.23.154:9094 err=\"1 error occurred:\\n\\t* Failed to join 10.42.23.154: dial tcp 10.42.23.154:9094: i/o timeout\\n\\n\"" logger=ngalert.multiorg.alertmanager
t=2022-01-12T22:22:09+0000 lvl=info msg="component=cluster level=debug memberlist=\"2022/01/12 22:22:09 [DEBUG] memberlist: Initiating push/pull sync with: 01FS85V6VF3HXC4X207NC0EBPM 10.42.25.186:9094\\n\"" logger=ngalert.multiorg.alertmanager
t=2022-01-12T22:22:10+0000 lvl=info msg="component=cluster level=debug memberlist=\"2022/01/12 22:22:10 [DEBUG] memberlist: Failed to join 10.42.7.24: dial tcp 10.42.7.24:9094: i/o timeout\\n\"" logger=ngalert.multiorg.alertmanager
t=2022-01-12T22:22:10+0000 lvl=info msg="component=cluster level=debug msg=reconnect result=failure peer=01FS848WMQ30TNGPEAN2Y3QSD2 addr=10.42.7.24:9094 err=\"1 error occurred:\\n\\t* Failed to join 10.42.7.24: dial tcp 10.42.7.24:9094: i/o timeout\\n\\n\"" logger=ngalert.multiorg.alertmanager
…
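
As I understand it, 9094 is the memberlist gossip listener each pod opens (ha_listen_address, which I believe defaults to 0.0.0.0:9094), so these timeouts look like pod-to-pod connectivity failures on that port. Spelling the HA settings out explicitly in values.yaml would look roughly like this; the two extra values are my understanding of the defaults, not something I have verified on v8.3.x:

grafana.ini:
  unified_alerting:
    enabled: true
    ha_peers: grafana-infrastructure-headless:9094
    ha_listen_address: 0.0.0.0:9094   # gossip listener inside each pod (default, as I understand it)
    ha_peer_timeout: 15s              # how long to wait on a peer before this instance notifies anyway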

Per the documentation, it seems I have configured the Helm chart correctly. Any recommendations on how to fix this so the Alertmanagers deduplicate the alerts/notifications?

I found the following two conversations that reference each other:

To me, these basically say that Unified Alerting has issues when running under Kubernetes and that a possible fix may be available in v8.4. Can anyone confirm?

@jbs331 were you able to find a solution to this?

@luvpreet To mitigate the issue (mid-2022), I reduced our replica count to 1 and tweaked memory/CPU so a single replica could handle the load. That meant no HA, but it was preferable to receiving duplicate alarms. To actually resolve the issue, I upgraded from v8.3.2 to v8.5.9, along with the most recent Helm chart for the new version, which is where I believe the problem resided.
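
For reference, the mitigation amounted to roughly this in values.yaml (the resource numbers are placeholders; size them for whatever a single replica needs in your environment):

replicas: 1             # single instance: no HA, but no duplicate notifications either
autoscaling:
  enabled: false        # no HPA while pinned to one replica
resources:
  requests:
    cpu: "1"            # placeholder values; tune for a single replica carrying the full load
    memory: 2Gi
  limits:
    cpu: "2"
    memory: 2Gi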
