Alerts/notifications are not deduplicated when using HA unified alerting

I’m using:

  • Grafana v8.3.2
  • Helm Chart v6.20.5
  • Helm v3
  • PostgreSQL v14.1

Here are the related configurations in my values.yaml:

…
headlessService: true
…
autoscaling:
  enabled: true
  minReplicas: 3
  maxReplicas: 6
  metrics:
  - type: Resource
    resource:
      name: cpu
      targetAverageUtilization: 60
  - type: Resource
    resource:
      name: memory
      targetAverageUtilization: 60
…
grafana.ini:
  paths:
    data: /var/lib/grafana/
    logs: /var/log/grafana
    plugins: /var/lib/grafana/plugins
    provisioning: /etc/grafana/provisioning
  analytics:
    check_for_updates: true
  log:
    mode: console
    level: info
  grafana_net:
    url: https://grafana.net
  alerting:
    enabled: false
  unified_alerting:
    enabled: true
    ha_peers: grafana-infrastructure-headless:9094
  database:
    type: postgres
    host: x.x.x.x:5432
    name: grafana
    user: xxxxxxxx
    password: xxxxxxxxxxxxxxxx
…

With this configuration I expected alerts/notifications to be deduplicated by the Alertmanager, but that is not happening; I am receiving a duplicate alert/notification from each pod.

Looking at the k8s logs of the 3 pods that initially start, I see lines like the following:

POD1:  t=2022-01-12T22:16:58+0000 lvl=info msg="component=cluster level=debug msg=\"resolved peers to following addresses\" peers=10.42.25.181:9094,10.42.23.154:9094,10.42.7.24:9094" logger=ngalert.multiorg.alertmanager
POD2:  t=2022-01-12T22:17:19+0000 lvl=info msg="component=cluster level=debug msg=\"resolved peers to following addresses\" peers=10.42.7.24:9094,10.42.25.186:9094,10.42.23.154:9094" logger=ngalert.multiorg.alertmanager
POD3:  t=2022-01-12T22:17:10+0000 lvl=info msg="component=cluster level=debug msg=\"resolved peers to following addresses\" peers=10.42.25.186:9094,10.42.23.154:9094,10.42.7.24:9094" logger=ngalert.multiorg.alertmanager

From this it appears the grafana.ini setting ha_peers = grafana-infrastructure-headless:9094 shown above is working: the peer list is set to the IPs of the peer pods that exist at the moment grafana-infrastructure-headless:9094 is resolved, which makes sense.
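
For context, my understanding is that this resolution works because the chart's headless Service (clusterIP: None) returns one A record per ready pod. A rough sketch of such a Service is below; the name, labels, and port name are illustrative assumptions on my part, not the chart's actual rendered manifest:

apiVersion: v1
kind: Service
metadata:
  name: grafana-infrastructure-headless
spec:
  clusterIP: None                       # headless: DNS returns one A record per ready Grafana pod
  selector:
    app.kubernetes.io/name: grafana     # assumed labels; the chart's real selector may differ
    app.kubernetes.io/instance: grafana-infrastructure
  ports:
    - name: gossip-tcp                  # 9094 is the Alertmanager gossip port referenced by ha_peers
      port: 9094
      targetPort: 9094
      protocol: TCP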

Question: when k8s pods are added/moved/deleted, is grafana-infrastructure-headless:9094 re-resolved so the cluster's peer IPs stay up to date?

After this, I start seeing the following “Failed to join … i/o timeout” log entries repeated over and over in each k8s pod:

…
t=2022-01-12T22:21:50+0000 lvl=info msg="component=cluster level=debug memberlist=\"2022/01/12 22:21:50 [DEBUG] memberlist: Failed to join 10.42.7.24: dial tcp 10.42.7.24:9094: i/o timeout\\n\"" logger=ngalert.multiorg.alertmanager
t=2022-01-12T22:21:50+0000 lvl=info msg="component=cluster level=debug msg=reconnect result=failure peer=01FS848WMQ30TNGPEAN2Y3QSD2 addr=10.42.7.24:9094 err=\"1 error occurred:\\n\\t* Failed to join 10.42.7.24: dial tcp 10.42.7.24:9094: i/o timeout\\n\\n\"" logger=ngalert.multiorg.alertmanager
t=2022-01-12T22:21:52+0000 lvl=info msg="component=cluster level=debug memberlist=\"2022/01/12 22:21:52 [DEBUG] memberlist: Stream connection from=10.42.23.155:58076\\n\"" logger=ngalert.multiorg.alertmanager
t=2022-01-12T22:22:00+0000 lvl=info msg="component=cluster level=debug memberlist=\"2022/01/12 22:22:00 [DEBUG] memberlist: Failed to join 10.42.23.154: dial tcp 10.42.23.154:9094: i/o timeout\\n\"" logger=ngalert.multiorg.alertmanager
t=2022-01-12T22:22:00+0000 lvl=info msg="component=cluster level=debug msg=reconnect result=failure peer=01FS848X4AXW63ZN0PSXZZ25D4 addr=10.42.23.154:9094 err=\"1 error occurred:\\n\\t* Failed to join 10.42.23.154: dial tcp 10.42.23.154:9094: i/o timeout\\n\\n\"" logger=ngalert.multiorg.alertmanager
t=2022-01-12T22:22:09+0000 lvl=info msg="component=cluster level=debug memberlist=\"2022/01/12 22:22:09 [DEBUG] memberlist: Initiating push/pull sync with: 01FS85V6VF3HXC4X207NC0EBPM 10.42.25.186:9094\\n\"" logger=ngalert.multiorg.alertmanager
t=2022-01-12T22:22:10+0000 lvl=info msg="component=cluster level=debug memberlist=\"2022/01/12 22:22:10 [DEBUG] memberlist: Failed to join 10.42.7.24: dial tcp 10.42.7.24:9094: i/o timeout\\n\"" logger=ngalert.multiorg.alertmanager
t=2022-01-12T22:22:10+0000 lvl=info msg="component=cluster level=debug msg=reconnect result=failure peer=01FS848WMQ30TNGPEAN2Y3QSD2 addr=10.42.7.24:9094 err=\"1 error occurred:\\n\\t* Failed to join 10.42.7.24: dial tcp 10.42.7.24:9094: i/o timeout\\n\\n\"" logger=ngalert.multiorg.alertmanager
…
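
As I understand it, 9094 is the memberlist gossip listener each pod opens (ha_listen_address, which I believe defaults to 0.0.0.0:9094), so these timeouts look like pod-to-pod connectivity failures on that port. Spelling the HA settings out explicitly in values.yaml would look roughly like this; the two extra values are my understanding of the defaults, not something I have verified on v8.3.x:

grafana.ini:
  unified_alerting:
    enabled: true
    ha_peers: grafana-infrastructure-headless:9094
    ha_listen_address: 0.0.0.0:9094   # gossip listener inside each pod (default, as I understand it)
    ha_peer_timeout: 15s              # how long to wait on a peer before this instance notifies anyway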

Per the documentation, it seems I have configured the Helm chart correctly. Any recommendations on how to fix this so the Alertmanagers deduplicate the alerts/notifications?

I found the following two conversations that reference each other:

To me, these basically say that Unified Alerting has issues when running under Kubernetes and that a possible fix may be available in v8.4. Can anyone confirm?

@jbs331 were you able to find a solution to this?

@luvpreet To mitigate the issue (mid-2022), I reduced our replica count to 1 and tweaked memory/CPU so a single replica could handle the load. That meant no HA, but it was preferable to receiving duplicate alarms. To actually resolve the issue, I upgraded from v8.3.2 to v8.5.9, along with the most recent Helm chart for the new version, which is where I believe the problem resided.
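
For reference, the mitigation amounted to roughly this in values.yaml (the resource numbers are placeholders; size them for whatever a single replica needs in your environment):

replicas: 1             # single instance: no HA, but no duplicate notifications either
autoscaling:
  enabled: false        # no HPA while pinned to one replica
resources:
  requests:
    cpu: "1"            # placeholder values; tune for a single replica carrying the full load
    memory: 2Gi
  limits:
    cpu: "2"
    memory: 2Gi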
