Grafana Alertmanager cluster size

  • What Grafana version and what operating system are you using?
    I have tested with Grafana 11.4.1 and 11.6.1

  • What are you trying to achieve?
    I’m trying to run multiple Grafana setups on one Kubernetes cluster. Each setup has two replicas.

  • How are you trying to achieve it?
    I’m using the Helm chart. My HA setup:

...
  extraExposePorts:
    - name: "grafana-alert"
      port: 9094
      targetPort: 9094

  grafana.ini:
...
    unified_alerting:
      enabled: true
      ha_listen_address: "${grafana_pod_ip}:9094"
      ha_advertise_address: "${grafana_pod_ip}:9094"
      ha_peers: "$CLUSTER_NAME:9094"
...

  serviceMonitor:
    enabled: true

...

Each Grafana setup is in a different namespace.

  • What happened?
    Alertmanager connects to ALL Alertmanagers on the Kubernetes cluster. If I create 20 or 30 Grafana setups, they all connect to each other (20 setups * 2 replicas = 40 Alertmanager peers). Why? How can I prevent this?
    It seems that alert delivery is delayed because of this mesh clustering.

  • What did you expect to happen?
    Each Grafana setup should connect ONLY to its own Grafana Alertmanagers.

  • Can you copy/paste the configuration(s) that you are having problems with?
    Not sure which part.

  • Did you receive any errors in the Grafana UI or in related logs? If so, please tell us exactly what they were.
    If I delete the deployment and re-create it, it initially has only its own replica among the Alertmanager peers, but within a few minutes all the other Grafana setups are introduced to the Alertmanager. Log example:

logger=ngalert.multiorg.alertmanager component=clustering t=2025-05-08T09:08:34.884518081Z level=debug memberlist="2025/05/08 09:08:34 [DEBUG] memberlist: Stream connection from=10.15.99.117:41146\n"
logger=ngalert.multiorg.alertmanager component=clustering t=2025-05-08T09:08:34.885068991Z level=debug received=NotifyJoin node=01JTJB58VHEEKEHK1AZCC9C7R5 addr=10.15.118.205:9094
logger=ngalert.multiorg.alertmanager component=clustering t=2025-05-08T09:08:34.885105163Z level=debug received=NotifyJoin node=01JTJRRZ26HXEJCW0S28PMK34G addr=10.15.184.150:9094
logger=ngalert.multiorg.alertmanager component=clustering t=2025-05-08T09:08:34.885126984Z level=debug received=NotifyJoin node=01JTJPQQMBXGB4AMP3449S5ZF7 addr=10.15.106.212:9094
logger=ngalert.multiorg.alertmanager component=clustering t=2025-05-08T09:08:34.885145705Z level=debug received=NotifyJoin node=01JTJB43D13BVHKJTDM6A16S2J addr=10.15.31.177:9094
logger=ngalert.multiorg.alertmanager component=clustering t=2025-05-08T09:08:34.885163786Z level=debug received=NotifyJoin node=01JTJWW3VM6N1TQ6YYBCBMASGV addr=10.15.129.87:9094
...

I’m confused about how Grafana Alertmanager discovers peers. I have done some experiments disabling and enabling things, but didn’t understand the mechanics behind it. I’m looking for a good explanation of how that discovery works and how to prevent it :)

Is CLUSTER_NAME the same for all namespaces? That could explain why they all join the same cluster. ha_peers should point to a headless service scoped to the namespace, so that it resolves only to the pods of that one Grafana setup.
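
As a rough sketch of that suggestion, assuming a setup deployed in a namespace called team-a with a headless Service called grafana-alerting (both names are hypothetical, as are the pod labels in the selector), the configuration could look something like the following. Whether this resolves the problem depends on what $CLUSTER_NAME currently expands to, which the post does not show.

```yaml
# Headless Service that resolves only to this setup's Grafana pods.
# Names and labels are placeholders; adjust them to match the release.
apiVersion: v1
kind: Service
metadata:
  name: grafana-alerting
  namespace: team-a
spec:
  clusterIP: None                          # headless: DNS returns the pod IPs directly
  selector:
    app.kubernetes.io/name: grafana        # assumed labels; must match this release's pods
    app.kubernetes.io/instance: grafana
  ports:
    - name: gossip-tcp
      port: 9094
      targetPort: 9094
      protocol: TCP
    - name: gossip-udp
      port: 9094
      targetPort: 9094
      protocol: UDP
---
# Helm values fragment for the same setup (nest it under the same
# parent key that the existing values file uses).
grafana.ini:
  unified_alerting:
    enabled: true
    ha_listen_address: "${grafana_pod_ip}:9094"
    ha_advertise_address: "${grafana_pod_ip}:9094"
    # Fully qualified, namespace-scoped name instead of a shared short name
    ha_peers: "grafana-alerting.team-a.svc.cluster.local:9094"
```

Using the fully qualified name (service.namespace.svc.cluster.local) removes any ambiguity about which namespace the peer lookup lands in. Each setup would get its own Service and its own ha_peers value.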