I’m using:
- Grafana v8.3.2
- Helm Chart v6.20.5
- Helm v3
- PostgreSQL v14.1
Here are the relevant parts of my values.yaml:
…
headlessService: true
…
autoscaling:
  enabled: true
  minReplicas: 3
  maxReplicas: 6
  metrics:
    - type: Resource
      resource:
        name: cpu
        targetAverageUtilization: 60
    - type: Resource
      resource:
        name: memory
        targetAverageUtilization: 60
…
grafana.ini:
  paths:
    data: /var/lib/grafana/
    logs: /var/log/grafana
    plugins: /var/lib/grafana/plugins
    provisioning: /etc/grafana/provisioning
  analytics:
    check_for_updates: true
  log:
    mode: console
    level: info
  grafana_net:
    url: https://grafana.net
  alerting:
    enabled: false
  unified_alerting:
    enabled: true
    ha_peers: grafana-infrastructure-headless:9094
  database:
    type: postgres
    host: x.x.x.x:5432
    name: grafana
    user: xxxxxxxx
    password: xxxxxxxxxxxxxxxx
…
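For reference, a quick way to confirm that the headless service resolves to the current pod IPs is to list its endpoints (a rough check only; the service name comes from my ha_peers setting, and the "monitoring" namespace is just a placeholder for wherever the chart is deployed):

kubectl -n monitoring get endpoints grafana-infrastructure-headless -o wide
kubectl -n monitoring get pods -o wide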
With this configuration I expected alerts/notifications to be deduplicated by the Alertmanager, but that is not happening; I am receiving duplicate alerts/notifications from each pod.
Looking at the k8s logs of the 3 pods that start initially, I see lines like the following:
POD1: t=2022-01-12T22:16:58+0000 lvl=info msg="component=cluster level=debug msg=\"resolved peers to following addresses\" peers=10.42.25.181:9094,10.42.23.154:9094,10.42.7.24:9094" logger=ngalert.multiorg.alertmanager
POD2: t=2022-01-12T22:17:19+0000 lvl=info msg="component=cluster level=debug msg=\"resolved peers to following addresses\" peers=10.42.7.24:9094,10.42.25.186:9094,10.42.23.154:9094" logger=ngalert.multiorg.alertmanager
POD3: t=2022-01-12T22:17:10+0000 lvl=info msg="component=cluster level=debug msg=\"resolved peers to following addresses\" peers=10.42.25.186:9094,10.42.23.154:9094,10.42.7.24:9094" logger=ngalert.multiorg.alertmanager
From this it appears that the ha_peers: grafana-infrastructure-headless:9094 setting shown above is working: the peers resolve to the IPs of the pods that exist at the moment grafana-infrastructure-headless:9094 is looked up, which makes sense.
My question: when k8s pods get added/moved/deleted, is grafana-infrastructure-headless:9094 re-resolved and the peer list reloaded, so the cluster's peer IPs stay up to date?
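One way I can think of to observe this is to watch the endpoints of the headless service while deleting one of the Grafana pods and letting it be rescheduled (again using the placeholder "monitoring" namespace; the pod name below is hypothetical):

kubectl -n monitoring get endpoints grafana-infrastructure-headless -w
kubectl -n monitoring delete pod <one-of-the-grafana-pods>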
After those initial startup lines, I start seeing the following "Failed to join … i/o timeout" log entries repeated over and over in each pod:
…
t=2022-01-12T22:21:50+0000 lvl=info msg="component=cluster level=debug memberlist=\"2022/01/12 22:21:50 [DEBUG] memberlist: Failed to join 10.42.7.24: dial tcp 10.42.7.24:9094: i/o timeout\\n\"" logger=ngalert.multiorg.alertmanager
t=2022-01-12T22:21:50+0000 lvl=info msg="component=cluster level=debug msg=reconnect result=failure peer=01FS848WMQ30TNGPEAN2Y3QSD2 addr=10.42.7.24:9094 err=\"1 error occurred:\\n\\t* Failed to join 10.42.7.24: dial tcp 10.42.7.24:9094: i/o timeout\\n\\n\"" logger=ngalert.multiorg.alertmanager
t=2022-01-12T22:21:52+0000 lvl=info msg="component=cluster level=debug memberlist=\"2022/01/12 22:21:52 [DEBUG] memberlist: Stream connection from=10.42.23.155:58076\\n\"" logger=ngalert.multiorg.alertmanager
t=2022-01-12T22:22:00+0000 lvl=info msg="component=cluster level=debug memberlist=\"2022/01/12 22:22:00 [DEBUG] memberlist: Failed to join 10.42.23.154: dial tcp 10.42.23.154:9094: i/o timeout\\n\"" logger=ngalert.multiorg.alertmanager
t=2022-01-12T22:22:00+0000 lvl=info msg="component=cluster level=debug msg=reconnect result=failure peer=01FS848X4AXW63ZN0PSXZZ25D4 addr=10.42.23.154:9094 err=\"1 error occurred:\\n\\t* Failed to join 10.42.23.154: dial tcp 10.42.23.154:9094: i/o timeout\\n\\n\"" logger=ngalert.multiorg.alertmanager
t=2022-01-12T22:22:09+0000 lvl=info msg="component=cluster level=debug memberlist=\"2022/01/12 22:22:09 [DEBUG] memberlist: Initiating push/pull sync with: 01FS85V6VF3HXC4X207NC0EBPM 10.42.25.186:9094\\n\"" logger=ngalert.multiorg.alertmanager
t=2022-01-12T22:22:10+0000 lvl=info msg="component=cluster level=debug memberlist=\"2022/01/12 22:22:10 [DEBUG] memberlist: Failed to join 10.42.7.24: dial tcp 10.42.7.24:9094: i/o timeout\\n\"" logger=ngalert.multiorg.alertmanager
t=2022-01-12T22:22:10+0000 lvl=info msg="component=cluster level=debug msg=reconnect result=failure peer=01FS848WMQ30TNGPEAN2Y3QSD2 addr=10.42.7.24:9094 err=\"1 error occurred:\\n\\t* Failed to join 10.42.7.24: dial tcp 10.42.7.24:9094: i/o timeout\\n\\n\"" logger=ngalert.multiorg.alertmanager
…
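To rule out a plain networking issue, TCP connectivity to the gossip port can be tested from a throwaway pod in the same cluster (a rough check only; substitute a current peer pod IP for the 10.42.7.24 taken from my logs, and the "monitoring" namespace placeholder as above):

kubectl -n monitoring run gossip-test --rm -it --restart=Never --image=busybox -- nc -w 2 10.42.7.24 9094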
Per the documentation it seems I have configured the Helm chart correctly. Any recommendations on how to fix this so the Alertmanagers deduplicate the alerts/notifications?