Slack notifications spamming for one and only one alert 🤔 Grafana Alerting in HA mode?

Hi community :wave:

I have a very strange issue :thinking:

Context

  • In my organization, we are using Grafana alerting (so far, so good :smile: )
  • We have recently set up HA for alerting (I hope I am not misleading you with that information ^^); a rough config sketch follows this list
  • Our Grafana instances are deployed with the Grafana-Operator on Kubernetes.
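
For reference, this is roughly what the HA part of our unified alerting configuration looks like. It is only a sketch: the headless service name, namespace and port are placeholders, and the real values are passed through the Grafana-Operator CR rather than a raw grafana.ini.

```ini
# Sketch of the HA-related [unified_alerting] settings (placeholder service name/namespace)
[unified_alerting]
enabled = true
# Gossip address each instance listens on for HA state/notification sync
ha_listen_address = "0.0.0.0:9094"
# Headless service resolving to every Grafana pod, so the peers can discover each other
ha_peers = "grafana-alerting-headless.monitoring.svc.cluster.local:9094"
ha_peer_timeout = 15s
```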

Observations

Observation 1

  • A given alert has been in the firing state for more than 3 days on one Grafana instance
  • On the other instance, which we sometimes reach when going through the ALB ingress in front of both of them, the same alert is reported as "firing" for only 16h (see the check sketch after the note below)

→ :question: 1st question: How come?

:memo: Note:

  • The alert is based on a Prometheus query.
  • The Prometheus data source for this metric and alert points to an ALB endpoint, with 2 Prometheus servers behind it.
  • We are aware that the data source has no "stickiness" regarding which Prometheus server is targeted… We plan to use Thanos Query + Thanos sidecars to address this later (this is not the main topic of this post, I think :sweat_smile: )
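
To compare what each instance reports, here is the kind of check that can be done by port-forwarding to each pod and querying the embedded Alertmanager API directly, bypassing the ALB. This is just a sketch: pod names, namespace and the API token are placeholders.

```bash
# Placeholder pod names/namespace: reach each Grafana pod directly, bypassing the ALB
kubectl -n monitoring port-forward pod/grafana-deployment-0 3000:3000 &
kubectl -n monitoring port-forward pod/grafana-deployment-1 3001:3000 &

# Ask each instance's embedded Alertmanager which alerts it considers active, and since when
curl -s -H "Authorization: Bearer $GRAFANA_TOKEN" \
  "http://localhost:3000/api/alertmanager/grafana/api/v2/alerts" | jq '.[] | {labels, startsAt}'
curl -s -H "Authorization: Bearer $GRAFANA_TOKEN" \
  "http://localhost:3001/api/alertmanager/grafana/api/v2/alerts" | jq '.[] | {labels, startsAt}'
```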

Observation 2

Here are some Grafana internal metrics from our 2 instances

→ :question: Why don't we have the same number of alerts in total?

:memo: Both instances are using the same MariaDB instance as a backend.
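
Since both instances share the same MariaDB, one sanity check could be to look at what is actually persisted in the alerting state table. The sketch below assumes the unified-alerting schema stores states in an alert_instance table with a current_state column (that is my assumption, to be confirmed against the actual schema; host and credentials are placeholders).

```bash
# Placeholder host/credentials; alert_instance / current_state are assumed names
mysql -h mariadb.internal -u grafana -p grafana \
  -e "SELECT current_state, COUNT(*) FROM alert_instance GROUP BY current_state;"
```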

Observation 3

Here are some AlertManager cluster metrics

→ :question: I have to admit, I don't know which metric I am supposed to look at, and I don't see anything obvious :sweat_smile: Except the number of messages sent/received, which matches the period when the alert was spamming our Slack channel until we silenced it.
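
For reference, the sent/received counters mentioned above are the standard Alertmanager gossip metrics, which I am reading roughly like this (assuming the embedded Alertmanager exposes them on Grafana's /metrics endpoint; metric names should be double-checked against what the instances actually expose, and the port-forwards from the earlier sketch are reused):

```bash
# Check gossip membership and message traffic on each instance (assumed metric names)
curl -s http://localhost:3000/metrics | \
  grep -E 'alertmanager_cluster_(members|failed_peers|messages_(sent|received)_total)'
curl -s http://localhost:3001/metrics | \
  grep -E 'alertmanager_cluster_(members|failed_peers|messages_(sent|received)_total)'
```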

Do any of you have suggestions regarding this issue?
I can provide more details if required :innocent:

Thanks for your time and help :pray: