Hi community,
I have a very strange issue.
Context
- In my organization, we are using Grafana alerting (so far, so good).
- We have recently set up HA for alerting (I hope I am not misleading with that information ^^).
- Our Grafana instances are deployed with the Grafana-Operator on Kubernetes; a rough sketch of the relevant configuration is below.
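For reference, here is roughly what the HA part of the setup looks like. This is only a sketch: the namespace, headless service name, and environment-variable approach are illustrative placeholders rather than our exact manifests; the `[unified_alerting]` keys are passed through the Grafana CR managed by the Grafana-Operator.

```yaml
# Rough sketch only -- names and namespace are placeholders.
apiVersion: grafana.integreatly.org/v1beta1
kind: Grafana
metadata:
  name: grafana
  namespace: monitoring
spec:
  config:
    unified_alerting:
      enabled: "true"
      # Headless service resolving to both Grafana pods (illustrative name)
      ha_peers: "grafana-alerting-peers.monitoring.svc:9094"
      ha_listen_address: "0.0.0.0:9094"
      # ha_advertise_address is set per pod (to the pod IP) via the
      # GF_UNIFIED_ALERTING_HA_ADVERTISE_ADDRESS environment variable.
```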
Observations
Observation 1
- A given alert has been in the firing state for more than 3 days on one Grafana instance.
- On the other instance, which we sometimes hit when going through the ALB ingress in front of both of them, the same alert is reported as "firing" for only 16h.
→ 1st question: how come?
Note:
- The alert is based on a Prometheus query.
- The Prometheus data source for this metric and alert is an ALB endpoint; behind it are 2 Prometheus servers (a sketch of the data source definition follows this note).
- We are aware that the data source has no "stickiness" regarding which Prometheus server is targeted… We plan to use Thanos Query + Thanos sidecars to address this later (this is not the main debate of this topic, I think).
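To make that concrete, the data source is declared roughly like this (standard Grafana data source provisioning format; the name and URL are placeholders, the real URL points at the ALB in front of the two Prometheus servers):

```yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    # ALB endpoint load-balancing across the two Prometheus servers (placeholder URL)
    url: http://prometheus-alb.internal.example.com:9090
    isDefault: true
```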
Observation 2
Here are some Grafana internal metrics from our 2 instances:
→ Why don't we have the same number of alerts in total?
Both instances are using the same MariaDB instance as a backend.
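For completeness, the shared database is configured along these lines in the same Grafana CR as above (host and credentials are placeholders; only the `[database]` section is shown):

```yaml
spec:
  config:
    database:
      type: mysql
      # Shared MariaDB instance used by both Grafana replicas (placeholder host)
      host: mariadb.monitoring.svc:3306
      name: grafana
      user: grafana
      # The password is injected via GF_DATABASE_PASSWORD from a Secret, not inline.
```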
Observation 3
Here are some AlertManager cluster metrics:
→ I have to admit, I don't know which metric I am supposed to look at, and I don't see anything obvious, except that the number of messages sent/received matches the period when the alert was spamming our Slack channel, until we silenced it.
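For what it's worth, this is the kind of check I was considering for the cluster health, but please take it with a grain of salt: the metric names below are the upstream Prometheus Alertmanager cluster metrics, and I am not certain Grafana's embedded Alertmanager exposes them under exactly these names.

```yaml
# Sketch of a Prometheus alerting rule on the Alertmanager cluster metrics.
# Assumes the upstream metric names (alertmanager_cluster_*) are exposed as-is
# by Grafana's embedded Alertmanager -- please correct me if that is wrong.
groups:
  - name: grafana-alerting-ha
    rules:
      - alert: GrafanaAlertmanagerClusterDegraded
        # With 2 Grafana replicas, each peer should report 2 cluster members.
        expr: min(alertmanager_cluster_members) < 2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Grafana Alertmanager HA cluster has fewer members than expected"
      - alert: GrafanaAlertmanagerFailedPeers
        expr: max(alertmanager_cluster_failed_peers) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Grafana Alertmanager HA cluster reports failed peers"
```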
Do any of you have suggestions regarding this issue?
I can provide more details if required.
Thanks for your time and help