Grafana alert state lost on restart

  • What Grafana version and what operating system are you using? 12.1.0, Alpine Linux v3.21

  • What are you trying to achieve? I’m trying to have my alert states retain their status when a grafana instance gets evicted.

  • How are you trying to achieve it? I have Grafana running HA in a k8s cluster (using the Grafana operator), and have a PDB of 1 with a 2 minute wait between evictions, in the attempt to try to allow it sufficient time to do a state sync between each other.

  • What happened? It seems that no matter what I do, the alert state gets lost, and alerts that were still firing will get resolved, with variable substitution also getting broken. Resolutions will return like:

###

:white_check_mark: Alert Resolved

{{ $labels.customresource_kind }} {{ $labels.name }} has failed to reconcile.

###

  • What did you expect to happen? The state of the alerts should have synced with each other, preventing the state to be lost.

I confirmed the sync is happening, and my metrics do recognize that there are 2 peers. I did also happen to notice my alert state history stopped working around the time I enabled HA. Forgive me if this is a novice question. I’m still fairly new to Grafana.

Hi, do you have any PVC or backend database configured? AFAIK the state is being saved in database and if there’s no persistent state, alerts might be resolved. Also, what do you mean by alert state history stopped working?

I have no PVC, I have a postgres DB hosted on RDS to store dashboards and users, but it doesn’t seem to store the state history. Would the general suggestion be to have it mount a PVC? As for the alert state history, there’s the history in the alerts page where it provides updates as to when a state shifts from normal to alerting, or missingseries. It seems to have data from back in April when I configured HA, but never updated since.

I should also note that the scrape intervals are a bit lengthy at 3 minutes to reduce costs.