Grafana unable to start due to multiorg alertmanager manager failing

Hi,

We had an issue where the BigQuery plugin consumed so much memory that the kernel's out-of-memory (OOM) killer stepped in and started killing processes. That in turn caused our health check to fail and the host to be recycled.

When the new host spun up, Grafana failed to start, complaining that the “multiorg alertmanager manager failed to warm up”.

You can see systemd repeatedly attempting to restart the service without success:

Apr 01 12:19:27 i-08a67e121b8a61463 systemd[1]: Starting Grafana instance...
Apr 01 12:20:00 i-08a67e121b8a61463 grafana[1375]: Error: ✗ failed to initialize alerting because multiorg alertmanager manager failed to warm up: context deadline exceeded
Apr 01 12:20:00 i-08a67e121b8a61463 systemd[1]: grafana-server.service: Main process exited, code=exited, status=1/FAILURE
Apr 01 12:20:00 i-08a67e121b8a61463 systemd[1]: grafana-server.service: Failed with result 'exit-code'.
Apr 01 12:20:00 i-08a67e121b8a61463 systemd[1]: Failed to start Grafana instance.
Apr 01 12:20:00 i-08a67e121b8a61463 systemd[1]: grafana-server.service: Service RestartSec=100ms expired, scheduling restart.
Apr 01 12:20:00 i-08a67e121b8a61463 systemd[1]: grafana-server.service: Scheduled restart job, restart counter is at 1.
Apr 01 12:20:00 i-08a67e121b8a61463 systemd[1]: Stopped Grafana instance.
Apr 01 12:20:00 i-08a67e121b8a61463 systemd[1]: Starting Grafana instance...
Apr 01 12:20:34 ip-10-113-26-92 grafana[1618]: Error: ✗ failed to initialize alerting because multiorg alertmanager manager failed to warm up: context deadline exceeded
Apr 01 12:20:34 ip-10-113-26-92 systemd[1]: grafana-server.service: Main process exited, code=exited, status=1/FAILURE
Apr 01 12:20:34 ip-10-113-26-92 systemd[1]: grafana-server.service: Failed with result 'exit-code'.
Apr 01 12:20:34 ip-10-113-26-92 systemd[1]: Failed to start Grafana instance.
Apr 01 12:20:34 ip-10-113-26-92 systemd[1]: grafana-server.service: Service RestartSec=100ms expired, scheduling restart.
Apr 01 12:20:34 ip-10-113-26-92 systemd[1]: grafana-server.service: Scheduled restart job, restart counter is at 2.
Apr 01 12:20:34 ip-10-113-26-92 systemd[1]: Stopped Grafana instance.
Apr 01 12:20:34 ip-10-113-26-92 systemd[1]: Starting Grafana instance...
Apr 01 12:21:07 ip-10-113-26-92 grafana[1718]: Error: ✗ failed to initialize alerting because multiorg alertmanager manager failed to warm up: context deadline exceeded
Apr 01 12:21:07 ip-10-113-26-92 systemd[1]: grafana-server.service: Main process exited, code=exited, status=1/FAILURE
Apr 01 12:21:07 ip-10-113-26-92 systemd[1]: grafana-server.service: Failed with result 'exit-code'.
Apr 01 12:21:07 ip-10-113-26-92 systemd[1]: Failed to start Grafana instance.
Apr 01 12:21:07 ip-10-113-26-92 systemd[1]: grafana-server.service: Service RestartSec=100ms expired, scheduling restart.
Apr 01 12:21:07 ip-10-113-26-92 systemd[1]: grafana-server.service: Scheduled restart job, restart counter is at 3.
Apr 01 12:21:07 ip-10-113-26-92 systemd[1]: Stopped Grafana instance.
Apr 01 12:21:07 ip-10-113-26-92 systemd[1]: Starting Grafana instance...
Apr 01 12:21:40 ip-10-113-26-92 grafana[1796]: Error: ✗ failed to initialize alerting because multiorg alertmanager manager failed to warm up: context deadline exceeded
Apr 01 12:21:40 ip-10-113-26-92 systemd[1]: grafana-server.service: Main process exited, code=exited, status=1/FAILURE
Apr 01 12:21:40 ip-10-113-26-92 systemd[1]: grafana-server.service: Failed with result 'exit-code'.
Apr 01 12:21:40 ip-10-113-26-92 systemd[1]: Failed to start Grafana instance.
Apr 01 12:21:40 ip-10-113-26-92 systemd[1]: grafana-server.service: Service RestartSec=100ms expired, scheduling restart.
Apr 01 12:21:40 ip-10-113-26-92 systemd[1]: grafana-server.service: Scheduled restart job, restart counter is at 4.
Apr 01 12:21:40 ip-10-113-26-92 systemd[1]: Stopped Grafana instance.
Apr 01 12:21:40 ip-10-113-26-92 systemd[1]: Starting Grafana instance...
Apr 01 12:22:13 ip-10-113-26-92 grafana[1878]: Error: ✗ failed to initialize alerting because multiorg alertmanager manager failed to warm up: context deadline exceeded
Apr 01 12:22:13 ip-10-113-26-92 systemd[1]: grafana-server.service: Main process exited, code=exited, status=1/FAILURE
Apr 01 12:22:13 ip-10-113-26-92 systemd[1]: grafana-server.service: Failed with result 'exit-code'.
Apr 01 12:22:13 ip-10-113-26-92 systemd[1]: Failed to start Grafana instance.
Apr 01 12:22:13 ip-10-113-26-92 systemd[1]: grafana-server.service: Service RestartSec=100ms expired, scheduling restart.
Apr 01 12:22:13 ip-10-113-26-92 systemd[1]: grafana-server.service: Scheduled restart job, restart counter is at 5.
Apr 01 12:22:13 ip-10-113-26-92 systemd[1]: Stopped Grafana instance.
Apr 01 12:22:13 ip-10-113-26-92 systemd[1]: Starting Grafana instance...
Apr 01 12:22:30 ip-10-113-26-92 systemd[1]: grafana-server.service: Succeeded.
Apr 01 12:22:30 ip-10-113-26-92 systemd[1]: Stopped Grafana instance.
Apr 01 12:22:30 ip-10-113-26-92 systemd[1]: Starting Grafana instance...
Apr 01 12:22:34 ip-10-113-26-92 systemd[1]: Started Grafana instance.

To get the service to start, I had to comment out the HA alerting settings, start the service, and then re-enable them afterwards.
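
For anyone who wants to try the same workaround, here is a minimal sketch of what it might look like in grafana.ini. It assumes HA alerting is configured through the `ha_*` options under `[unified_alerting]`. The listen address and peer hostnames below are hypothetical placeholders, not values from our setup.

```ini
# grafana.ini (sketch only; addresses and hostnames are hypothetical)
[unified_alerting]
enabled = true

# Temporarily commented out so Grafana starts without alerting HA clustering.
# Uncomment and restart once the service is up again.
;ha_listen_address = "0.0.0.0:9094"
;ha_peers = "grafana-1.example.internal:9094,grafana-2.example.internal:9094"
```

With these lines commented out, the instance should start as a single, non-clustered Alertmanager, so this is only a stopgap until HA is re-enabled.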

I was wondering whether grafana/grafana issue #92707, "Alerting: Panic after failing to create an Alertmanager for an org (duplicate metrics collector registration attempted)", could be related to my issue. I'm currently running Grafana v10.4.8 (7a08b0bae7).

Anyone got any ideas? In the meantime I’m going to try to recreate this outside of our production systems :)