Grafana alerting scalability

tehlersexpedia · May 1, 2018, 10:04pm

I’ve been experimenting with Grafana alerting on 5.1.0, with a t2.medium (2 cores, 4 GB RAM) grafana-server and a t2.medium Aurora backend. I’m finding a lot of scaling issues, and I’m not sure what should be on the issue board and what should not.

My test setup:
I create a dashboard with 100 simple single-series graphs against the Metrictank backend, each with 1 alert. This should rule out Graphite/Metrictank as a bottleneck, since Metrictank will cache the query and return quickly. I then ramp up by adding a 2nd, 3rd, … 50th dashboard.

The queue is far too bursty. What do I mean by this? If everybody in your org sets an alert to run every 60s, they all go into the queue at exactly the same second (for example 2:00:00) and miss their window if the queue is overrun. The effect is that some alerts never fire. This seems to start happening a little after 5000 alerts. For example, in the log below no evaluation appears at 21:42:10:

timeout 360 tail -f /var/log/grafana/grafana.log | grep 'Alerting Benchmark9 test16 alert'
t=2018-05-01T21:41:10+0000 lvl=dbug msg="Scheduler: Putting job on to exec queue" logger=alerting.scheduler name="Alerting Benchmark9 test16 alert" id=15862
t=2018-05-01T21:41:10+0000 lvl=dbug msg="Job Execution completed" logger=alerting.engine timeMs=111.415 alertId=15862 name="Alerting Benchmark9 test16 alert" firing=false attemptID=1
t=2018-05-01T21:43:10+0000 lvl=dbug msg="Scheduler: Putting job on to exec queue" logger=alerting.scheduler name="Alerting Benchmark9 test16 alert" id=15862
t=2018-05-01T21:43:10+0000 lvl=dbug msg="Job Execution completed" logger=alerting.engine timeMs=119.315 alertId=15862 name="Alerting Benchmark9 test16 alert" firing=false attemptID=1
t=2018-05-01T21:44:10+0000 lvl=dbug msg="Scheduler: Putting job on to exec queue" logger=alerting.scheduler name="Alerting Benchmark9 test16 alert" id=15862
t=2018-05-01T21:44:11+0000 lvl=dbug msg="Job Execution completed" logger=alerting.engine timeMs=-476.701 alertId=15862 name="Alerting Benchmark9 test16 alert" firing=false attemptID=1
Terminated
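
To illustrate the burstiness, here is a minimal Python sketch that compares the peak number of alerts enqueued in a single second when every alert fires at the top of its interval against the peak when each alert gets a stable offset derived from its id. This is not Grafana's scheduler code; the function names (enqueue_second, peak_load) and the id-modulo-interval offset are assumptions made purely for illustration.

```python
from collections import Counter


def enqueue_second(alert_id: int, interval_s: int, jitter: bool) -> int:
    """Return the second within the interval at which an alert is enqueued.

    Hypothetical model: without jitter every alert lands on second 0;
    with jitter each alert gets a stable offset of alert_id % interval_s.
    """
    return alert_id % interval_s if jitter else 0


def peak_load(num_alerts: int, interval_s: int, jitter: bool) -> int:
    """Return the largest number of alerts enqueued in any single second."""
    counts = Counter(
        enqueue_second(alert_id, interval_s, jitter)
        for alert_id in range(num_alerts)
    )
    return max(counts.values())


if __name__ == "__main__":
    # 5000 alerts, all evaluated every 60 seconds.
    print(peak_load(5000, 60, jitter=False))  # 5000: everything hits the queue at once
    print(peak_load(5000, 60, jitter=True))   # 84: spread across the 60-second window
```

In this toy model the same 5000 alerts produce a worst-case burst of 5000 jobs in one second without offsets, versus 84 with them, which is the kind of smoothing that would keep a fixed-size queue from being overrun.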

The alerts list UI is unusable at scale: opening https://grafana.yadda/alerting/list is not possible once you scale up.
Should this queue be adjustable? https://github.com/grafana/grafana/blob/master/pkg/services/alerting/engine.go#L46
Clustering does not exist (known issue; Walmartlabs has a fork that works, and hopefully one day they can merge it!). If you are wondering what the alternative is, simply enable ‘execute_alerts’ on only one node in your config.

Has anybody else gone above 5000 alerts in Grafana? What is your experience?

user274 · June 4, 2018, 7:40pm

This interests us greatly… I am especially curious to hear what Grafana’s thoughts on this are…

cukal · September 1, 2018, 2:09am

Where Grafana shines in visualising, it lacks in alerting; it’s pretty much just bolted on. If you want a lot of alerting control and need scale, check out Prometheus Alertmanager coupled with Alerta.

johntdyer · February 12, 2019, 1:40am

Kinda disappointed that no one from Grafana has chimed in here …
