I currently have a few thousand alert rules, which will probably grow to a few tens of thousands because we need to make an exact copy of a previous setup in another alerting system. The evaluation interval of all checks is set to 5 minutes. I was hoping that the alertmanager would spread these evenly over each five-minute period, but they all get sent out at exactly the same time. The datasource would be able to handle the load if the requests were spread out over time, but it sometimes has problems absorbing this big peak of requests.
Of course the problem is twofold. We will work on expanding our infrastructure on the datasource side, but is there any way to make the alertmanager spread out the alert rule evaluations over the whole timeframe?
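To illustrate the kind of spreading I mean, here is a rough sketch in Python (just an illustration of the idea, not anything Grafana offers today): give every rule a stable offset within the shared 5-minute window, derived from a hash of its name, so evaluations land evenly across the period instead of all at once.

```python
# Sketch only: deterministic jitter that spreads rule evaluations across a
# shared interval. Each rule gets a stable offset derived from a hash of its
# name, so the evaluations land evenly over the 5-minute window.
import hashlib

INTERVAL_SECONDS = 300  # the shared 5-minute evaluation interval

def evaluation_offset(rule_name: str, interval: int = INTERVAL_SECONDS) -> int:
    """Return a stable offset in [0, interval) seconds for the given rule name."""
    digest = hashlib.sha256(rule_name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % interval

if __name__ == "__main__":
    rules = [f"rule-{i}" for i in range(10)]
    for rule in sorted(rules, key=evaluation_offset):
        print(f"{rule}: evaluate at +{evaluation_offset(rule)}s into each period")
```

With something like this the datasource would see a steady trickle of queries instead of one spike every five minutes.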
Hi! It’s not possible at present, I’m afraid.
Thank you for the answer.
I’d like to continue with a follow-up then, which might be a bit too broad, but I’m at a loss and would appreciate any ideas. How do large volumes of checks like these normally get handled? Our datasource could easily handle all of the requests if they were spread out over the whole timeframe; it just cannot handle all of them at the same time.
Is the answer to use a different alertmanager, or is there some other solution?
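For example, even splitting the rules into a fixed number of evaluation groups and staggering each group’s start offset would already flatten the peak considerably. A rough sketch of that idea (the group count is made up, just to show the effect):

```python
# Sketch only: round-robin the rules into evaluation groups and stagger each
# group's offset within the shared interval, so the datasource sees many small
# bursts instead of one large one.
INTERVAL_SECONDS = 300   # shared 5-minute evaluation interval
NUM_GROUPS = 30          # hypothetical number of evaluation groups

def group_offsets(rule_names, interval=INTERVAL_SECONDS, groups=NUM_GROUPS):
    """Assign each rule to a group; each group gets a staggered offset in seconds."""
    step = interval // groups
    return {name: (i % groups) * step for i, name in enumerate(sorted(rule_names))}

if __name__ == "__main__":
    rules = [f"check-{i:05d}" for i in range(3000)]
    offsets = group_offsets(rules)
    bursts = {}
    for off in offsets.values():
        bursts[off] = bursts.get(off, 0) + 1
    print(f"{len(rules)} rules spread over {len(bursts)} offsets, "
          f"largest burst: {max(bursts.values())} queries")
```

With 3000 rules in 30 groups, the datasource would see bursts of about 100 queries every 10 seconds instead of all 3000 at once.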
What is the datasource? Is it Prometheus? If it is, then for this volume of alert rules I’d recommend looking at Grafana Mimir OSS | Prometheus long-term storage.
We use a Prometheus-like TSDB as our datasource, and it has alerting options of its own, so doing the alerting there is an option. Moving to Mimir will probably be out of the question. Thank you again for all of your answers.
We’ve got the exact same issue. Grafana alerting isn’t scalable because of this, so I have raised a feature request for it: Alerting: distribute alert rule evaluations over time · Issue #75544 · grafana/grafana · GitHub