Hello Team,
We monitor multiple environments (standalone servers, Kubernetes, GCP, and IBM MQ) using Prometheus and Grafana, and we use Grafana alerting for event management. On the alerting side, it is difficult to create the same type of alert rule for each metric (CPU, memory, filesystem, etc.). So far we have managed by creating a separate panel for each metric.
The challenge arises when we set up alerts for MQ. I am not sure whether your team is aware of this, so let me simplify it with a cluster example. Say we have many clusters, each with multiple nodes, and monitoring is set up per node. On some nodes we monitor 10 metrics, and on others 100. For each metric, we have to set different types of alerts for different teams.
Example:
If CPU utilization > 60, send an alert to team A
If CPU utilization > 70, send an alert to team B
If CPU utilization > 80, send an alert to team C
(We have to set alerts the same way for the other metrics (memory, filesystem, etc.) as well. Some metrics have 5 rules, and others have 10-20.)
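To make the tiering concrete, here is a minimal Python sketch of the CPU rules above. The names `CPU_TIERS` and `teams_to_alert` are made up for illustration; they are not part of Grafana or Prometheus.

```python
# Hypothetical tier table for one metric: (threshold in percent, team to notify).
# The values mirror the CPU example above.
CPU_TIERS = [(60, "A"), (70, "B"), (80, "C")]


def teams_to_alert(value, tiers):
    """Return every team whose threshold is exceeded by the given value."""
    return [team for threshold, team in tiers if value > threshold]


# A reading of 75 crosses the 60 and 70 thresholds but not 80.
print(teams_to_alert(75, CPU_TIERS))  # ['A', 'B']
```

Every metric needs its own table like this, and today each row of each table is a separate panel per instance.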
If we follow the example above with the current Grafana provisioning setup, we have to create 80-100 panels for one instance. With more than 1000 instances to monitor, setting up such an alerting system and managing its rules becomes really difficult. Requirements keep changing, and we are often asked to change a threshold at short notice (for example, team A asks to be alerted when CPU utilization > 65). Even though this is a small change to the threshold, we have to open each panel and edit it by hand. With so much manual work involved, human error will be common in the alerting configuration, so this is not a reliable solution.
So my question is: is there any way to create an alerting template and apply it to multiple instances? For example, a template named cpu_alerting_60 where we set the threshold and list the instances, so that the rule is replicated across all of the listed instances.
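As a rough sketch of the behavior I am after (not an existing Grafana feature, as far as I know), the template could look something like this in Python. The class `AlertTemplate`, the metric name `cpu_utilization`, and the instance names are all hypothetical; the output fields are only loosely modeled on the shape of a Prometheus alerting rule.

```python
from dataclasses import dataclass, field


@dataclass
class AlertTemplate:
    """Hypothetical template: one threshold rule applied to many instances."""
    name: str
    metric: str          # assumed metric name, not a real exporter metric
    threshold: float
    team: str
    instances: list = field(default_factory=list)

    def expand(self):
        """Produce one rule definition per listed instance."""
        return [
            {
                "alert": f"{self.name}_{instance}",
                "expr": f'{self.metric}{{instance="{instance}"}} > {self.threshold}',
                "labels": {"team": self.team},
            }
            for instance in self.instances
        ]


cpu_alerting_60 = AlertTemplate(
    name="cpu_alerting_60",
    metric="cpu_utilization",
    threshold=60,
    team="A",
    instances=["node-1", "node-2"],  # placeholder instance names
)

rules = cpu_alerting_60.expand()
for rule in rules:
    print(rule["alert"], "->", rule["expr"])
```

With something like this, changing a threshold would be a one-line edit to the template rather than a change in every panel.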
Please share your suggestions for achieving this setup. Thanks in advance.