Alerting Support for multiple Hosts and Series

Hello Team,

We monitor multiple environments (standalone servers, Kubernetes, GCP, and IBM MQ) using Prometheus and Grafana, and we use Grafana alerting for event management. On the alerting side, it is difficult to create the same type of alert rule for each metric (CPU, memory, filesystem, etc.). So far we have managed by creating a separate panel for each metric.

The real challenge comes when we set up alerts for MQ. I am not sure whether your team is aware of this, so let me simplify with a cluster example. Say we have many clusters, each running multiple nodes, and monitoring is configured per node. On some nodes we monitor 10 metrics, and on others 100. For each metric, we have to set different types of alerts for different teams:

If CPU utilization > 60, send an alert to team A
If CPU utilization > 70, send an alert to team B
If CPU utilization > 80, send an alert to team C
(We have to set alerts for the other metrics (memory, filesystem, etc.) in the same way. Some metrics have 5 rules, and others have 10-20.)
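
To make the rules above concrete, here is a rough Python sketch of how we think about them: a table of (metric, threshold, team) entries, plus a function that returns the teams to notify for a given reading. The names `Rule`, `RULES`, and `teams_to_notify` are made up for illustration and are not part of Grafana or Prometheus.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    metric: str
    threshold: float
    team: str


# Hypothetical rule table mirroring the CPU example above.
RULES = [
    Rule("cpu_utilization", 60, "A"),
    Rule("cpu_utilization", 70, "B"),
    Rule("cpu_utilization", 80, "C"),
]


def teams_to_notify(metric: str, value: float, rules=RULES) -> list[str]:
    """Return every team whose threshold for this metric is exceeded."""
    return [r.team for r in rules if r.metric == metric and value > r.threshold]


if __name__ == "__main__":
    # A reading of 75 exceeds the thresholds for teams A and B, but not C.
    print(teams_to_notify("cpu_utilization", 75))
```

Changing a threshold here means editing one row of the table, which is the kind of single point of change we would like to have for alert rules.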

If we follow the above example with the current Grafana provisioning setup, we have to create 80-100 panels for one instance. With more than 1000 instances to monitor, setting up and managing such an alerting system becomes really difficult. Requirements change frequently, and we are often asked to change a threshold immediately (for example, team A asks to be alerted when CPU utilization > 65 instead). Even though this is a small change to the threshold, we have to open each panel and edit it. With so much manual work involved, human error is likely, which makes this an unreliable solution.

So my question is: is there any way to create an alerting template and apply it to multiple instances? For example, a template named cpu_alerting_60 in which we set the threshold and list the instances, so that the rule is replicated across all of them.
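
As a rough illustration of what I mean by a template, here is a minimal Python sketch. It is not an existing Grafana feature; `AlertTemplate`, `expand`, and the field names in the generated rules are all hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class AlertTemplate:
    name: str
    metric: str
    threshold: float
    team: str


def expand(template: AlertTemplate, instances: list[str]) -> list[dict]:
    """Produce one rule definition per instance from a single template."""
    return [
        {
            "rule": f"{template.name}:{instance}",
            "instance": instance,
            "condition": f"{template.metric} > {template.threshold}",
            "notify": template.team,
        }
        for instance in instances
    ]


if __name__ == "__main__":
    cpu_alerting_60 = AlertTemplate("cpu_alerting_60", "cpu_utilization", 60, "A")
    for rule in expand(cpu_alerting_60, ["server01", "server02"]):
        print(rule)

    # Changing the threshold is a single edit to the template,
    # after which the rules are regenerated for every instance.
    cpu_alerting_65 = replace(cpu_alerting_60, name="cpu_alerting_65", threshold=65)
    print(expand(cpu_alerting_65, ["server01"])[0]["condition"])
```

The point of the sketch is that the threshold lives in one place and the list of instances in another, so neither change requires opening individual panels.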

Please share your suggestions on how to achieve this setup. Thanks in advance :slight_smile:

Good Day @melrose,

In short: how can we set one rule for multiple instances so that each instance generates its own event independently?

Is there any way to create alerting templates?

My requirement is to set up multiple alert rules for different metrics, each notifying the relevant application team. When I say multiple, I mean more than 1000.

I mean a huge number of instances, all of which need the same types of alerts. If we use a special character (such as a wildcard) in the query to monitor all instances in the same panel, it alerts only once; if another instance breaches the threshold afterwards, no new event is triggered. So I am looking for a way to set up alerting for the same metrics across all instances.

I hope this clarifies my question :slight_smile:

Let me explain this again using a screenshot. Here I am monitoring most of the pods in one panel (say, CPU utilization with an 80% threshold). How do I set up alerting here so that it triggers an individual event for each pod?

Another screenshot, where we are monitoring multiple pods, nodes, and containers:

For instance, we have 30 servers in a cluster, and we want to monitor CPU utilization and get alerted if usage goes beyond 70%.
Here is what I’m thinking:
I’ll create a dashboard panel populated with the CPU utilization of all 30 hosts. Then, when I configure alerting, it only sends an alert when the first host (say server01) breaches the 70% threshold. While the alert on server01 is still active, if any other host (say server 02, 03…) breaches the threshold, we are not alerted.
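
To spell out the behavior we are hoping for, here is a small Python sketch that keeps a separate firing state per host, so a second host breaching the threshold still produces its own event while the first is active. This is only an illustration of the desired semantics, not how Grafana evaluates alerts; `evaluate` and the host names are hypothetical.

```python
def evaluate(readings: dict[str, float], threshold: float, firing: set[str]) -> list[str]:
    """Return hosts that newly breached the threshold; update `firing` in place."""
    newly_firing = []
    for host, value in sorted(readings.items()):
        if value > threshold:
            if host not in firing:
                firing.add(host)
                newly_firing.append(host)  # one event per host
        else:
            firing.discard(host)  # host recovered, so it may alert again later
    return newly_firing


if __name__ == "__main__":
    firing: set[str] = set()
    # First evaluation: only server01 is above 70%.
    print(evaluate({"server01": 75, "server02": 50}, 70, firing))
    # Second evaluation: server01 is still firing, server02 now breaches too.
    print(evaluate({"server01": 78, "server02": 72}, 70, firing))
```

In the second evaluation server02 is reported even though server01 is still in the firing state, which is exactly what we are missing today.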

Could you please help us work through this scenario?
Thanks in advance!

Hi! Have you solved this issue? I have stumbled on the same problem here, and melrose seems to be a troll.