Alerting support for multiple hosts and series

Hello Team,

We are monitoring multiple environments (standalone servers, Kubernetes, GCP, and IBM MQ) with Prometheus and Grafana, and Grafana alerting is used for event management. On the alerting side, it is difficult to create the same type of alerting rule for each metric (CPU, memory, filesystem, etc.). So far we have managed this by creating a separate panel for each metric.

The real challenge comes when placing alerts for MQ; I'm not sure if your team has run into this. Let me try to simplify with a cluster example. Say we have many clusters, each with multiple nodes, and monitoring is set up node by node. On some nodes we monitor 10 metrics, on others 100, and for each metric we have to set different types of alerts for different teams.

Example:
If CPU utilization > 60, send an alert to team A
If CPU utilization > 70, send an alert to team B
If CPU utilization > 80, send an alert to team C
(We have to set alerts like this for the other metrics (memory, filesystem, etc.) as well. Some metrics have 5 rules, others 10-20.)
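To make the shape of the requirement concrete, here is a minimal sketch (the function and rule format are made up for illustration, not an existing Grafana feature) of how one metric's tiered thresholds could expand into individual rules:

```python
# Hypothetical sketch: expand one metric's threshold tiers into alert rules.
# The thresholds and teams mirror the CPU example above; names are invented.

def expand_rules(metric, tiers):
    """Return one rule dict per (threshold, team) tier for a metric."""
    return [
        {
            "name": f"{metric}_above_{threshold}",
            "expr": f"{metric} > {threshold}",
            "notify": team,
        }
        for threshold, team in tiers
    ]

cpu_tiers = [(60, "team A"), (70, "team B"), (80, "team C")]
rules = expand_rules("cpu_utilization", cpu_tiers)
for rule in rules:
    print(rule["name"], "->", rule["notify"])
```

Multiply this by every metric (memory, filesystem, etc.) and every instance, and the rule count grows very quickly.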

With the current Grafana provisioning setup, the above example means 80-100 panels for a single instance. When monitoring more than 1,000 instances, it becomes very difficult to set up such an alerting system, and managing the rules is just as hard. Requirements keep changing, and we are often asked to change a threshold immediately (e.g. team A asks to be alerted when CPU utilization > 65). Even though that is only a small change to a threshold, we have to open each panel and edit it by hand. With that much manual work, human error becomes common, which makes the alerting unreliable.

So my question is: is there any way to create an alerting template and apply it to multiple instances? For example, a template like cpu_alerting_60 where we set the threshold once and list the instance details, so the rule is replicated across all mentioned instances.
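What I have in mind behaves roughly like this sketch (purely illustrative; `cpu_alerting_60`, the helper function, and the rule format are invented, not existing Grafana functionality): one template definition that expands into an independent rule per instance:

```python
# Illustrative only: a made-up "alert template" replicated per instance.
# Nothing here corresponds to a real Grafana API; it just shows the
# desired behaviour of one template fanning out to many instances.

def replicate_template(name, metric, threshold, instances):
    """Expand one template into one independent rule per instance."""
    return {
        instance: {
            "template": name,
            "expr": f'{metric}{{instance="{instance}"}} > {threshold}',
        }
        for instance in instances
    }

rules = replicate_template(
    "cpu_alerting_60", "cpu_utilization", 60,
    ["server_01", "server_02", "server_03"],
)
print(rules["server_01"]["expr"])
```

Changing the threshold in one place would then update it for every instance at once, instead of editing each panel by hand.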

Please share your valuable suggestions on how to achieve this setup. Thanks in advance :slight_smile:

Hello and welcome to the forums. That's a lot of text for a simple question; could you please distil it to the essence? Thank you.

Good Day @melrose,

In short: how can I set one rule for multiple instances that generates an event independently for each instance?

Is there any way to create alerting templates?

My requirement is to place multiple alert rules for different metrics, notifying the various application teams accordingly. When I say multiple, the count should be considered as more than 1,000.

[quote="srikantasahu, post:3, topic:42693"]
the count should be considered as more than 1,000
[/quote]

OK, not quite clear. Could you be a bit more precise?

I mean a huge number of instances where we have to place the same type of alerting. If we use a wildcard in the query to monitor all instances in one panel, it alerts only once: while that alert is active, a breach on any other instance does not trigger a new event. So I am looking for a way to set up alerting on the same metric across all instances.

I hope this clarifies my question :slight_smile:

[quote="srikantasahu, post:5, topic:42693"]
I hope this clarifies my question
[/quote]

Sorry, no.

Let me try to explain again using a screenshot. Here I am monitoring most of the pods in one panel (say we are alerting on CPU utilization at 80%). How do I set up alerting here so that it triggers an individual event for each pod?

Looks strange to me. Is this Grafana?

I'm the senior technical expert here and we try to help you as best as possible. Please fill out the support template first:

thank you

Sorry, I was not able to find the template here. Can you please share the details or a link I can follow? Or can I use the GitHub support template here?

Another screenshot for this, where we are monitoring multiple pods, nodes, and containers:

Yes, sure, the template is:
Please fill out:
- number of years of experience in computer science
- number of years of experience in Grafana
- what did you do to resolve the problem
- which other forums did you try to get help on (give the link to the forum)
- which Grafana version shows the error and which versions do not (try all major and minor versions)
- steps to reproduce
- excluded errors and reasons why

Below are the details:
- number of years of experience in computer science: 9 years
- number of years of experience in Grafana: 6 months
- what did you do to resolve the problem: checked log files, followed the documentation and other support sites, etc.
- which other forums did you try to get help on: GitHub, but they closed the issue and redirected me here
- which Grafana version shows the error and which versions do not: I think this is not an error; we need to find a solution for it. (Current: 7.3.1)
- steps to reproduce:
create a panel monitoring one metric for multiple instances and place an alert on it (e.g. cpu_usage_seconds_total{instance=~"server_name.*"} — note that Prometheus needs the regex matcher =~ here, since a plain = match is exact)
- excluded errors and reasons why:
When we place an alert on that panel, it triggers once the threshold is crossed. But while that alert is active, if another instance breaches the threshold, no new event is triggered. To avoid this we have to create a separate panel for every instance (e.g. cpu_usage_seconds_total{instance="server_01"}, cpu_usage_seconds_total{instance="server_02"}, cpu_usage_seconds_total{instance="server_03"}, and so on...)
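Since Grafana dashboards can be provisioned from JSON files, the per-instance panels described above don't have to be created by hand; a script can generate them. A rough sketch (the panel dicts are trimmed to a minimum and the field names are only indicative; real Grafana panel JSON requires many more keys):

```python
import json

def panel_for(instance, panel_id, threshold=80):
    """Build a minimal panel dict with a per-instance CPU query.
    Real Grafana panel JSON has many more required fields; this only
    shows the parts that vary per instance."""
    return {
        "id": panel_id,
        "title": f"CPU {instance}",
        "targets": [
            {"expr": f'cpu_usage_seconds_total{{instance="{instance}"}}'}
        ],
        # Placeholder for the per-panel alert threshold in legacy alerting.
        "alert_threshold": threshold,
    }

instances = [f"server_{n:02d}" for n in range(1, 31)]  # 30 servers
dashboard = {
    "title": "CPU per instance",
    "panels": [panel_for(name, idx + 1) for idx, name in enumerate(instances)],
}
# In practice this JSON would be written into a provisioning directory.
print(json.dumps(dashboard)[:80])
```

This keeps the threshold in one place in the script, so a change like "alert at 65 instead of 60" becomes a one-line edit and a re-provision instead of opening every panel.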

For instance, we have 30 servers in a cluster and we want to monitor CPU utilization and be alerted if usage goes beyond 70%.
Here is what I'm thinking:
I'll create a dashboard panel with the CPU utilization of all 30 hosts in it. Then, when I configure alerting, it only sends an alert when the first host (say server01) breaches the 70% threshold; while the alert on server01 is still active, if any other host (say server02, 03, ...) breaches the threshold, we don't get alerted.

Could you please help us work through this scenario.
Thanks in advance!

OK, do you have a budget?