Can Grafana fire an alert when my query's condition is violated in the majority of evaluations?

Hi all,

Example:

  1. Let's say I have a metric that shows CPU utilization, and an alert should fire if utilization crosses 80%.

  2. I have created an alert rule that evaluates my query every 1m for 5m.

  3. This basically means my query will be evaluated 5 times in 5 minutes (1m + 1m + 1m + 1m + 1m) before I receive an alert.

  4. Now, let's say that for the first 3 evaluations CPU utilization was above 80%, but for the next 2 evaluations it dropped to 60%. This tells me that in the last 5 minutes, for the majority of the time (3 out of 5 evaluations), my CPU utilization was higher than I wanted.

  5. What I need is a way for Grafana to fire an alert based on that majority: if my condition is violated in 3 out of 5 evaluations, fire an alert; otherwise, I don't need one.
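The "3 out of 5" rule above amounts to a k-of-n vote over the most recent evaluations. Here is a minimal Python sketch of that logic only, not of how Grafana evaluates rules. It assumes one CPU utilization value per 1m evaluation, an 80% threshold, k = 3 and n = 5. The helper name `majority_violated` and the values 85, 90 and 82 are made up for illustration:

```python
from collections import deque


def majority_violated(values, threshold=80.0, k=3, n=5):
    """Return True if at least k of the last n values exceed the threshold.

    Hypothetical helper illustrating a k-of-n rule; not a Grafana feature or API.
    """
    window = deque(values, maxlen=n)  # keeps only the n most recent evaluations
    violations = sum(1 for v in window if v > threshold)
    return violations >= k


# Scenario from the question (assumed values): 3 evaluations above 80%, then 2 at 60%.
print(majority_violated([85.0, 90.0, 82.0, 60.0, 60.0]))  # True: 3 of 5 exceed 80%
print(majority_violated([85.0, 60.0, 60.0, 82.0, 60.0]))  # False: only 2 of 5
```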

Is this possible? Can anyone help me out here?

This is more or less what the mean (average) function does in a Reduce expression. However, instead of looking at the past 5 evaluations, it takes the average of all data points returned by the query. This means you can write a query that averages CPU usage over the last 5 minutes in 1-minute intervals, and then alert on the average of those averages.
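To see how the suggested mean-based approach compares with the majority rule, here is a small Python sketch on assumed sample values. It reduces five 1-minute samples to their mean (roughly what a Reduce expression set to Mean does) and also counts violations. The two rules can disagree. The mean depends on how far each sample is from the threshold, while a majority rule only counts how many samples are over it. The function names are hypothetical:

```python
from statistics import mean


def mean_violated(samples, threshold=80.0):
    """Reduce the samples to their mean and compare it with the threshold."""
    return mean(samples) > threshold


def majority_rule(samples, threshold=80.0, k=3):
    """Fire if at least k samples exceed the threshold."""
    return sum(1 for s in samples if s > threshold) >= k


# Hypothetical 1-minute CPU utilization samples over the last 5 minutes.
samples = [85.0, 85.0, 85.0, 60.0, 60.0]
print(mean(samples))           # 75.0
print(mean_violated(samples))  # False: the average is below 80%
print(majority_rule(samples))  # True: 3 of 5 samples exceed 80%
```

With samples of 95% and 70% instead, the mean is 85% and both rules fire. Which rule fits depends on whether the size of each spike should matter.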

Hi George,

Thanks for the response, but it seems I would need to rewrite my dashboard query. I was hoping the alert's evaluation behavior itself could do this. Is that possible?

Hi! You might need to change parts of the dashboard query when writing your alert query, as queries cannot always be copied 1:1 when the alert needs to do something different from the visualization.

Hi @georgerobinson, I have written (1 - avg(rate(node_cpu_seconds_total{mode="idle"}[$__rate_interval])) by (instance)) * 100, which calculates my CPU utilization %.
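For reference, here is a rough Python model of what that expression computes for one instance. It assumes rate() can be approximated as each core's idle-seconds increase divided by the window length, ignoring counter resets and the extrapolation Prometheus applies. The counter readings and the function name are invented:

```python
def cpu_utilization_percent(idle_start, idle_end, window_seconds):
    """Rough model of
    (1 - avg(rate(node_cpu_seconds_total{mode="idle"}[w])) by (instance)) * 100
    for a single instance.

    idle_start, idle_end: per-core idle-seconds counter readings at the start
    and end of the window. rate() is simplified to (end - start) / window,
    ignoring counter resets and Prometheus' extrapolation.
    """
    rates = [(end - start) / window_seconds for start, end in zip(idle_start, idle_end)]
    avg_idle_fraction = sum(rates) / len(rates)  # average idle fraction across cores
    return (1 - avg_idle_fraction) * 100


# Hypothetical 2-core instance over a 60s window: core 0 idle for 45s, core 1 for 30s.
print(cpu_utilization_percent([1000.0, 2000.0], [1045.0, 2030.0], 60))  # 37.5
```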

I have built an alert on this query using Reduce and Math expressions, and set the alert to evaluate every 1m for 5m.
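With "evaluate every 1m for 5m", the 5m is the pending period. The rule goes to Pending on the first evaluation where the condition is true, and only moves to Firing once the condition has stayed true for the whole pending period. Any evaluation where it is false sends it back to Normal, which is why a 3-true, 2-false pattern never fires. The sketch below is a simplified model of that behavior. It ignores No Data and Error handling and any other rule settings, and the function name is hypothetical:

```python
def pending_states(condition_results, interval_min=1, pending_min=5):
    """Simplified model of a rule evaluated every `interval_min` minutes with a
    pending period of `pending_min` minutes. Returns the state after each
    evaluation. Ignores No Data / Error handling; not Grafana's actual code."""
    states = []
    pending_since = None  # time (in minutes) the condition first became true
    for i, condition_true in enumerate(condition_results):
        now = i * interval_min
        if not condition_true:
            pending_since = None  # any false evaluation resets the rule to Normal
            states.append("Normal")
            continue
        if pending_since is None:
            pending_since = now  # first true evaluation starts the pending period
        if now - pending_since >= pending_min:
            states.append("Firing")
        else:
            states.append("Pending")
    return states


# The scenario from the question: true 3 times, then false twice.
print(pending_states([True, True, True, False, False]))
# ['Pending', 'Pending', 'Pending', 'Normal', 'Normal']
print(pending_states([True] * 6)[-1])  # Firing (after 5 minutes of continuously true evaluations)
```

In this model, six consecutive true evaluations (at minutes 0 through 5) are needed before the state becomes Firing, not five.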

Now, can you please tell me where to change the query to get the outcome I described?

Hi! You can edit the alert rule either from the dashboard or the alert rules page, and then change the query under Set a query and alert condition.


Thanks @georgerobinson, I will try this. Also, is there any way in Grafana to count the number of times my threshold was violated in a day?
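Counting violations in a day can mean two things: how many samples were above the threshold, or how many separate episodes started. On the Prometheus side, a subquery such as count_over_time((<utilization query> > 80)[1d:1m]) may give the first number directly. The Python sketch below only illustrates the difference between the two counts on made-up 1-minute samples. The function names are hypothetical:

```python
def count_violations(samples, threshold):
    """Number of samples strictly above the threshold."""
    return sum(1 for s in samples if s > threshold)


def count_episodes(samples, threshold):
    """Number of times the series went from at/below the threshold to above it.
    A series that starts above the threshold counts as one episode."""
    episodes = 0
    previously_above = False
    for s in samples:
        above = s > threshold
        if above and not previously_above:
            episodes += 1
        previously_above = above
    return episodes


# Hypothetical 1-minute samples containing two separate spikes above 80%.
day = [50.0] * 10 + [85.0] * 3 + [50.0] * 5 + [90.0] * 2 + [50.0] * 10
print(count_violations(day, 80.0))  # 5 samples above 80%
print(count_episodes(day, 80.0))    # 2 separate episodes
```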

Hi @georgerobinson

  1. Right now, my panel's query is avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[$__rate_interval])) * 100, which tells me the idle % of my CPU.

  2. I have set up an alert using a Reduce expression.

With this, I get the count of all the data points in the specified time range (i.e., the 5 minutes I have set).
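A Reduce set to Count counts every point in the range, whatever its value. One way to count only the violating points is to turn each point into 1 or 0 first, for example with a Math expression such as $A > 99.78 (comparison operators there are expected to return 1 for true and 0 for false). Then apply a Reduce with Sum and put the alert condition on that sum. The Python sketch below only models the arithmetic of that chain. The stage names, the sample values, and the choice of "above" 99.78 are assumptions:

```python
def math_stage(points, threshold):
    """Models a Math expression like `$A > threshold`: 1 where true, else 0."""
    return [1 if p > threshold else 0 for p in points]


def reduce_sum(points):
    """Models a Reduce expression using Sum."""
    return sum(points)


def threshold_stage(value, minimum):
    """Models the final alert condition: fire if value >= minimum."""
    return value >= minimum


# Hypothetical idle % samples over the last 5 minutes (one per minute).
idle = [99.81, 99.75, 99.90, 99.79, 99.70]
flags = math_stage(idle, 99.78)  # [1, 0, 1, 1, 0]
violations = reduce_sum(flags)   # 3
print(violations, threshold_stage(violations, 3))  # 3 True
```

Flip the comparison to < if "crossed" means dropping below 99.78% idle.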

  3. But I don't want to count all the data points here. I have put a threshold at 99.78%, and I just want the number of times my time series crossed that threshold, and then to alert on that count.

  4. That way, the alert can tell me how many times my threshold was violated in the last 5 minutes.

  5. I have also tried using a Classic condition, but the output was 0.

  6. Next, I tried writing a query that simply counts my idle % query, but it returned 100.

count (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[$__rate_interval]))) * 100 > 99.78

  7. Can you tell me if there is something I am missing here?
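A possible explanation for the 100: in PromQL, count() aggregates across series at a single instant, not across samples over time. If the inner avg by (instance) (...) returns one series, count() returns 1, and multiplying that by 100 gives 100. Counting samples within each series over a range is what count_over_time() does, typically over a subquery. The Python sketch below models the difference on invented data. It is not Prometheus' implementation, and the function names are hypothetical:

```python
# Hypothetical instant vector: one value per series at a single evaluation time.
instant_vector = {'{instance="node1:9100"}': 99.85}

# Hypothetical samples for the same series over the last 5 minutes, 1m apart.
range_samples = {'{instance="node1:9100"}': [99.85, 99.71, 99.90, 99.79, 99.68]}


def count_series(vector):
    """Models PromQL count(): the number of series in an instant vector."""
    return len(vector)


def count_over_time_above(ranges, threshold):
    """Models count_over_time((expr > threshold)[5m:1m]) per series:
    the comparison drops samples that fail it, and the rest are counted."""
    return {
        series: sum(1 for v in values if v > threshold)
        for series, values in ranges.items()
    }


print(count_series(instant_vector) * 100)  # 100, whatever the values are
print(count_over_time_above(range_samples, 99.78))
```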