How to get an alert for Proxmox CPU usage

alsturf · March 31, 2024, 2:04pm

First, I am new to Grafana, so please be patient. I have reviewed many documents as well as videos, but they seem to be using earlier versions of Grafana; mine is 10.3.1. That being said, I am trying to get an alert when my Proxmox VE CPU usage goes over 20%, for example.

The alert query I am using is:
from(bucket: "proxmoxve1")
|> range(start: v.timeRangeStart, stop: v.timeRangeStop)
|> filter(fn: (r) =>
    r._measurement == "cpustat" and
    r._field == "cpu"
)
|> filter(fn: (r) => r["host"] == "pve")
|> aggregateWindow(every: v.windowPeriod, fn: mean)

Expression B is:

Expression C is:

Expression D is:

Output is:

I have changed the values many times and can’t seem to get the results I need.

mcbrineellis · April 3, 2024, 6:29pm

Apologies, deleted my post as I used the wrong account

I didn’t test this, but I’m thinking maybe the issue is your aggregateWindow setting? You have fn: mean set, which averages the values over each time window, so short spikes can be smoothed out below your threshold.

Maybe try fn: max ?

alsturf · April 3, 2024, 6:46pm

Thanks for the response. I changed it as you suggested. I set the value to 2 and am seeing this:

I am expecting a “whole” number like this:

mcbrineellis · April 3, 2024, 7:57pm

Hmm, doesn’t look like any of the values you have seen were above 0.2 in either graph you provided…
I’m also curious why you have an “expression D” for math in your alert config.

Can you double check the threshold you’ve got configured?
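
One possible explanation, offered as an assumption based on the 0.1 and 0.2 values mentioned in this thread rather than anything confirmed about this setup: the cpu field may be stored as a fraction between 0 and 1, so 20% would appear as 0.2 and a threshold of 2 or 20 would never be crossed. A minimal sketch of the same query with the value rescaled to a 0 to 100 percentage, so that a threshold such as "is above 20" lines up with it:

```flux
// Assumption: the "cpu" field is a float fraction between 0 and 1 (0.2 = 20%).
from(bucket: "proxmoxve1")
    |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
    |> filter(fn: (r) => r._measurement == "cpustat" and r._field == "cpu")
    |> filter(fn: (r) => r["host"] == "pve")
    // createEmpty: false drops windows that contain no points instead of emitting nulls.
    |> aggregateWindow(every: v.windowPeriod, fn: mean, createEmpty: false)
    // Scale the fraction to a percentage so the value reads as 0-100.
    |> map(fn: (r) => ({r with _value: r._value * 100.0}))
```

If that assumption holds, the alternative is to leave the query untouched and set the threshold to 0.2 instead.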

I just did a little test, to demonstrate. My ESXi host is UP, so the ping value is returning “1”. I set the threshold to “is above 2” and it shows normal. When I set it to “is above 0” it shows the alert is firing.

You can adjust the threshold, and then click “Preview” and it will tell you the number of values that your alert would be “firing” for.

Please play around with this and find the right setting.

alsturf · April 3, 2024, 8:16pm

I am working on your suggestions. Why does the left side of your graph show “whole” numbers while mine shows decimals like 0.1 and 0.2?

grant2 · April 4, 2024, 12:24am

Hi @alsturf

How often are you recording data points? Every second? Every minute?

Can you give the above in terms of a time window, e.g. “I am trying to get an alert when my average Proxmox VE value goes over 20% within a 15-second time window”?

As @mcbrineellis alluded to, your aggregateWindow statement should be changed to reflect your desired alert state, e.g.

from(bucket: "proxmoxve1")
|> range(start: v.timeRangeStart, stop:v.timeRangeStop)
|> filter(fn: (r) => r._measurement == "cpustat" and r._field == "cpu")
|> filter(fn: (r) => r["host"] == "pve")
|> aggregateWindow(every: 15s, fn: mean)

Can you post a sample output (in tabular form) of the above query? That will help answer the question you asked about your output vs that of @mcbrineellis

In your alert, I think you can remove Expression D because it’s not being used anywhere in the alert.

alsturf · April 4, 2024, 9:37am

Hello @Grant2,

Here is what I am trying to accomplish: I want an alert when my max Proxmox VE CPU value goes over 50% within a 60-second time window.
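
A minimal sketch of how that requirement might translate into the alert query, reusing the bucket, measurement, field and host names from earlier in the thread; the 1m window and the max function are simply the 60-second window and "max" from the sentence above:

```flux
from(bucket: "proxmoxve1")
    |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
    |> filter(fn: (r) => r._measurement == "cpustat" and r._field == "cpu")
    |> filter(fn: (r) => r["host"] == "pve")
    // One row per 60-second window, holding the highest value seen in that window.
    |> aggregateWindow(every: 1m, fn: max, createEmpty: false)
```

With a Reduce expression set to Last and a Threshold expression on top of it, the threshold would be "is above 50" only if the field is stored in percent. If it is stored as a 0 to 1 fraction (an assumption suggested by the decimals in the graphs), the equivalent threshold would be 0.5.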

alsturf · April 4, 2024, 8:46pm

What I am trying to get is an alert if the CPU value is over 20% as in this image:

If I right click on that panel and select a new alert it loads this query:
from(bucket: "proxmoxve1")
|> range(start: v.timeRangeStart, stop: v.timeRangeStop)
|> filter(fn: (r) =>
    r._measurement == "cpustat" and
    r._field == "cpu"
)
|> filter(fn: (r) => r["host"] == "pve")
|> aggregateWindow(every: v.windowPeriod, fn: mean)

But when I preview the alert I see this:

Why don’t I see the values like shown in the first image?

alsturf · April 7, 2024, 11:21am

Any help anyone?

I can’t figure out why this is so difficult. I just need an alert when CPU usage goes over a specified number.

Thank you…

grant2 · April 7, 2024, 1:09pm

Hi @alsturf

Does this blog post help?