Hello,
I want to create an alert for CPU usage. I installed the prometheus-alertmanager package and modified the Prometheus configuration file (prometheus.yml) as follows:
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
  - "cpu.yml"
The contents of the cpu.yml file are as follows:
groups:
  - name: CPU_Usage_Alerts
    rules:
      - alert: HighCPUUsage
        expr: 100 * (1 - avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[1m]))) > 50
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "High CPU usage detected on {{ $labels.instance }}"
          description: "CPU usage on instance {{ $labels.instance }} has been above 50% for the last 1 minutes."
As you can see in the figure below, this rule has been added in the Alert rules section:
Then I created a new dashboard with a query (ALERTS{alertstate="firing"}) like the one below:
As you can see, no alert is given:
What is wrong?
Thank you.
Hi,
Was this alert ever firing? If not, then it wouldn’t be displayed.
Hi,
Thank you so much for your reply.
How can I check that? I don’t have access to the server right now, but I’ll check tomorrow.
I increased the CPU usage using the CpuStres tool.
Thank you.
You can open the alert rule view (the eye icon next to the More button in the top right corner of this screen), then go to the History tab.
As you can see, my alert was firing (it was in the Alerting state). If yours never fired, there would be no “firing” state in the metric. As far as I can see in my instance, only the pending and firing states are reported to the metric, so unless your alert has at least reached the yellow Pending state, nothing will show up in this metric.
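To make the pending and firing states concrete, here is a minimal sketch of how a Prometheus-style `for:` duration moves one alert instance between states. It is a simplified model, not Grafana's or Prometheus's actual code; the names `AlertState` and `states_over_time` are made up for illustration, and the 15-second evaluation interval is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AlertState:
    """Tracks one alert instance across rule evaluations (simplified model)."""
    for_seconds: float                    # the rule's `for:` duration
    active_since: Optional[float] = None  # when the condition first became true

    def evaluate(self, now: float, condition_true: bool) -> str:
        """Return 'inactive', 'pending' or 'firing' for this evaluation."""
        if not condition_true:
            # Condition cleared: the instance disappears and the timer resets.
            self.active_since = None
            return "inactive"
        if self.active_since is None:
            self.active_since = now
        # Only after the condition has held for the whole `for:` window
        # does the instance move from pending to firing.
        if now - self.active_since >= self.for_seconds:
            return "firing"
        return "pending"


def states_over_time(conditions: List[bool], interval: float, for_seconds: float) -> List[str]:
    """Evaluate the rule every `interval` seconds against a list of condition results."""
    state = AlertState(for_seconds=for_seconds)
    return [state.evaluate(i * interval, c) for i, c in enumerate(conditions)]


if __name__ == "__main__":
    # Assumed: 15 s evaluation interval, `for: 1m` as in cpu.yml.
    print(states_over_time([False, True, True, True, True, True, False], 15, 60))
```

In this model, a short CPU spike that ends before the `for:` window elapses only ever reaches pending, so a query restricted to alertstate="firing" would stay empty.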
Hi,
Thanks again.
I can’t see any history tab:
I guess its location has changed:
Right?
I don’t think so. The second screen is the Query history, i.e. the queries you’ve run. I meant the alert history: the history of alert states that Grafana records on its own.
I think that you might not have the history tab because either:
- You have no history to be displayed.
- You have no alert instances (notice that your query doesn’t return any data, so there are no alert instances, even though you have an alert rule).
Alert instance vs alert rule
An alert rule is what you’ve created: a description of what Grafana has to check. You can think of it as a Java class. An alert instance is a specific instance of the alert (there can be multiple). You can think of instances as created objects.
In your case, you have a rule that says: for each node instance, create an alert if something specific occurs. For each instance, there should be one alert created. Your query also has the > 50 limit, and a comparison without bool drops every series that does not satisfy it, so that might be why you don’t have any instances.
Can you try > bool(50)? (So your expression would be 100 * (1 - avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[1m]))) > bool(50).) That way, you’ll get 0 when the query is below or equal to 50 and 1 when it’s above 50. Grafana by default should alert on 1, so you risk nothing with the logic itself, but the alert should now produce data and, hopefully, history.
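The difference between the plain comparison and the bool variant can be sketched in a few lines of Python. This only imitates the PromQL semantics on plain dictionaries; the function names and the second instance (`winhost:9182`) are hypothetical, and the idle rates are invented sample values.

```python
from typing import Dict, List

Series = Dict[str, float]  # instance label -> value


def cpu_usage_percent(idle_rate_per_core: Dict[str, List[float]]) -> Series:
    """Imitates 100 * (1 - avg by(instance)(rate(...{mode="idle"}[1m]))).

    Each rate is "idle seconds per second" for one core, so it lies in [0, 1].
    """
    return {
        instance: 100 * (1 - sum(rates) / len(rates))
        for instance, rates in idle_rate_per_core.items()
    }


def compare_filter(series: Series, threshold: float) -> Series:
    """`expr > threshold`: series at or below the threshold are dropped entirely."""
    return {k: v for k, v in series.items() if v > threshold}


def compare_bool(series: Series, threshold: float) -> Series:
    """`expr > bool threshold`: every series is kept, valued 1 or 0."""
    return {k: (1.0 if v > threshold else 0.0) for k, v in series.items()}


if __name__ == "__main__":
    # Invented sample data: two cores per host.
    usage = cpu_usage_percent({
        "localhost:9100": [0.75, 0.75],  # 75% idle -> 25% usage
        "winhost:9182": [0.25, 0.25],    # hypothetical host, 25% idle -> 75% usage
    })
    print(compare_filter(usage, 50))  # only the busy host remains
    print(compare_bool(usage, 50))    # both hosts remain, as 0 or 1
```

With the plain comparison, an idle machine produces an empty result, which is why the rule shows no instances and no history; with bool, the series is always present.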
Thanks.
I did:
Why {instance="localhost:9100"}?
I need these:
What is wrong?
As far as I know, node_cpu_seconds_total is a metric from node exporter, and you’re posting a screenshot from windows exporter, which (I might be mistaken, since I couldn’t find the list of metrics) might not have that metric. Therefore the instance label probably comes from the node exporter config.
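As a rough sketch of what swapping the metric would look like, the helper below rebuilds the same expression around any idle-time counter. The helper name `cpu_usage_expr` is made up, and `windows_cpu_time_total` with a mode="idle" label is an assumption about what windows exporter exposes; it is worth confirming against the exporter's own /metrics output before relying on it.

```python
def cpu_usage_expr(metric: str, window: str = "1m", threshold: int = 50) -> str:
    """Build the CPU usage alert expression around a given idle-time counter."""
    return (
        f"100 * (1 - avg by(instance) "
        f'(rate({metric}{{mode="idle"}}[{window}]))) > {threshold}'
    )


if __name__ == "__main__":
    # The expression from cpu.yml (node exporter):
    print(cpu_usage_expr("node_cpu_seconds_total"))
    # Assumed windows exporter counterpart; verify the metric name on /metrics:
    print(cpu_usage_expr("windows_cpu_time_total"))
```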
Hi,
Thanks again.
Yes, I want to check the clients (Windows exporter dashboard), not the server itself. So should I replace node_cpu_seconds_total with the equivalent metric for the Windows exporter? Where can I find it?
Hi,
I created an alert like the following:
Is this rule OK?
I put a stress test program on the CPU and a red heart appears above the CPU Utilization section:
I stopped the stress test program but the red heart is still there. Why?
It’s hard to tell if the rule is ok, since I’ve never used windows exporter - I don’t know anything about the metrics from there, but the plot seems alright.
The heart is red because you’re using the Max reduce function: you gather the data from the last three hours and then (in the reduce expression) take the max of those values, which (according to the plot) was way above 50%. The default, and what I think would suit your case, is to use the Last function instead of Max in the alert’s reduce expression (expression B), OR to use an Instant query instead of a Range query (right below the query itself and above the Add query button there’s an Options toggle; click on it and change the query type from Range to Instant). An instant query takes only the last point of the query, and it is less resource-heavy on the Prometheus instance.
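A small sketch shows why Max keeps the alert red after the stress test stops while Last does not. The sample values are invented and the function names are hypothetical; this only mimics the reduce-then-threshold step, not Grafana's implementation.

```python
from typing import List


def reduce_max(values: List[float]) -> float:
    """Reduce a range query result to its highest value."""
    return max(values)


def reduce_last(values: List[float]) -> float:
    """Reduce a range query result to its most recent value."""
    return values[-1]


def is_alerting(reduced: float, threshold: float = 50.0) -> bool:
    """Threshold step: alert when the reduced value is above the limit."""
    return reduced > threshold


if __name__ == "__main__":
    # Invented CPU usage samples over the query range: a stress test in the
    # middle, then back to idle.
    samples = [12.0, 95.0, 90.0, 15.0, 10.0]
    print(is_alerting(reduce_max(samples)))   # the old spike still counts
    print(is_alerting(reduce_last(samples)))  # only the current value counts
```

With Max, the alert stays in the alerting state until the spike has scrolled out of the query's time range (three hours here); with Last, it clears as soon as the latest sample drops below the threshold.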
Hi,
Thanks again.
1- Can you write the complete query for me?
2- Does Grafana have the ability to display a warning message as a pop-up message in the Grafana environment when the rule condition is true?
3- How do I show real-time graphs of CPU, Memory, etc. usage in Grafana?
1- I’m not sure; as I said, I’ve never used windows exporter metrics. From what I can see the query seems fine, and the panel also shows sensible results, so it is probably ok(-ish). The alert should look like this though (Last instead of Max and you should be fine).
2- I don’t think I understand your question. You can click the Preview button: if there are results, you should be fine; if not, either No data or errors are displayed.
3- What do you mean by real time? I’ve heard about the streaming capabilities of data sources, but I’ve never used them (nor researched them). You can try setting auto-refresh on your dashboard, but I don’t think that’s what you’re looking for.
Hello,
Happy New Year.
Thanks again.
1- I don’t have access to the server right now. I’ll test and report the results.
2- I mean, when an alert is true (for example, CPU usage is above 50%), a message should appear at the top of the Grafana screen in addition to the red heart. Is this possible?
3- By real-time I mean something like GNOME System Monitor in Debian or Task Manager in Windows OS. They show the system’s performance in real time.
Hello,
Thank you so much.
I changed the alert configuration as follows:
But the red heart still exists. Why?
Tbh your guess is as good as mine with the info I have: not enough time (group evaluation) has passed, another alert is pinned to the panel, the alert should actually be firing, or there is a bug in Grafana. I’m not sure. The alert rule is fine now. When you click Preview, do you see Firing in red or Normal in green?