Why isn't Alert working?

hack3rcon · December 30, 2024, 12:03pm

Hello,
I want to create an alert for CPU usage. I installed the prometheus-alertmanager package and modified the Prometheus configuration file (prometheus.yml) as follows:

alerting:
  alertmanagers:
  - static_configs:
    - targets: ['localhost:9093']

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
  - "cpu.yml"

The contents of the cpu.yml file are as follows:

groups:
  - name: CPU_Usage_Alerts
    rules:
      - alert: HighCPUUsage
        expr: 100 * (1 - avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[1m]))) > 50
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "High CPU usage detected on {{ $labels.instance }}"
          description: "CPU usage on instance {{ $labels.instance }} has been above 50% for the last 1 minute."

As you can see in the figure below, this rule has been added in the Alert rules section:

Then I created a new dashboard with a query (ALERTS{alertstate="firing"}) like the one below:

As you can see, no alert is given:

What is wrong?

Thank you.

dawiddebowski · December 30, 2024, 4:02pm

Hi,

Was this alert ever firing? If not, then it wouldn’t be displayed.

hack3rcon · December 30, 2024, 5:04pm

Hi,
Thank you so much for your reply.
How can I check that? I don’t have access to the server right now, but I’ll check tomorrow.
I increased the CPU usage using the CpuStres tool.

Thank you.

dawiddebowski · December 30, 2024, 9:26pm

You can open the alert rule view (the eye icon next to the More button, in the top right corner of this screen), then go to the History tab.

As you can see, my alert was firing (it was in the Alerting state). If yours never fired, there would be no “firing” state in the metric. As far as I can see in my instance, only the pending and firing states are reported to the metric, so unless your alert has reached at least the yellow Pending state, nothing will show up in this metric.

hack3rcon · December 31, 2024, 6:10am

Hi,
Thanks again.
I can’t see any history tab:

I guess its location has changed:

Right?

dawiddebowski · December 31, 2024, 8:17am

I don’t think so. The second screen is the Query history, which lists the queries you’ve run. I meant the alert history: the history of alert state changes that Grafana records on its own.
I think that you might not have the history tab because either:

You have no history to be displayed.
You have no alert instances (notice that your query doesn’t return any data, so there are no alert instances, even though you have an alert rule).

Alert instance vs alert rule

An alert rule is what you’ve created: a description of what Grafana has to check. You can think of it as a Java class. An alert instance is a specific instance of the alert (there can be multiple). You can think of instances as objects created from that class.

In your case, you have a rule that says: for each node instance, create an alert if a specific condition occurs. There should be one alert instance per node instance. Your query also has a > 50 filter, which drops every series at or below 50 instead of returning it, so that might be why you don’t have any instances.

Can you try > bool(50) instead? (Your expression would then be 100 * (1 - avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[1m]))) > bool(50).) That way, you’ll get 0 when the result is less than or equal to 50 and 1 when it’s above 50. Grafana should alert on 1 by default, so the logic itself doesn’t change, but the alert should now produce data and, hopefully, history.
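
To make the difference concrete, here is a minimal sketch in Prometheus rule-file form, assuming the same node exporter metric as in cpu.yml. The recording rule name instance:cpu_usage_above_50:bool is hypothetical, and the modifier is written in its usual spelling, > bool 50. The plain > 50 comparison acts as a filter and returns a series only while usage is above 50, whereas the bool form always returns a series with value 0 or 1.

```yaml
groups:
  - name: CPU_Usage_Alerts
    rules:
      # Filtering form: a series is returned only while usage is above 50,
      # so there is no data (and no alert instance) while the CPU is idle.
      - alert: HighCPUUsage
        expr: 100 * (1 - avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[1m]))) > 50
        for: 1m

      # bool form: one series per instance is always returned,
      # with value 1 above 50 and 0 otherwise.
      # The record name below is hypothetical.
      - record: instance:cpu_usage_above_50:bool
        expr: 100 * (1 - avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[1m]))) > bool 50
```

Note that a Prometheus-managed alerting rule fires for every series its expression returns, whatever the value, so the bool form only makes sense where a separate threshold is applied to the 0/1 result, as in a Grafana-managed alert.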

hack3rcon · December 31, 2024, 10:53am

Thanks.
I did:

Why {instance="localhost:9100"}?

I need these:

What is wrong?

dawiddebowski · December 31, 2024, 2:07pm

As far as I know, node_cpu_seconds_total is a metric from node exporter, and you’re posting a screenshot from the Windows exporter, which might not have that metric (I might be mistaken, since I couldn’t find its list of metrics). The instance label you see therefore probably comes from the node exporter target in your config.

hack3rcon · January 1, 2025, 5:58am

Hi,
Thanks again.
Yes, I want to check the clients (Windows exporter dashboard), not the server itself. So should I replace node_cpu_seconds_total with the equivalent metric from the Windows exporter? Where can I find its name?
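
For reference, a minimal sketch of what the cpu.yml rule might look like for Windows clients. It assumes the Windows exporter's cpu collector is enabled and exposes windows_cpu_time_total with a mode label; that metric name is not confirmed anywhere in this thread, so check the exporter's /metrics page before relying on it. The group and alert names are hypothetical.

```yaml
groups:
  - name: Windows_CPU_Usage_Alerts        # hypothetical group name
    rules:
      - alert: WindowsHighCPUUsage        # hypothetical alert name
        # Assumed metric: windows_cpu_time_total counts CPU seconds per core and mode,
        # so the idle rate averaged per instance gives the idle fraction.
        expr: 100 * (1 - avg by(instance) (rate(windows_cpu_time_total{mode="idle"}[1m]))) > 50
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "High CPU usage detected on {{ $labels.instance }}"
```

The structure is the same as the node exporter rule; only the metric name changes, so the instance label will then come from the Windows exporter targets.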

hack3rcon · January 1, 2025, 7:34am

Hi,
I created an alert like the following:

Is this rule OK?

I put a stress test program on the CPU and a red heart appears above the CPU Utilization section:

I stopped the stress test program but the red heart is still there. Why?

dawiddebowski · January 1, 2025, 12:38pm

It’s hard to tell whether the rule is OK, since I’ve never used the Windows exporter and don’t know anything about its metrics, but the plot seems alright.

The heart is red because you’re using the Max reduce function: you gather the data from the last three hours and then (in the Reduce expression) take the maximum of those values, which according to the plot was well above 50%. The default, and what I think would suit your case, is to use Last instead of Max in the alert’s Reduce expression (expression B), or to use an Instant query instead of a Range query (right below the query itself and above the Add query button there’s an Options toggle; click it and change the query type from Range to Instant). An instant query takes only the last point of the query and is less resource-heavy on the Prometheus instance.

hack3rcon · January 1, 2025, 3:57pm

Hi,
Thanks again.
1- Can you write the complete query for me?

2- Does Grafana have the ability to display a warning message as a pop-up message in the Grafana environment when the rule condition is true?

3- How do I show real-time graphs of CPU, Memory, etc. usage in Grafana?

dawiddebowski · January 2, 2025, 11:09am

I’m not sure; as I said, I’ve never used Windows exporter metrics. From what I can see the query seems fine, and the panel also shows sensible results, so it is probably ok(-ish). The alert should look like this, though

(Last instead of Max and you should be fine).

I don’t think I understand your question. You can click the Preview button; if there are results, you should be fine, and if not, either No data or errors are displayed.
What do you mean by real time? I’ve heard about the streaming capabilities of some data sources, but I’ve never used them (nor researched them). You can try setting auto-refresh on your dashboard

but I don’t think that’s what you’re looking for.

hack3rcon · January 2, 2025, 12:38pm

Hello,
HNY.
Thanks again.
1- I don’t have access to the server right now. I’ll test and report the results.

2- I mean, when an alert is true (for example, CPU usage is above 50%), a message should appear at the top of the Grafana screen in addition to the red heart. Is this possible?

3- By real-time I mean something like GNOME System Monitor in Debian or Task Manager in Windows OS. They show the system’s performance in real time.

dawiddebowski · January 2, 2025, 6:34pm

I don’t think it is. You can do that yourself by adding a description to your panel like “If the heart is broken and red, ” but I don’t know of such a feature.
That didn’t really help. I guess you’d need some kind of streaming, but I don’t think Prometheus supports that (it scrapes metrics at intervals, so they are not real time). Anyway, I don’t think real-time metrics are a good idea, since the system has to be asked for its metrics, processing takes time, and so on. So not in Prometheus, I think.
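
The closest approximation is probably a short scrape interval combined with dashboard auto-refresh. A minimal prometheus.yml sketch, assuming a Windows exporter target on its default port 9182; the job name and host name are hypothetical, and the 5s values are only an illustration:

```yaml
global:
  scrape_interval: 5s        # how often targets are scraped; shorter means fresher data but more load
  evaluation_interval: 5s    # how often alerting and recording rules are evaluated

scrape_configs:
  - job_name: windows                        # hypothetical job name
    static_configs:
      - targets: ['windows-client:9182']     # hypothetical host; 9182 is the Windows exporter default port
```

Even then, the graphs lag by at least one scrape interval plus the dashboard refresh period, so this is near real time at best.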

hack3rcon · January 4, 2025, 6:01am

Hello,
Thank you so much.
I changed the alert configuration as follows:

But the red heart still exists. Why?

dawiddebowski · January 4, 2025, 5:31pm

Tbh your guess is as good as mine with the info I have: not enough time (group evaluation) has passed, another alert is pinned to the panel, the alert should be firing, a bug in Grafana - I’m not sure. The alert looks fine now. When you click Preview, do you see Firing in red or Normal in green?

hack3rcon · January 5, 2025, 5:51am

Hi,
Thanks again.
Yes:

And:
