Why isn't Alert working?

Hello,
I want to create an alert for CPU usage. I installed the prometheus-alertmanager package and modified the Prometheus configuration file (prometheus.yml) as follows:

alerting:
  alertmanagers:
  - static_configs:
    - targets: ['localhost:9093']

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
  - "cpu.yml"

The contents of the cpu.yml file are as follows:

groups:
  - name: CPU_Usage_Alerts
    rules:
      - alert: HighCPUUsage
        expr: 100 * (1 - avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[1m]))) > 50
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "High CPU usage detected on {{ $labels.instance }}"
          description: "CPU usage on instance {{ $labels.instance }} has been above 50% for the last 1 minutes."

As you can see in the figure below, this rule has been added in the Alert rules section:

Then I created a new dashboard with a query (ALERTS{alertstate="firing"}) like the one below:

As you can see, no alert is given:

What is wrong?

Thank you.

Hi,

Was this alert ever firing? If not, then it wouldn’t be displayed.

Hi,
Thank you so much for your reply.
How can I check it? I don’t have access to the server right now, but I’ll check tomorrow.
I increased the CPU usage using the CpuStres tool.

Thank you.

You can open the alert rule view (the eye icon next to the More button, in the top right corner of this screen) and then go to the History tab.

As you can see, my alert was firing (it was in the Alerting state). If yours never fired, there would be no “firing” state in the metric. As far as I can see in my instance, only the pending and firing states are reported to the metric, so unless your alert has at least reached the yellow Pending state, nothing will show up in this metric.
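
For the Prometheus-side rule in cpu.yml, one way to check whether it can fire at all, without waiting for real load, might be a promtool unit test. The sketch below rests on several assumptions: the file name cpu_test.yml, the synthetic input series, and the instance value localhost:9100 are made up for illustration, and cpu.yml is assumed to sit in the same directory. It does not touch Grafana-managed alerts or their History tab.

```yaml
# cpu_test.yml (hypothetical file name); run with: promtool test rules cpu_test.yml
rule_files:
  - cpu.yml

evaluation_interval: 1m

tests:
  - interval: 15s
    input_series:
      # Synthetic idle counter: grows 1.5 s every 15 s, i.e. about 10% idle, 90% busy.
      - series: 'node_cpu_seconds_total{instance="localhost:9100",cpu="0",mode="idle"}'
        values: '0+1.5x40'
    alert_rule_test:
      # By 5m the expression has been above 50 for longer than the 1m "for" period.
      - eval_time: 5m
        alertname: HighCPUUsage
        exp_alerts:
          - exp_labels:
              severity: critical
              instance: localhost:9100
            exp_annotations:
              summary: "High CPU usage detected on localhost:9100"
              description: "CPU usage on instance localhost:9100 has been above 50% for the last 1 minutes."
```

If the test reports a mismatch, the rule expression or its labels are a likely place to look before looking at Grafana.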

Hi,
Thanks again.
I can’t see any history tab:

I guess its location has changed:

Right?

I don’t think so. The second screen is the Query history - what you’ve run. I meant alert history - the history of alert states that Grafana records on its own.
I think that you might not have the history tab because either:

  1. You have no history to be displayed.
  2. You have no alert instances (notice that your query doesn’t return any data, so there are no alert instances, even though you have an alert rule).

Alert instance vs alert rule

An alert rule is what you’ve created - a description of what Grafana has to check. You can think of it as a Java class. An alert instance is a specific instance of the alert (there can be multiple). You can think of instances as created objects.

In your case, you have a rule that states: for each node instance, create an alert if something specific occurs. For each instance, there should be one alert created. Your query also has a > 50 limit, so it returns nothing while usage is at or below 50, and that might be why you don’t have any instances.

Can you try > bool(50)? (Your expression would then be 100 * (1 - avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[1m]))) > bool(50).) That way, you’ll get 0 when the result is below or equal to 50 and 1 when it’s above 50. Grafana by default should alert on 1, so you risk nothing with the logic itself, but the alert should now produce data and, hopefully, history.
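
The difference between the two comparison forms can be sketched as a Prometheus recording rule, written here in the usual `> bool 50` spelling. This is only an illustration: the group name and the record name are hypothetical. The plain filter form (> 50) returns a series only while usage is above 50, whereas the bool form returns 0 or 1 for every instance at every evaluation. In a Prometheus rule file such as cpu.yml, the filter form should stay in the alert's expr, because Prometheus treats every series the expression returns as an active alert, whatever its value.

```yaml
groups:
  - name: cpu_usage_bool_example                    # hypothetical group name
    rules:
      # Always yields one sample per instance: 1 above 50% usage, 0 otherwise.
      - record: instance:cpu_usage_above_50:bool    # hypothetical record name
        expr: 100 * (1 - avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[1m]))) > bool 50
```

Graphing the recorded series would show a 0/1 line per instance, which makes it easy to see whether the threshold was ever crossed.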

Thanks.
I did:

Why {instance="localhost:9100"}?

I need these:

What is wrong?

As far as I know, node_cpu_seconds_total is a metric from node exporter, and you’re showing a screenshot of the Windows exporter dashboard, which might not have that metric (I might be mistaken, since I couldn’t find the list of metrics). So the instance label probably comes from the node exporter config.

Hi,
Thanks again.
Yes, I want to check the clients (Windows exporter dashboard), not the server itself. So should I change the node_cpu_seconds_total value to something for the Windows exporter dashboard? Where can I find this value?
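
For the Windows clients, a rule along the following lines might work. It is a sketch under assumptions: it assumes the cpu collector of windows_exporter is enabled and that the installed version exposes windows_cpu_time_total with a mode label (older releases used different metric names). The group and alert names are hypothetical. The exporter's own /metrics page (port 9182 by default) lists the metric names that are actually available.

```yaml
groups:
  - name: Windows_CPU_Usage_Alerts          # hypothetical group name
    rules:
      - alert: HighWindowsCPUUsage          # hypothetical alert name
        # Same idea as the node exporter rule: 100% minus the idle share per instance.
        expr: 100 * (1 - avg by(instance) (rate(windows_cpu_time_total{mode="idle"}[1m]))) > 50
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "High CPU usage detected on {{ $labels.instance }}"
```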

Hi,
I created an alert like the following:

Is this rule OK?

I put a stress test program on the CPU and a red heart appears above the CPU Utilization section:

I stopped the stress test program but the red heart is still there. Why?

It’s hard to tell if the rule is ok, since I’ve never used windows exporter - I don’t know anything about the metrics from there, but the plot seems alright.

The heart is red because you’re using the Max reduce function: you gather the data from the last three hours and then, in the Reduce expression, take the max of those values, which (according to the plot) was well above 50%. The default, and what I think would suit your case, is to use the Last function instead of Max in the Reduce expression of the alert (expression B). Alternatively, use an Instant query instead of a Range query: right below the query itself and above the Add query button there’s an Options toggle; click on it and change the query type from Range to Instant. An instant query takes only the last point of the query, so it is also less resource-heavy on the Prometheus instance.
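
In file-provisioning terms, the two suggestions above might look roughly like the fragment below. This is a hedged sketch, not an exact export: the field names follow what exported Grafana-managed rules typically contain and may differ between Grafana versions, the data source UID is hypothetical, the time range is arbitrary, and the query itself is left as a placeholder because it is not shown in the thread.

```yaml
# Fragment of a Grafana-managed alert rule's "data" list (assumed layout).
data:
  - refId: A                          # the Prometheus query
    datasourceUid: my-prometheus      # hypothetical data source UID
    relativeTimeRange:
      from: 600                       # illustrative: look back 10 minutes
      to: 0
    model:
      refId: A
      expr: <your CPU usage query>    # placeholder, not a real query
      instant: true                   # Instant instead of Range
      range: false
  - refId: B                          # the Reduce expression
    datasourceUid: __expr__
    model:
      refId: B
      type: reduce
      expression: A
      reducer: last                   # Last instead of Max
```

Either change alone should stop an old peak from keeping the rule in the Alerting state; exporting the rule from the UI shows the exact fields your version uses.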

Hi,
Thanks again.
1- Can you write the complete query for me?

2- Does Grafana have the ability to display a warning message as a pop-up message in the Grafana environment when the rule condition is true?

3- How do I show real-time graphs of CPU, Memory, etc. usage in Grafana?

  1. I’m not sure; as I said, I’ve never used windows exporter metrics. From what I can see the query seems fine and the panel also shows sensible results, so it is probably ok(-ish). The alert should look like this though

(Last instead of Max and you should be fine).

  2. I don’t think I understand your question. You can click the Preview button: if there are results, you should be fine; if not, either No data or errors are displayed.

  3. What do you mean by real time? I’ve heard about streaming capabilities of data sources but I’ve never used them (nor researched them). You can try to set auto-refresh on your dashboard

but I don’t think that’s what you’re looking for.

Hello,
Happy New Year.
Thanks again.
1- I don’t have access to the server right now. I’ll test and report the results.

2- I mean, when an alert is true (for example, CPU usage is above 50%), a message should appear at the top of the Grafana screen in addition to the red heart. Is this possible?

3- By real-time I mean something like GNOME System Monitor in Debian or Task Manager in Windows OS. They show the system’s performance in real time.

  2. I don’t think it is. You can do that yourself by adding a description to your panel like “If the heart is broken and red, ” but I don’t know of such a feature.
  3. That didn’t really help :smile: I guess you’d need some streaming, but I don’t think Prometheus supports that (it scrapes metrics at intervals, so they are not real time). Anyway, I don’t think real-time metrics are a good idea, since the system needs to be asked for its metrics, processing takes time, etc. So not in Prometheus, I think.

Hello,
Thank you so much.
I changed the alert configuration as follows:

But the red heart still exists. Why?

Tbh, with the info I have, your guess is as good as mine: not enough time (group evaluation) has passed, another alert is pinned to the panel, the alert really should be firing, or there’s a bug in Grafana. I’m not sure. The alert looks fine now. When you click Preview, do you see Firing in red or Normal in green?

Hi,
Thanks again.
Yes:

And: