Hello experts, I have an alert for when CPU utilization > 95%. It worked quite well in 8.x, but around Christmas I upgraded to 9.3.2. Fortunately we had no incidents after that, but today we didn't receive an alert when one of our servers hit 99% utilization and Grafana didn't receive any data. I googled and found this: NO DATA alert is not firing · Issue #60283 · grafana/grafana · GitHub, so I upgraded to 9.4.3, but it still doesn't work.
My query covers multiple servers, so I tested with just the server that was at fault, and it kind of worked: it alerted once and then went back to Normal (even though there was still No Data), which in itself is also strange. However, it's not practical for me to create a separate alert for every server we have.
Hi @ravikiranswe! A couple questions might help get to the bottom of this:
- Did you migrate from legacy alerting to the new Grafana Alerting during any of these upgrades or were you already using Grafana Alerting on 8.x?
- Can you share your alert rule configuration? Either with a screenshot or an export.
Nothing stands out from the query itself; I'll probably need more information on the rest of the alert rule definition, such as:
- If this is a Grafana-managed alert, the rest of the expressions / the alert condition. Is it a classic condition, a reduce, or something else?
- The configured No Data behaviour.
The above should give a better idea of what the expected behaviour will be when the datasource returns no data.
Other information that would help: any logs (you might need to enable debug logging) related to that alert changing states / firing.
One more thing I noticed (when I tested with an individual alert for each server): for No Data, even though the rule says wait for 3 min, it waits for 10 minutes and then sends the No Data alert. It looks like the query has to come back empty in Query & Results before it sends the alert.
As opposed to an alert changing state from Normal to Firing, the timings for when notifications are actually sent depend on the notification policy. What are the timings and group by for the matching policy?
The "query not available" bug has been fixed in v9.4.7: https://github.com/grafana/grafana/pull/64198
Attaching screenshot of another example rule
- So this query returns the number of processors for 8 of our test servers
- Next I logged into one of the servers and stopped the windows_exporter (to simulate a high-CPU situation where the server stops responding to scrapes)
- Then nothing happened. No alert was ever sent.
- Timing: the rule runs every 1 min and waits for 3 minutes. There is no other nested configuration anywhere.
- After another 15 minutes I checked, and the server had simply disappeared from the list. That's it.
- My expectation was that I would receive a No Data alert, which does work when there is only 1 server in the alert condition.
@mjacobson any update? It is frankly quite easy to simulate the problem. I upgraded to the latest Grafana and was able to reproduce it again.
I don't want to create 40+ alerts for production and 100+ for test environments. If I am doing this the wrong way, please advise.
This is really blocking us from making Grafana our official monitoring tool. Please get back to me if you plan on looking into it; otherwise I need to look for alternative monitoring tools.
Hi! I think there is some confusion about how multi-dimensional alerting works.
When there are multiple series and one of those series "disappears", Grafana does not send a No Data alert, because one series disappearing while the other series still exist means that the "disappeared" series has now resolved. If, however, the entire query returns no series at all, then the alert state is No Data, because there are no series of any kind.
This is more or less modelled on how alerting works in Prometheus, and Grafana borrows a lot of design choices from Prometheus. That means a missing series is not a firing alert in Prometheus either.
Perhaps you can try something like this (Absent Alerting for Scraped Metrics – Robust Perception | Prometheus Monitoring Experts), which should work in both Prometheus and Grafana!
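To make the idea concrete, here is a minimal sketch of the absent-alerting pattern from that article (the `job="windows"` label is an assumption; adjust it to match your scrape config):

```promql
# Fires one alert instance per scrape target that is down
# (Prometheus sets up to 0 when a scrape of that target fails):
up{job="windows"} == 0

# Fires when the whole job has no series at all, e.g. every
# target disappeared or the scrape config was removed:
absent(up{job="windows"})
```

The `up == 0` form keeps one alert instance per host, so a single rule still gives you per-server alerts even when a host stops responding to scrapes.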
Resolved is resolved; how can a "disappearing"/missing series be resolved? It sounds more like a limitation. Anyway, what you are suggesting is: when a server gets so busy that Prometheus is not able to scrape it, or the server doesn't respond to a scrape, I should use the up metric instead. I need to check whether that works.
If a series "disappears" and never comes back, it is resolved. Take for example a K8s pod that has been migrated to another server because the first server has failed. Unless it belongs to a StatefulSet, the containers created on the other server will have different IDs, while the containers on the original server will "disappear" along with their IDs.
In the case of Prometheus metrics this looks like the old series disappearing forever (see W1NQXN in the following diagram) and new series (oKZt8U) being created.
```
Container ID |          Time --->
N9fdv1       | 1 2 3 4 x x x x ...
W1NQXN       | 1 2 3 4 x x x x ...
fHQcDf       | x x x x 1 2 3 4 ...
oKZt8U       | x x x x 1 2 3 4 ...
```
If you know that missing series will be reconciled (i.e. the series comes back after 10 minutes), you can increase the time window of your query and even tell Grafana to fill the gaps.
For example, here a series (in yellow) disappeared for about 3 minutes, but because the window on the alert is 10 minutes, there is still data in the time range. You can increase this window to tolerate missing data up to a known upper bound.
You can also increase the time looked back in the rate (here from 5m) to avoid gaps, if you know there is an upper bound on how long the series stays missing.
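As a sketch of that lookback change (the metric name is an assumption based on windows_exporter; use whatever your rule actually queries):

```promql
# A short 2m lookback leaves gaps in the result as soon as
# scrapes are missing for more than ~2 minutes:
rate(windows_cpu_time_total{mode!="idle"}[2m])

# A wider 10m lookback still finds samples in the window, so the
# series stays in the query result through short outages:
rate(windows_cpu_time_total{mode!="idle"}[10m])
```

The trade-off is that a wider window also smooths the rate, so spikes take longer to show up in the alert condition.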
Detecting when an alert instance disappears because a host has stopped sending data seems like an important feature; is there already a feature request for it?
Do I see this correctly that there are currently three workarounds available?
- Set up one rule per host instead of using multidimensional alerts.
- Alert when the number of hosts does not match the desired/previous number of hosts.
- Fill in the missing up metrics with zeroes in the data source. Does anyone know how to do this with InfluxDB?
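For the second workaround above, a Prometheus-style sketch would look like this (the job name and the expected host count of 8 are assumptions from the earlier test setup):

```promql
# Fires when fewer targets report than expected, i.e. at least one
# host's series has disappeared from the query result entirely:
count(up{job="windows"}) < 8
```

The downside is that the alert only tells you *that* a host is missing, not *which* one, and the expected count must be kept in sync with the fleet.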
This is the correct way in my opinion. You may also be able to write an InfluxDB query to compare the current labels with the labels from 1 hour ago. The query would return the number of labels that are absent but existed 1 hour ago.
I have drafted a feature request to handle this more conveniently.