Hello experts, I have an alert that fires when CPU utilization is > 95%. It worked quite well on 8.x, but around Christmas I upgraded to 9.3.2. Fortunately we had no incidents after that, but today we didn’t receive an alert when one of our servers hit 99% utilization and Grafana didn’t receive any data from it. I googled and found this: NO DATA alert is not firing · Issue #60283 · grafana/grafana · GitHub, so I upgraded to 9.4.3, but it still doesn’t work.
My query covers multiple servers, so I tested with just the server that was at fault and it kind of worked: it alerted once and then went back to Normal (even though there is still No Data), which in itself is also strange. However, it’s not practical for me to create a separate alert for each of the servers we have.
Yes, we did a couple of upgrades last year, including a migration from legacy to the unified alerting system while we were on 8.x. However, the alert rule itself was newly created in 9.x.
Alert rule looks like this:
100 - (100 * sum by (instance) (irate(windows_cpu_time_total{environment=~"xxx",hostname!~"yyyy",mode="idle"}[2m])) / sum by (instance) (irate(windows_cpu_time_total{environment=~"xxx",hostname!~"yyyy"}[2m])))
Not sure if it’s relevant, but when I select View rule I get
Query not available
Cannot display the query preview. Some of the data sources used in the queries are not available.
One more thing I noticed (when I tested with an individual alert for each server): for No Data, even though the rule says wait for 3 min, it waits for 10 minutes and then sends the No Data alert. It looks like the query has to be empty in Query & Results before the alert is sent.
The above should give a better idea of what the expected behaviour is when the data source returns no data.
Other information that would help would be any logs (you might need to enable debug logs) related to that alert changing states / firing.
As opposed to an alert changing state from Normal to Firing, the timings for when notifications are actually sent depend on the notification policy. What are the timings and group by settings for the matching policy?
@mjacobson any update? It is frankly quite easy to simulate the problem. I upgraded to the latest Grafana and was able to reproduce it again.
I don’t want to create 40+ alerts for production and 100+ for the test environments. If I am doing it the wrong way, please advise.
This is really blocking us from making Grafana our official monitoring tool. Please get back to me if you plan on looking into it; otherwise I need to look for alternative monitoring tools.
Hi! I think there is some confusion about how multi-dimensional alerting works.
When there are multiple series and one of those series “disappears”, Grafana does not send a No Data alert, because one series “disappearing” while the other series still exist means that the “disappeared” series has now resolved. If, however, the entire query returns no series at all, then the alert is No Data, because there are no series of any kind.
This is more or less modelled on how alerting works in Prometheus, and Grafana borrows a lot of design choices from Prometheus. That means a missing series is not a firing alert in Prometheus either.
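If you do want Prometheus itself to notice a single series going missing, the usual tool is absent_over_time(). A rough sketch (the selector values here are only placeholders, and note it needs one expression per host, so it has the same scaling problem discussed in this thread):
# Placeholder sketch: returns 1 when no idle-CPU samples for this host
# have been seen in the last 10 minutes.
absent_over_time(windows_cpu_time_total{instance="server01", mode="idle"}[10m])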
Resolved is resolved; how can a “disappearing”/missing series be resolved? It sounds more like a limitation. Anyway, what you are suggesting is that when a server gets so busy that Prometheus is not able to scrape it, or the server doesn’t respond to a scrape, I need to use the up metric instead. I need to check if that works.
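Presumably something along these lines, though I haven’t verified it yet (the job label is just my guess for the windows_exporter scrape job):
# Untested sketch: one alert instance per scrape target,
# firing when the target could not be scraped at all.
up{job="windows_exporter"} == 0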
If a series “disappears” and never comes back, it is resolved. Take for example a K8s pod that has been migrated to another server because the first server has failed. Unless it is part of a StatefulSet, the containers created on the other server will have different IDs, while the containers on the original server will “disappear” along with their IDs.
In the case of Prometheus metrics this looks like the old series (N9fdv1 and W1NQXN in the following diagram) disappearing forever and new series (fHQcDf and oKZt8U) being created.
|              | Time ----------------> |
| Container ID | Metric                 |
| N9fdv1       | 1 2 3 4 x x x x ...    |
| W1NQXN       | 1 2 3 4 x x x x ...    |
| fHQcDf       | x x x x 1 2 3 4 ...    |
| oKZt8U       | x x x x 1 2 3 4 ...    |
If you know that missing series will come back (e.g. the series reappears within 10 minutes), you can increase the time window of your query and even tell Grafana to fill the gaps with 0.
For example, here a series (shown in yellow in the screenshot) disappeared for about 3 minutes, but because the window on the alert is 10 minutes, there is still data in the time range. You can increase this window to tolerate missing data up to a known upper bound.
You can also increase the lookback window of the rate (here from 1m to 5m) to avoid gaps, if you know there is an upper bound on how long the series takes to come back.
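Applied to your rule, widening the range selector would look roughly like this (the 10m window is only an example value, pick whatever upper bound fits your environment; irate() still only uses the last two samples inside the range):
# Sketch: same expression as above with a 10m lookback, so a few minutes
# of missing scrapes do not leave the query empty.
100 - (100 * sum by (instance) (irate(windows_cpu_time_total{environment=~"xxx",hostname!~"yyyy",mode="idle"}[10m]))
           / sum by (instance) (irate(windows_cpu_time_total{environment=~"xxx",hostname!~"yyyy"}[10m])))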
Detecting when an alert instance disappears because a host has stopped sending data seems like an important feature. Is there already a feature request for it?
Am I right that there are currently three workarounds available?
Set up one rule per host instead of using multidimensional alerts.
Alert when the number of hosts does not match the desired/previous number of hosts (see the sketch after this list).
Fill in the missing up metrics with zeroes in the data source. Does anyone know how to do this with InfluxDB?
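For workaround 2, something like this might work on the Prometheus side (the job label and the expected count of 40 are placeholders I made up):
# Fire when fewer hosts report in than expected (40 is a placeholder).
count(up{job="windows_exporter"}) < 40
If the expected host count changes often, comparing against count(up{job="windows_exporter"} offset 1h) instead of a fixed number avoids hard-coding it.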
This is the correct way to do it, in my opinion. You may also be able to write an InfluxDB query that compares the current labels with the labels from 1 hour ago. The query would return the labels that are absent now but existed 1 hour ago.
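In Prometheus terms (I can’t speak to the exact InfluxDB syntax), that idea would be roughly the following; the job label is a placeholder:
# Rough sketch: instances that existed one hour ago but are missing now,
# one result per vanished host.
count by (instance) (up{job="windows_exporter"} offset 1h)
  unless
count by (instance) (up{job="windows_exporter"})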