We are monitoring our infrastructure with Telegraf, InfluxDB, and Grafana, and we also use Grafana's alerting. One graph that we use for alerts shows the memory usage of our systems grouped by host, using a query like this:
SELECT mean("used_percent") FROM "mem" WHERE $timeFilter GROUP BY time(1m), "host" fill(null)
When configuring the alert there is an option to trigger when no data is received. The problem is that this alert only triggers when all hosts stop sending data.
Is there a way to configure the alert so that it also triggers when a single host stops sending metrics, other than using a separate graph + alert per host?
That’s not possible at the moment; a single graph + alert per host is the way to go for now, I think. However, if you have a finite set of hosts that you know should report metrics, you should be able to set up a single graph with an alert that triggers when you don’t get a result from one or more of the hosts, e.g. you expect a count of 10 but receive 9 = alert.
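The count-based idea above could be sketched with an InfluxQL subquery like the following. This is a hypothetical example, not a tested query: the inner query produces one `mean` value per host per interval, and the outer query counts them, so the alert condition would be "value below 10" (or whatever your expected host count is):

```sql
-- Hypothetical: count how many hosts reported in each 1m interval
SELECT count("mean") FROM (
  SELECT mean("used_percent") AS "mean" FROM "mem"
  WHERE $timeFilter GROUP BY time(1m), "host"
) WHERE $timeFilter GROUP BY time(1m)
```

Note that subqueries require InfluxDB 1.2 or later, and the expected host count has to be maintained by hand as hosts come and go.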
There are various feature-request issues for alerting in the Grafana GitHub repo. Feel free to upvote any feature requests you’re interested in.
I solved it the following way in the meantime: since we have one central metrics server, I configured the Telegraf ping plugin with all of our hosts’ IPs (generated via Ansible) and added one graph with the query:
SELECT mean("percent_packet_loss") FROM "ping" WHERE "stage" = 'prod' AND $timeFilter GROUP BY time(1m), "url" fill(null)
and now have an alert configured for when the value exceeds 90% for some period of time. It’s not the same as “detect when hosts no longer send data”, but it works quite well for detecting hosts that are down in general.
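For anyone wanting to replicate this, the Telegraf side might look roughly like the snippet below. This is an illustrative sketch, not our exact config: the IPs and the `stage` tag are placeholders (in our setup the `urls` list is generated from the Ansible inventory):

```toml
# Telegraf ping input on the central metrics server.
# One entry per monitored host; the list is rendered by Ansible in practice.
[[inputs.ping]]
  urls = ["10.0.0.1", "10.0.0.2"]  # placeholder host IPs
  count = 3                        # pings per collection interval
  [inputs.ping.tags]
    stage = "prod"                 # matches the WHERE "stage" = 'prod' filter
```

The `stage` tag is what the dashboard query above filters on, and grouping by `"url"` gives one alert series per host.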
We would like to introduce how we tackle the problem of detecting data staleness altogether. I recognize this solution takes a different approach, but it might nevertheless resonate with you.