Checking service availability and host label

Hello. Please help me figure out the following problem: I use a stack of telegraf+prometheus+grafana. In telegraf, I use several plugins to collect metrics from software such as postgresql, redis, etc.
I am now setting up alerts for service unavailability and cannot work out how to do it correctly. The metric collector does not return availability metrics like “postgresql_up”, so I am trying to use the existing metrics for this task. The idea is that the alert should fire when a metric stops updating. But I ran into a problem with labels: it is extremely important for me to see in the notification which host the alert is for. The simplest solution is to take some metric and wrap it in the absent function. But in that case I lose the labels and cannot tell the source host from the notification. There is also the standard no-data notification, but it has the same problem: the data generated by the alert has no host label. I have been searching the Internet and asking AI for two days, with no result so far.

I would appreciate any advice on how to solve this problem.

Yes, I saw this proposal and tried it. But in the end the service state is only reported until the selected shift time has passed; after that the alert no longer reflects it. Is there really no single solution to such a common problem?

Yes, but I guess you won’t like it: create a dedicated alert for each host.


You could try a range query with the time range set to now-1h (so you’d get an hour of data). If your service is active, you should have a set number of points (+/- a couple for data that is about to be ingested), so you could (theoretically) use a Reduce expression of type Count and alert if there are fewer than half the expected datapoints. A disadvantage: if your pipeline fails (e.g. telegraf stops sending data, prometheus fails to ingest data, etc.), you’d get the alert as well. Also, those alerts would probably resolve themselves after one hour, so it might be misleading for the person who handles the alert. If you don’t like that, I guess you can only create dedicated alerts, as Jan said.
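
To make the counting idea a bit more concrete, here is a rough Python sketch of the logic only. It is not Grafana or Prometheus configuration. The 15-second scrape interval, the host names db1 and db2, and the function names are all assumptions made up for the illustration; plug in whatever your setup actually uses.

```python
from datetime import datetime, timedelta, timezone


def expected_points(window: timedelta, scrape_interval: timedelta) -> int:
    """Number of datapoints a healthy series should have in the window."""
    return int(window / scrape_interval)


def hosts_below_threshold(samples, now, window, scrape_interval, threshold=0.5):
    """Return hosts whose datapoint count in the window is below
    threshold * expected count.

    samples: dict mapping a host label to a list of sample timestamps
    (hypothetical structure, standing in for one series per host).
    """
    cutoff = now - window
    minimum = expected_points(window, scrape_interval) * threshold
    flagged = []
    for host, timestamps in samples.items():
        # Equivalent of a "Count" reduction over the last hour of data.
        count = sum(1 for ts in timestamps if cutoff <= ts <= now)
        if count < minimum:
            flagged.append(host)
    return sorted(flagged)


# Assumed values for the example: 1 h window, 15 s scrape interval.
now = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
window = timedelta(hours=1)
interval = timedelta(seconds=15)

samples = {
    # db1 reported for the whole hour: 240 points.
    "db1": [now - i * interval for i in range(240)],
    # db2 reported for the first 20 minutes of the window, then stopped: 80 points.
    "db2": [now - window + i * interval for i in range(80)],
}

flagged = hosts_below_threshold(samples, now, window, interval)
print(flagged)  # ['db2']
```

Because the count is taken per series, the host label stays attached to the result, which is the point of this approach. The sketch also shows the one-hour limitation: once every point of db2 has aged out of the window, there is nothing left to count for that host.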
