How can I create a metric and alert to capture if a server is up or down?

My goal is to be alerted when a server is down.

The plan was to produce a metric from the server to capture up status and configure the Grafana Agent to scrape it. Then I’d make an alert that worked off that metric. The idea was that if the metric was missing, for example because the server, agent, or whole node shut down, the alert would fire because the metric is no longer being reported.

Is there a better way to be going about this?

So far I’ve gotten the Grafana Agent to scrape a metric from the server called process_start_time_seconds and query that value in the Explore page.

The problem I’ve noticed is that the metric still appears ten minutes after the server and agent have been shut down, even when using an instant query. I imagine this will cause a big problem for the alert, since it would think the server is still running for a long time after it has shut down or crashed.

Is there a way to turn off this delay in the metric disappearing, or a better way of doing this altogether?

That’s a Prometheus “feature” (not a bug). But you should alert on metric values (e.g. the up metric is 0), not on the existence of a time series (the up metric doesn’t exist), so that’s not a problem.

Thanks for the reply.

Are you saying that I should not use “metric doesn’t exist” for the alert? That instead the alert should check if a metric is 1 or 0?

Yes, you should.
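
As a rough sketch of what that could look like, here is a Prometheus-style alerting rule file. It assumes the scrape job is named my-server (a placeholder, not something from this thread) and that your setup accepts standard Prometheus rule groups. The group name, alert names, durations, and labels are illustrative too. The up series is written by the scraper for every target, so it drops to 0 when the agent is running but cannot reach the server. The second rule uses absent() as a fallback for the case where the agent or the whole node goes away and up stops being reported at all.

```yaml
groups:
  - name: server-availability        # hypothetical group name
    rules:
      # Fires when the agent is running but the scrape of the server fails.
      - alert: ServerDown
        expr: up{job="my-server"} == 0   # "my-server" is an assumed job name
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} cannot be scraped"

      # Fallback: fires when no up series exists for the job at all,
      # e.g. the agent or the whole node has gone away.
      - alert: ServerMetricsAbsent
        expr: absent(up{job="my-server"})
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "No up series is being reported for job my-server"
```

The for durations trade alert speed against flapping, so tune them to your scrape interval.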

Hi @Dylan123

Not sure if you are using Grafana OSS or Cloud or if your server has a public endpoint, but Cloud supports synthetic monitoring, which is dead simple to set up and can also be alerted on.

More here:

Thanks for the suggestion. This will probably work.

You know, I tried using Synthetic Monitoring once for this but ran into an issue where the most frequently I could set up tests to run was once an hour. But I think I was actually looking at the K6 docs, not this official “Synthetic Monitoring” offering/tool.

The K6 use cases say that K6 can be used for synthetic monitoring and for testing the availability of a production environment. Maybe I ended up trying to set up K6 to test for up/down status, ran into that frequency constraint of at most once an hour, then wrote off synthetic monitoring as not what I’m looking for. After that, I didn’t think to look up anything about Grafana Synthetic Monitoring again.

Thanks for suggesting the alternative approach.