Node down alert - Grafana

Hi,

Is it possible to monitor node status with Grafana? For example, if a node goes down, Grafana should trigger an alert.

I am looking at ICMP ping: when a node stops responding to ping, Grafana should trigger an alert.

We are using a Telegraf + InfluxDB + Grafana stack.

Regards
Kumar

I use the Telegraf ping plugin to ping the IP addresses of a few hosts and write the responses to the InfluxDB “ping” measurement. I currently use the “default” ping method rather than the “native” ping method of the Telegraf ping plugin (see the ping plugin documentation), since there was a recent issue with the native Go ping method, although that may now have been fixed.

I then simply use a Grafana graph to plot the ping measurement results, with an email/Slack alert when the average ping result is above 0 over a 5-minute period. The Telegraf ping result is 0 if the host is online and 1 if it is not, so the alert only triggers if the node is offline for more than 5 minutes, but this is configurable. I also then get an alert when the host comes back online.
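For reference, the Telegraf input section is roughly the following (just a minimal sketch, assuming a reasonably recent Telegraf; the host names are placeholders, and the commented-out lines only show where the count and ping method can be set, so check the ping plugin README for the options your version supports):

[[inputs.ping]]
  ## Hosts to send ping packets to (host names or IP addresses).
  urls = ["example.org", "10.0.0.1"]

  ## Number of ping packets to send per collection interval (optional).
  # count = 1

  ## Ping method: "exec" uses the OS ping binary (the default), "native" uses the Go implementation.
  # method = "exec"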

Hi,

Thanks for the update.

I tried it and it works well, however Grafana only sends the mail 5 to 8 minutes after the node goes offline.

This becomes a problem, especially in a production environment: if a node goes down we should get a mail within 1 or 2 minutes.

Regards
Kumar

Hi,

I modified the alert condition value to 0, which triggers an alert as soon as the node goes down.

[screenshot of the modified alert condition]

Regards
Kumar

Hello all, I know it's been a while on this one.

I am looking to do a similar thing. Where in the Telegraf config do you apply the IP address to ping?
Is it within the

[[inputs.ping]]
  ## Hosts to send ping packets to.
  urls = ["example.org"]

part?

Sorry, I'm a little confused.

Has anyone else used ping?
Any guidance?

I just did this yesterday. You simply put in the host names or IPs in quotes, separated by commas. For example:

urls = ["google.com", "10.0.0.1"]

Thanks for the info, much appreciated. Have you set up the alerting within Grafana for this?

I have set a query for the panel:

from(bucket: "telegraf")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r["_measurement"] == "ping")
  |> filter(fn: (r) => r["_field"] == "reply_received")
  |> aggregateWindow(every: v.windowPeriod, fn: mean, createEmpty: false)
  |> yield(name: "mean")

And the alert does not fire, as this covers a few hosts: when a host drops it does not change the metric (I tried this by stopping the Telegraf service, as this would replicate the host being down). Do you have to set this per host?
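Something like the below, perhaps? (Just a sketch; I'm assuming the ping plugin tags each point with a url tag for the host it pinged, and that 10.0.0.1 is one of the configured hosts.)

from(bucket: "telegraf")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r["_measurement"] == "ping")
  |> filter(fn: (r) => r["_field"] == "reply_received")
  |> filter(fn: (r) => r["url"] == "10.0.0.1")
  |> aggregateWindow(every: v.windowPeriod, fn: mean, createEmpty: false)
  |> yield(name: "mean")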