I have multiple hosts, each sending disk S.M.A.R.T. data to InfluxDB using the Telegraf inputs.smart plugin.
I would like to detect disk failures. Right now I alert when any disk's health_ok field isn't true, by checking its value:
```flux
from(bucket: "telegraf")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r["_measurement"] == "smart_device")
  |> filter(fn: (r) => r["_field"] == "health_ok")
  |> toInt()
```
Unfortunately, a disk can also be off, dead, or disconnected. When that happens, there is no time series for that disk at all, so the alert never triggers.
I know I can alert on a missing disk if I create an alert for each disk and set the No Data behavior to Alert, but I have many disks and would like a single rule to catch this situation.
How can I achieve this? Or, more generally: how can I fire an alert when a series goes missing from the query result? As I understand it, I can fill missing data with zeroes and alert on those zeroes when they appear, but after some time the series will always be removed anyway. How should I approach this?
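To make the generalized question concrete, here is a rough sketch of one possible direction. It is not the zero-fill idea but a staleness check: take the last point of each series over a longer lookback and turn its timestamp into an age in seconds, so a single rule could alert when that age exceeds some threshold. The 24h lookback and any threshold applied to the result are assumptions, not tested values.

```flux
from(bucket: "telegraf")
  // Assumption: a 24h lookback, so a disk that recently went silent still has a last point.
  |> range(start: -24h)
  |> filter(fn: (r) => r["_measurement"] == "smart_device")
  |> filter(fn: (r) => r["_field"] == "health_ok")
  // Keep only the most recent point of each series (one per disk).
  |> last()
  // Replace the value with the number of seconds elapsed since that point.
  |> map(fn: (r) => ({r with _value: (int(v: now()) - int(v: r._time)) / 1000000000}))
```

Even with this, a disk disappears from the result once its last point falls outside the lookback window, so the window only bounds how long a missing disk stays visible.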
Thank you all for the help!