Clean up ephemeral nodes before shutdown

Hello, I’m trying to replace an ancient Grafana setup (backed by InfluxDB) with modern Grafana (12.3) as a proof of concept. I’ve got Grafana configured and able to query InfluxDB, and now I’m trying to build replacements for what exists now.

We have a lot of ephemeral nodes that create a dashboard (with alerts) for themselves when they come up, then remove the dashboard (and alerts) when they shut down. This works well enough, but now that Grafana alerting has decoupled alerts from dashboards, I’d like to use multi-dimensional alerting to avoid all that. Instead of 20 webservers coming online and each one creating a dashboard for itself that monitors webserver things, I’ll just have alerts that monitor all of the webservers and notify if something is wrong with any, many, or all of them.

So far, so good. Where I’m getting hung up is teardown: when scaling down, I don’t see how to tell Grafana not to alert when a node goes away intentionally. I could just turn off “No Data” alerts, but those are useful when a server goes unresponsive due to an OOM or I/O issue or something similar.

The best thing I can think of is to keep the grouped alerts as I’ve described them, ignoring No Data conditions, and then have each node set up a single alert (e.g., on CPU) just to catch No Data conditions, removing it on shutdown. This feels kludgy to me, so I’m creating this post to see if there’s a more elegant way to handle this, or if I’m missing something obvious, or just overthinking it.
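If the per-node No Data alert turns out to be the way to go, its lifecycle can at least be scripted against Grafana’s alert provisioning HTTP API (`/api/v1/provisioning/alert-rules`): create the rule on boot, delete it by UID in the shutdown hook. A minimal sketch; the URL, token, folder UID, datasource UID, and query body are all assumptions you’d adapt to your own setup:

```python
import json
import urllib.request

GRAFANA_URL = "http://grafana:3000"  # assumption: your Grafana instance
API_TOKEN = "glsa_example_token"     # assumption: a service-account token

def deadman_rule_payload(node: str) -> dict:
    """Build a minimal per-node alert rule that fires on No Data.

    The folder UID, datasource UID, and query model are placeholders;
    fill them in from your own Grafana configuration.
    """
    return {
        "title": f"deadman-{node}",
        "ruleGroup": "deadman",
        "folderUID": "ephemeral-nodes",       # assumption
        "condition": "A",
        "noDataState": "Alerting",            # the whole point: No Data fires
        "execErrState": "Alerting",
        "for": "5m",
        "data": [{
            "refId": "A",
            "datasourceUid": "influxdb-uid",  # assumption
            "relativeTimeRange": {"from": 600, "to": 0},
            "model": {"query": f"cpu metrics filtered to host {node}"},  # placeholder
        }],
    }

def create_rule(node: str) -> urllib.request.Request:
    # On boot: POST the rule; the response body includes the rule's UID.
    return urllib.request.Request(
        f"{GRAFANA_URL}/api/v1/provisioning/alert-rules",
        data=json.dumps(deadman_rule_payload(node)).encode(),
        headers={"Authorization": f"Bearer {API_TOKEN}",
                 "Content-Type": "application/json"},
        method="POST",
    )  # send with urllib.request.urlopen(...)

def delete_rule(rule_uid: str) -> urllib.request.Request:
    # On clean shutdown: remove the node's rule by the UID saved at create time.
    return urllib.request.Request(
        f"{GRAFANA_URL}/api/v1/provisioning/alert-rules/{rule_uid}",
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        method="DELETE",
    )
```

The catch is the same one you already identified: a node that dies unexpectedly never runs its shutdown hook, so its rule lingers (which, for a deadman alert, is arguably the desired behavior).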

Thanks in advance for any help!

OK, so you have No Data. How do you know from the metrics whether it’s No Data because of a scale-down event (so no real problem) or because of a real node issue?
I’m guessing you don’t have an operational “state”, so you need that first, as a metric. Then you can join it with your existing metrics and alert only on resources that are (or should be) running, not on terminated ones.
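The join described here can be illustrated outside any particular query language. A toy Python sketch, where `desired_running` stands in for the operational-state metric and `reporting` stands in for hosts seen in the metrics recently:

```python
def nodes_to_alert(desired_running: set[str], reporting: set[str]) -> set[str]:
    """A node is a problem only if it *should* be running but sent no data.

    Nodes absent from desired_running were scaled down intentionally,
    so their silence is expected and produces no alert.
    """
    return desired_running - reporting

# Example: web03 was scaled down (not in desired state), web02 went silent.
desired = {"web01", "web02"}
seen = {"web01", "web03"}
print(nodes_to_alert(desired, seen))  # -> {'web02'}
```

In practice you would express the same set difference in your datasource’s query language, joining the state metric against the host tag of the real metrics.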

Thank you for the fast response! Yes, I suppose this is the piece I am missing. Do you have any suggestions on how to set up that state? I can think of a couple of ways, but I am not sure if there’s a best practice or preferred way to achieve this. What comes to mind is:

  • Set up a simple REST API with server state and add a Grafana datasource to scrape that API
  • Put server state into a database and have Grafana poll the database
  • Add a Telegraf tag like alerting = “enabled”, then change it to “disabled” before the system goes down, restart Telegraf, and wait for one collection interval to pass before shutting down. This would work, but if a node failed unexpectedly it would be difficult to clean up
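For the first option, the state service itself can be very small. A hypothetical sketch of its core (the endpoint names and the Telegraf/Grafana wiring on top of it are up to you):

```python
import time

class NodeRegistry:
    """In-memory desired-state registry that a REST API could expose.

    Nodes register on boot and deregister in their shutdown hook;
    Grafana (or a Telegraf input) scrapes the running list as a metric.
    """
    def __init__(self):
        self._nodes = {}  # node name -> registration timestamp

    def register(self, node: str) -> None:
        self._nodes[node] = time.time()

    def deregister(self, node: str) -> None:
        self._nodes.pop(node, None)  # idempotent: safe to call twice

    def running(self) -> list[str]:
        return sorted(self._nodes)

registry = NodeRegistry()
registry.register("web01")
registry.register("web02")
registry.deregister("web02")  # clean shutdown
print(registry.running())     # -> ['web01']
```

It shares the weakness you noted for the Telegraf-tag option: a node that crashes never deregisters, so you’d likely also want a TTL or a reconciliation loop against your orchestrator’s inventory.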

@bohnjamin You didn’t mention which version of InfluxDB; perhaps creating a Deadman alert would help? For v3: Deadman Alerts with Grafana and InfluxDB Cloud 3 | InfluxData, and for v2: How to: Deadman Check to Alert on Service Outage | InfluxData
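The deadman idea boils down to “alert if a host’s newest point is older than some threshold.” The linked InfluxData posts show it in Flux; the logic itself is roughly this (a sketch, with the threshold and timestamps purely illustrative):

```python
# Seconds without data before a host counts as dead (assumed threshold).
DEADMAN_THRESHOLD = 300

def dead_hosts(last_seen: dict[str, float], now: float,
               threshold: float = DEADMAN_THRESHOLD) -> list[str]:
    """Return hosts whose most recent data point is older than threshold."""
    return sorted(h for h, ts in last_seen.items() if now - ts > threshold)

now = 10_000.0
last_seen = {
    "web01": now - 30,   # reported 30s ago: alive
    "web02": now - 900,  # silent for 15 minutes: dead
}
print(dead_hosts(last_seen, now))  # -> ['web02']
```

On its own this still fires for intentionally scaled-down hosts, so it pairs naturally with the state-metric join discussed earlier in the thread: only evaluate the deadman condition for hosts that should be running.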