Clean up ephemeral nodes before shutdown

Hello, I’m trying to replace an ancient Grafana setup (backed by InfluxDB) with modern Grafana (12.3) as a proof of concept. I’ve got Grafana configured and able to query InfluxDB, and now I’m trying to build replacements for what exists now.

We have a lot of ephemeral nodes that create a dashboard (with alerts) for themselves when they come up, then remove the dashboard (and alerts) when they shut down. This works well enough, but now that Grafana alerting has decoupled alerts from dashboards, I’d like to use multi-dimensional alerting to avoid all that. Instead of 20 webservers coming online and each one creating a dashboard for itself that monitors webserver things, I’ll just have alerts that monitor all of the webservers and notify if something is wrong with any, many, or all of them.

So far, so good. Where I’m getting hung up is teardown: when scaling down, I don’t see how to tell Grafana not to alert when a node goes away intentionally. I could just turn off “No Data” alerts, but those are useful when a server goes unresponsive due to an OOM or I/O issue or something similar.

The best thing I can think of is to keep the grouped alerts as I’ve described them, ignoring No Data conditions, and then have each node set up a single alert (e.g., on CPU) just to catch No Data conditions, removing it on shutdown. This feels kludgy to me, so I’m creating this post to see if there’s a more elegant way to handle this, or if I’m missing something obvious, or just overthinking it.
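If the per-node No Data alert turns out to be the way to go, its lifecycle can at least be scripted against Grafana’s alert provisioning HTTP API (`/api/v1/provisioning/alert-rules`): create the rule on boot, delete it by UID in the shutdown hook. A minimal sketch; the URL, token, folder UID, datasource UID, and query body are all assumptions you’d adapt to your own setup:

```python
import json
import urllib.request

GRAFANA_URL = "http://grafana:3000"  # assumption: your Grafana instance
API_TOKEN = "glsa_example_token"     # assumption: a service-account token

def deadman_rule_payload(node: str) -> dict:
    """Build a minimal per-node alert rule that fires on No Data.

    The folder UID, datasource UID, and query model are placeholders;
    fill them in from your own Grafana configuration.
    """
    return {
        "title": f"deadman-{node}",
        "ruleGroup": "deadman",
        "folderUID": "ephemeral-nodes",       # assumption
        "condition": "A",
        "noDataState": "Alerting",            # the whole point: No Data fires
        "execErrState": "Alerting",
        "for": "5m",
        "data": [{
            "refId": "A",
            "datasourceUid": "influxdb-uid",  # assumption
            "relativeTimeRange": {"from": 600, "to": 0},
            "model": {"query": f"cpu metrics filtered to host {node}"},  # placeholder
        }],
    }

def create_rule(node: str) -> urllib.request.Request:
    # On boot: POST the rule; the response body includes the rule's UID.
    return urllib.request.Request(
        f"{GRAFANA_URL}/api/v1/provisioning/alert-rules",
        data=json.dumps(deadman_rule_payload(node)).encode(),
        headers={"Authorization": f"Bearer {API_TOKEN}",
                 "Content-Type": "application/json"},
        method="POST",
    )  # send with urllib.request.urlopen(...)

def delete_rule(rule_uid: str) -> urllib.request.Request:
    # On clean shutdown: remove the node's rule by the UID saved at create time.
    return urllib.request.Request(
        f"{GRAFANA_URL}/api/v1/provisioning/alert-rules/{rule_uid}",
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        method="DELETE",
    )
```

The catch is the same one you already identified: a node that dies unexpectedly never runs its shutdown hook, so its rule lingers (which, for a deadman alert, is arguably the desired behavior).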

Thanks in advance for any help!

OK, so you have No Data. How do you know from the metrics whether it’s No Data because of a scale-down event (so no real problem) or because of a real node issue?
I’m guessing you don’t have an operational “state”, so you need that first, as a metric. Then you can join it with your existing metrics and alert only on resources that are (or should be) running, not on terminated ones.
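The join described here can be illustrated outside any particular query language. A toy Python sketch, where `desired_running` stands in for the operational-state metric and `reporting` stands in for hosts seen in the metrics recently:

```python
def nodes_to_alert(desired_running: set[str], reporting: set[str]) -> set[str]:
    """A node is a problem only if it *should* be running but sent no data.

    Nodes absent from desired_running were scaled down intentionally,
    so their silence is expected and produces no alert.
    """
    return desired_running - reporting

# Example: web03 was scaled down (not in desired state), web02 went silent.
desired = {"web01", "web02"}
seen = {"web01", "web03"}
print(nodes_to_alert(desired, seen))  # -> {'web02'}
```

In practice you would express the same set difference in your datasource’s query language, joining the state metric against the host tag of the real metrics.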

Thank you for the fast response! Yes, I suppose this is the piece I am missing. Do you have any suggestions on how to set up that state? I can think of a couple of ways, but I am not sure if there’s a best practice or preferred way to achieve this. What comes to mind is:

  • Set up a simple REST API with server state and add a Grafana datasource to scrape that API
  • Put server state into a database and have Grafana poll the database
  • Add a Telegraf tag like alerting = “enabled”, then change it to “disabled” before the system goes down, restart Telegraf, and wait for one collection interval to pass before shutting down. This would work, but if a node failed unexpectedly it would be difficult to clean up
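For the first option, the state service itself can be very small. A hypothetical sketch of its core (the endpoint names and the Telegraf/Grafana wiring on top of it are up to you):

```python
import time

class NodeRegistry:
    """In-memory desired-state registry that a REST API could expose.

    Nodes register on boot and deregister in their shutdown hook;
    Grafana (or a Telegraf input) scrapes the running list as a metric.
    """
    def __init__(self):
        self._nodes = {}  # node name -> registration timestamp

    def register(self, node: str) -> None:
        self._nodes[node] = time.time()

    def deregister(self, node: str) -> None:
        self._nodes.pop(node, None)  # idempotent: safe to call twice

    def running(self) -> list[str]:
        return sorted(self._nodes)

registry = NodeRegistry()
registry.register("web01")
registry.register("web02")
registry.deregister("web02")  # clean shutdown
print(registry.running())     # -> ['web01']
```

It shares the weakness you noted for the Telegraf-tag option: a node that crashes never deregisters, so you’d likely also want a TTL or a reconciliation loop against your orchestrator’s inventory.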

@bohnjamin You didn’t mention which version of InfluxDB; perhaps creating a Deadman alert would help? For v3: Deadman Alerts with Grafana and InfluxDB Cloud 3 | InfluxData, and for v2: How to: Deadman Check to Alert on Service Outage | InfluxData
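The deadman idea boils down to “alert if a host’s newest point is older than some threshold.” The linked InfluxData posts show it in Flux; the logic itself is roughly this (a sketch, with the threshold and timestamps purely illustrative):

```python
# Seconds without data before a host counts as dead (assumed threshold).
DEADMAN_THRESHOLD = 300

def dead_hosts(last_seen: dict[str, float], now: float,
               threshold: float = DEADMAN_THRESHOLD) -> list[str]:
    """Return hosts whose most recent data point is older than threshold."""
    return sorted(h for h, ts in last_seen.items() if now - ts > threshold)

now = 10_000.0
last_seen = {
    "web01": now - 30,   # reported 30s ago: alive
    "web02": now - 900,  # silent for 15 minutes: dead
}
print(dead_hosts(last_seen, now))  # -> ['web02']
```

On its own this still fires for intentionally scaled-down hosts, so it pairs naturally with the state-metric join discussed earlier in the thread: only evaluate the deadman condition for hosts that should be running.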