I have a really old TIG stack that I inherited. It’s on the roadmap to be upgraded, but it was further down the list mainly because it just worked (that might be changing now, keep reading).
The setup: a SQL Server box on Windows Server 2012 running Telegraf.
We are on Grafana 4.08 and Telegraf 1.0.0-beta3-54-gdbf6380. I still need to figure out the InfluxDB version, but it’s old.
This has been running for 4 years without a hiccup. Recently, though, the dashboard that monitors CPU, which also has an alert for “NO DATA”, started misbehaving: the alert fires, goes back to “OK” right after, and repeats that cycle endlessly. Meanwhile the CPU graph is still being populated with data, so at first I thought the problem was on the Grafana / InfluxDB side. After some digging, though, I found that if I restart the Telegraf service on the affected server (Windows 2012 R2), the alerts stop. It might go a day or it might go a week before it starts barking again. I have 5 servers running an identical setup, and only ONE of them does this.
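One check I can think of is whether points from that one server are arriving late or in bursts, which might explain a “NO DATA” alert flapping while the graph still fills in. Below is a minimal sketch in Python of that idea. It assumes the CPU timestamps have already been exported from InfluxDB as epoch seconds, and it assumes a 10 second collection interval; both are assumptions for illustration, not my actual config, and `find_gaps` is just a hypothetical helper name.

```python
from typing import List, Tuple


def find_gaps(
    timestamps: List[float],
    interval: float = 10.0,   # assumed collection interval in seconds
    tolerance: float = 1.5,   # how much slack to allow before calling it a gap
) -> List[Tuple[float, float]]:
    """Return (start, end) pairs where consecutive points are further
    apart than interval * tolerance."""
    ordered = sorted(timestamps)
    limit = interval * tolerance
    # Compare each point with the one that follows it.
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > limit]


if __name__ == "__main__":
    # Made-up sample data: one 45 second hole between 20 and 65.
    sample = [0, 10, 20, 65, 75, 85]
    for start, end in find_gaps(sample):
        print(f"gap of {end - start:.0f}s between {start} and {end}")
```

If the gaps on the problem server line up with the times the alert fired, that would point at Telegraf on that box rather than at Grafana or InfluxDB.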
I have dug through the Windows logs and there is nothing from Telegraf in there, or anything else suspicious. I tried to get Telegraf to write out a proper log file, but that’s proving problematic. My main question: has anyone else ever experienced something similar, and what was the cause in your case?
Of course this is expediting our “let’s upgrade our stack” timeline, but this is a mystery, and in IT we don’t like mysteries.