Telegraf collects data and stores it in InfluxDB. I use Grafana to visualize the data and for alerting.
I now have performance issues whenever Grafana executes the InfluxDB queries to evaluate the alert rules.
There are 1262 alert rules, each using the following Flux query:
    from(bucket: "IoT")
        |> range(start: -1d)
        |> filter(fn: (r) => r["_measurement"] == <individual per alert>)
        |> filter(fn: (r) => r["_field"] == <individual per alert>)
        |> sort(columns: ["_time"], desc: false)
        |> map(fn: (r) => {
            // seconds between the point's timestamp and now, or 0 if the value is false
            elapsed_time = if r._value == true then int(v: uint(v: now()) - uint(v: r._time)) / 1000000000 else 0
            return { r with elapsed_t: elapsed_time }
        })
        |> keep(columns: ["elapsed_t", "_time"])
I then take the last value of the result and check whether it is higher than a threshold.
For each alert rule, the goal is to check whether the last value is true and, if so, how much time has elapsed between that value's timestamp and the current time.
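To make that goal concrete, here is a minimal sketch that expresses it more directly by selecting the last point before computing the elapsed time, so that `map()` runs on one row instead of on every row of the last day. The measurement and field names are hypothetical placeholders, and I have not verified that this changes the load in my setup:

```flux
// Sketch only: "example_measurement" and "example_field" are placeholder names.
from(bucket: "IoT")
    |> range(start: -1d)
    |> filter(fn: (r) => r["_measurement"] == "example_measurement")
    |> filter(fn: (r) => r["_field"] == "example_field")
    // keep only the last row of each table; this assumes rows arrive ordered
    // by _time, which is why the explicit sort() is left out here
    |> last()
    // elapsed seconds since the last point if it is true, otherwise 0
    |> map(fn: (r) => ({
        _time: r._time,
        elapsed_t: if r._value == true then (int(v: now()) - int(v: r._time)) / 1000000000 else 0,
    }))
```

The alert condition would then compare `elapsed_t` against the threshold, as before.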
The alert evaluation takes more than 3 minutes. During this time Grafana also hangs and nothing works, because the CPUs are at 100% usage. I upgraded the server to 8 cores at 2.6 GHz each and increased the RAM to 24 GB. The application runs in Docker on Windows Server 2019.
How can I mitigate these issues? My aim is to provide a smooth, pleasant-to-use Grafana dashboard.