I have about 10 servers in a Kafka stack (ZooKeeper, Kafka, ksql…) whose metrics go through Prometheus to Grafana. I want to build a panel with alerting that detects when any server has been down for about 5 minutes and sends me an SMS or email. The notification channels are fine, and I do receive the SMS and email, but I can't work out how to set up the threshold. I am using the metric under metrics -> process -> process_cpu_seconds_total. But when I stop the service on one server, the alert does not fire. Could anyone with more experience point me in the right direction? Thanks!
Because the ping result is in milliseconds, I suggest changing the evaluation to every 1s for 2m, then using max() instead of avg() and setting the threshold to above 0.5 instead of 1. That will catch the alert, in my experience…
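To illustrate why the reducer matters, here is a rough Python sketch. The sample values and the `fires` helper are made up for the example and are not taken from any real dashboard. It only shows that averaging a window can hide a single high sample that max() would still catch.

```python
from statistics import mean

def fires(samples, reducer, threshold):
    """Return True when the reduced value of the window is above the threshold."""
    return reducer(samples) > threshold

# Hypothetical evaluation window: three quiet samples and one spike.
window = [0.0, 0.0, 0.0, 1.0]

print(fires(window, mean, 0.5))  # avg of the window is 0.25, so no alert
print(fires(window, max, 0.5))   # max of the window is 1.0, so the alert fires
```

With avg(), the spike is diluted by the quiet samples around it; with max(), one sample over the threshold is enough.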
Thank you for your reply! I tried another approach and it seems to work.
I set the condition so that the alert fires when count(*) over the last few minutes falls below a set number, so if any server stays down for too long, an alert is sent out.
It fulfills my needs.
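In case it helps anyone else, here is a minimal Python sketch of the idea behind that count-based condition. The server names, timestamps, and function names are all invented for illustration; this is not Grafana or Prometheus code. As far as I understand, a stopped server does not report a "bad" value, it simply stops producing samples, so a threshold on the value never fires, while counting how many servers are still reporting does.

```python
from datetime import datetime, timedelta

def servers_reporting(last_seen, now, window):
    """Count servers whose most recent sample falls inside the window."""
    return sum(1 for ts in last_seen.values() if now - ts <= window)

def should_alert(last_seen, now, expected, window=timedelta(minutes=5)):
    """Alert when fewer servers than expected reported within the window."""
    return servers_reporting(last_seen, now, window) < expected

# Hypothetical data: three servers, one of which went silent 7 minutes ago.
now = datetime(2024, 1, 1, 12, 0, 0)
last_seen = {
    "zookeeper-1": now - timedelta(seconds=15),
    "kafka-1": now - timedelta(seconds=15),
    "ksql-1": now - timedelta(minutes=7),
}

print(servers_reporting(last_seen, now, timedelta(minutes=5)))  # 2 still reporting
print(should_alert(last_seen, now, expected=3))                 # True, one is missing
```

The expected number has to be kept in step with the real number of servers, otherwise adding or removing a server will cause false alerts or missed ones.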
Thanks a lot!