I run Telegraf on all our nodes, with URL checks set up against all our sites. This creates a many-to-many checking relationship, so a bad reading from any single node gets averaged out.
In Grafana, I set up two queries. The first one is graphed and records the average response time:
alias(averageSeries(telegraf.*.GET.success.https:--my_url_com*.*.*.response_time), 'Avg Response Time')
The second one is disabled on the graph but records the response code:
averageSeries(telegraf.a_server_com.GET.success.https:--my_url_com*.*.http_response.http_response_code)
Then I set up my alert like this:
Essentially, I get an alert if the average response time is greater than 1 second or if the average HTTP response code is greater than 300 (which could indicate any number of problems). This combination seems to catch the problem situations you’ll run into.
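To make the alert condition concrete, here is a rough Python sketch of the same logic. It is not what Grafana runs internally; the function name `should_alert`, its parameters, and the sample readings are all made up for illustration, and it assumes each list holds one reading per checking node.

```python
from statistics import mean


def should_alert(response_times, response_codes,
                 max_avg_seconds=1.0, max_avg_code=300):
    """Return True if either averaged series crosses its threshold.

    Hypothetical helper mirroring the alert described above:
    response_times are in seconds, response_codes are HTTP status
    codes, and both lists are assumed to be non-empty.
    """
    too_slow = mean(response_times) > max_avg_seconds
    bad_codes = mean(response_codes) > max_avg_code
    return too_slow or bad_codes


# One slow node out of five is averaged away: (0.2 * 4 + 3.0) / 5 = 0.76
one_slow_node = should_alert([0.2, 0.2, 0.2, 0.2, 3.0], [200] * 5)

# Two of five nodes seeing 500s pushes the average code to 320
two_failing_nodes = should_alert([0.2] * 5, [200, 200, 200, 500, 500])

# Every node is slow, so the average response time is 1.5 seconds
site_slow = should_alert([1.5] * 5, [200] * 5)
```

With these sample values, `one_slow_node` is `False` while `two_failing_nodes` and `site_slow` are `True`, which is the point of checking from many nodes: a single flaky checker does not trip the alert, but a problem seen by several of them does.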
