Monitor whether a website is up or down using Grafana

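I run Telegraf on all our nodes, with URL checks set up against all our sites. Every node checks every site, which creates a many-to-many checking relationship that mitigates bad data: a bad reading from one node is averaged together with the readings from the others.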

In Grafana, I set up two queries. The first one is graphed and records the average response time:

alias(averageSeries(telegraf.*.GET.success.https:--my_url_com*.*.*.response_time), 'Avg Response Time')

The second one is disabled on the graph but records the response code:

averageSeries(telegraf.a_server_com.GET.success.https:--my_url_com*.*.http_response.http_response_code)
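To make the averaging concrete, here is a minimal Python sketch of what averageSeries roughly does: at each timestamp it takes the mean of whatever samples are present across the matched series. The function name average_series and the sample data are hypothetical and only for illustration; this is not Graphite's actual implementation.

```python
from statistics import mean
from typing import List, Optional, Sequence


def average_series(series: Sequence[Sequence[Optional[float]]]) -> List[Optional[float]]:
    """Pointwise mean across several series, skipping missing (None) samples."""
    averaged: List[Optional[float]] = []
    # zip(*series) walks the series in lockstep, one timestamp at a time.
    for samples in zip(*series):
        present = [s for s in samples if s is not None]
        # If no node reported at this timestamp, the result has no data either.
        averaged.append(mean(present) if present else None)
    return averaged


# Hypothetical response times (seconds) reported by three nodes for one URL.
node_a = [0.21, 0.25, None]
node_b = [0.19, 0.23, 0.30]
node_c = [0.20, 2.50, 0.28]  # one slow reading at the second timestamp

print(average_series([node_a, node_b, node_c]))
```

In this made-up data the single slow reading from one node raises the average at that timestamp, but less than it would if that node were the only checker.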

Then I set up my alert like this:

Essentially, I get an alert if the average response time is greater than 1 second or if the average HTTP response code is greater than 300 (which could indicate any number of problems). This combination seems to catch the problem situations you’ll run into.
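The alert condition described above can be sketched in Python as a simple OR of two threshold checks. The function name should_alert and its parameter names are hypothetical, and the sketch assumes the response time is reported in seconds; it illustrates the logic only and is not how Grafana evaluates alerts internally.

```python
def should_alert(
    avg_response_time_s: float,
    avg_response_code: float,
    max_response_time_s: float = 1.0,
    max_response_code: float = 300.0,
) -> bool:
    """Return True if either averaged metric crosses its threshold."""
    too_slow = avg_response_time_s > max_response_time_s
    bad_status = avg_response_code > max_response_code
    return too_slow or bad_status


# Hypothetical readings: healthy, slow, and failing.
print(should_alert(0.3, 200.0))
print(should_alert(1.5, 200.0))
print(should_alert(0.3, 503.0))
```

Because the comparison is strictly greater-than, an average response code of exactly 300 does not fire the alert in this sketch.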