I have several servers that I want to monitor in order to know the basic stats, and if they are up or down. For that purpose I am using the status history panel. However, I am having several problems in a particular case.
When any of my servers goes down, the Grafana dashboard tells me that there is no data. In this case, I can’t trigger an alert or make Grafana paint that cell red (or any other colour).
So, I was wondering if there is any way to turn the “no data in response” message that appears in my dashboard into a value (such as 0). That way I would be able to create a threshold for the 0 value.
I have been using value mapping for this panel. However, it only works for certain cases, and I am still struggling with one of them.
To display all the servers I want to monitor, I created a dashboard variable whose values are the hosts/servers. Once I had created the variable (storage_machines), I wrote the following query and configured the desired value mapping.
*SELECT mean("value") FROM "cpu_value" WHERE ("host" =~ /^$storage_machines$/ AND "type_instance" = 'system') AND $timeFilter GROUP BY time(15m), "host" fill(null)*
I thought that everything was working fine until I saw that the panel was not showing all my machines. I noticed that the ones that had been down for longer than the dashboard time range (configured at 2 days) were not shown.
When I inspected the raw query, I discovered that the system was only querying the variable values that had reported at least once in that period of time, so some of the machines were not appearing on my panel (as the query wasn’t returning any data for them):
*SELECT mean("value") FROM "cpu_value" WHERE ("host" =~ /^(machine1|machine2|machine3...)$/ AND "type_instance" = 'system') AND **time >= now() - 2d and time <= now()** GROUP BY time(15m), "host" fill(null)*
Since I want to display all the machines (even those that have not reported for a long time), is it possible to display every value that makes up my variable (in my case, the hosts/servers), even if a host has not reported within the dashboard time range? That way, the down servers would be both the ones that have not reported within the dashboard time range and the ones that have not reported in the last 15 minutes (as in the query I originally wrote).
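To make the desired result concrete, here is a minimal Python sketch of the behaviour I am after. It is not Grafana or InfluxDB code: the function name `host_status` and the sample data are made up for illustration, and I am assuming that a host missing from the query result (or present with a null value) should be shown as 0.

```python
from typing import Dict, List, Optional


def host_status(
    all_hosts: List[str],
    latest_values: Dict[str, Optional[float]],
) -> Dict[str, int]:
    """Map every host in the variable to 1 (reporting) or 0 (down).

    A host counts as down when it is absent from the query result
    or when its latest value is null.
    """
    status = {}
    for host in all_hosts:
        value = latest_values.get(host)  # None if the host is missing
        status[host] = 0 if value is None else 1
    return status


# Hypothetical data: machine2 returned null for the last interval,
# machine3 did not report at all within the dashboard time range.
all_hosts = ["machine1", "machine2", "machine3"]
latest_values = {"machine1": 12.5, "machine2": None}

print(host_status(all_hosts, latest_values))
```

With a result like this, every host in the variable would get a cell in the panel, and a threshold or value mapping on 0 could colour the down ones red.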