Dear experts,
I’m monitoring a bunch of sensors with Grafana from an InfluxDB database.
Besides the actual sensor readings I also want to monitor if the sensors are sending regularly new values, since the sensors use a wireless connection that may have problems.
Since I know the interval at which the sensors are supposed to send data, my idea was to count the sensor entries in the database for a given time frame.
More hands on: when a sensor is supposed to send a new value each 5 min I expect 12 sensor values per hour.
I used the following query in Grafana in a time series visualization: SELECT count("temperature") FROM "room_temp" WHERE ("device_id"::tag = 'xyz') GROUP BY time(60m)
That almost gives me what I need. Anyhow, there are two issues:
The GROUP BY TIME function looks ahead.
Example: I get a data-point from that query each hour (8:00, 9:00, 10:00, …). Let’s assume it is 11:36. From the start of the 11:00 data-point I have 7 seven sensor values (one each 5 min), hence the value for the 11:00 data-point is 7, which gives the impression that I haven’t received enough sensor readings, since I expect 12/ hour. What I want to achieve is to look 1 h backwards.
Because only when I look back I have now an idea about how regularly a sensor is sending data. If look ahead I can only tell it at the end of each hour.
Is there a way to group by the last hour?
I think this one is more tricky. The time-base of some sensors is not ultra precise.
Real world example:
I received a sensor value at 8:59:26. Since the actual time between sensor values is 5:06 min, the 12th value for that hour is received at 10:00:38. Hence during that 1 hour interval only 11 values are received. Looks like 1 lost data value, but is just a bit of missing timing precision.
I tried to increase the group by interval to 61,5 min. But besides the fact, that this gives a bit strange data points (no longer one each hour at the full hour) I have now from time to time 13 sensor values in an interval .
Any idea of a good way to prevent that?
According to InfluxQL docs, you should be able to use the offset_interval within the GROUP BY clause to offset intervals back by 1h.
ie.
WHERE ("device_id"::tag = 'xyz') GROUP BY time(60m, -60m)
Regarding your second issue:
I have now from time to time 13 sensor values in an interval .
Any idea of a good way to prevent that?
Without knowing too much about the sensor data coming in, this may require some combination of intervention on part of your sensors and/or InfluxQL. Perhaps you could normalize the timestamps?
I tried that but it seems to change nothing. I looked into the documentation and though it is hard for me to fully grasp it, I understood the following:
The offset_interval does only move the queried frame relatively to the given time_interval.
E.g. if I would use GROUP BY time(60m, -30m), datapoints will be counted between
00:30 … 01:30
01:30 … 02:30
02:30 … 3:30
and so on.
GROUP BY time(60m, -60m) does simply give the exact some interval that would be used without the offset_interval.
Now it is here 9:24. The query still counts the datapoints between 9:00 and 10:00 and gives currently 5 for the datapoint that is plotted for 9:00. It does not count the datapoints between 8:00 and 9:00, what I try to achieve.
I looked into that, but I’m not sure, even more since it seems to require FLUX as query language and I avoided up to now to look into that.
I thought about something different. Instead of counting data points in 1 hour I could look at the timing gap of a datapoint to its predecessor. So I would avoid the issue that the counting (with GROUP BY time) counts exactly from full hour to full hour, whereas the sensors send whenever they think its time to send (though keeping - more or less - the 5 min interval). Using the example of 1 datapoint each 5 min I could do the evaluation:
gap 4 min … 6 min → OK
gap > 6 min → there is something (1 datapoint lost) → yellow alert
gap > 11 min → NOK (2 or more datapoints lost) → red alert
But i was not able to find a way to calculate this timing gap to the previous data point. Do you have any idea?
Actually I’m talking about LoRaWAN sensors. They send their data to a so called network server (TTN) and telegraf fetches data from TTN via a protocol called MQTT.
The sensors are comparably simple. Their main job (besides doing a precise measurement) is to live on a battery as long as possible (5 … 10 years), since you may install sensors in remote areas (think of smart farming for example). They usually don’t have something like an RTC (real time clock) on board, so they don’t know the exact time. If you switch them on they send (after some network management) their sensed value(s). And about 5 min later (or whatever is configured), they do it again, regardless of the actual time.
Was just in the process of doing that when your post came in.
I also noticed that this is not really a Grafana problem but an influxdb query question.
You’re right, that works great when you want look at the current value! I added a gauge now to show that.
But I was not completely clear in my question. I also want to monitor this metric over time.
The momentary value in a gauge is fine with your query.
In the time series you can see the result of SELECT count("temperature") FROM "room_temp" WHERE ("device_id"::tag = 'xyz') GROUP BY time(60m)
Marked in orange are “11-counts” due to time deviation and in red the momentary that shows the counts in the running full hour.
What I’m behind are the red marked situations in this chart:
Notes:
I recommend to use $_from/$_to before, e.g. InfluxQL:
SELECT
(($__to-$__from)/1000)*(COUNT("field") / COUNT("field")) AS selected_dashboard_timerange_in_seconds,
((($__to-$__from)/1000)*(COUNT("field") / COUNT("field"))/300) AS expected_datapoints_counts_if_5min_ingestion,
COUNT("field") AS actual_datapoints_counts
FROM "measurement" WHERE ... AND $timeFilter
So it’s dynamic calculation based on selected dashboard timerange:
but I need timeseries, so I need to query full hours - use moment (that’s lib used for date parsing in the Grafana) date formatting of timerange in Grafana, so Grafana will generate only full hour timerange for InfluxDB query, I guess:
(($__to-$__from)/1000)*(COUNT("field") / COUNT("field")) AS selected_dashboard_timerange_in_seconds
What is the reason for calculating the seconds of the time range? It seems to me that it provides not added value.
((($__to-$__from)/1000)*(COUNT("field") / COUNT("field"))/300)
I think the part (COUNT("field") / COUNT("field") is not needed.
Wouldn’t ((($__to-$__from)/1000) / 300) give the same result?
If I understand your suggestion correctly, this will also give momentary values (situation right now). It just considers a longer time range (dependent from the selected time range) than 1 h.
Correct?
What I want to achieve is a time series where I can see when I lost sensor values. Maybe I can relate it to installation / update / maintenance work. Or I can detect patterns: every morning between 3:00 and 5:00 I loose data, because the bakery close to my sensor is starting some big electric machines that disturbes the wireless transmission (just to make something up ).
That’s demonstration - just small subpart of bigger picture. That’s not momentary value - that’s average value for selected timerange. You have million sensors - which one should be inspected in details (e.g. check timeseries for issues/patters)? Exactly those, which have avg availability <100 (e.g. for last 7 days).
I provided a idea, which should be developed into proper solution for your case or abandoned.