I’m trying to get started using Postgres for IoT sensor time-series storage, with Grafana for dashboards. I know about time-series-specific databases such as TimescaleDB, but would prefer to go with plain Postgres if possible, for a variety of reasons. I’m writing up my plans in the hope that it spurs some discussion that may help me find a better design, or that helps others reading this in the future.
I have relatively modest requirements:
- ~1000 time-series, mostly with a data point every minute
- a typical message from a device contains about 10 metrics; put differently, groups of about 10 metrics share a timestamp and much of the meta-data
- quad-core ARM w/2GB memory, eMMC storage (i.e. not a big x64 box)
- at least 10-years lifetime of the DB (not necessarily the HW)
My first option is to use a narrow data table for the time-series with just three columns: timestamp, tags, and value, where tags is a foreign key to a record that holds all the meta-data: device, sensor, metric, name, scale, unit, location, etc. This provides great flexibility because a device can report whatever metrics it wants to and they can be represented individually.
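A minimal sketch of what I have in mind (table and column names are placeholders, not settled; the PARTITION BY anticipates the time-based partitioning discussed below, and foreign keys on partitioned tables need PostgreSQL 11+):

```sql
-- Meta-data: one row per individual metric a device can report.
CREATE TABLE series (
    id       serial PRIMARY KEY,
    device   text NOT NULL,
    sensor   text NOT NULL,
    metric   text NOT NULL,
    name     text,
    scale    real,
    unit     text,
    location text
);

-- Narrow data table: one row per data point.
CREATE TABLE raw_data (
    time  timestamp NOT NULL,
    tags  integer NOT NULL REFERENCES series (id),
    value real NOT NULL
) PARTITION BY RANGE (time);
```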
I believe this means that, assuming timestamp, integer, and real columns, a data-table row uses 23 + 1 + 8 + 4 + 4 = 40 bytes (23-byte tuple header, 1 byte of alignment padding, 8-byte timestamp, 4-byte integer, 4-byte real). Thus 1000 metrics, one point per minute, for one year accumulate to about 1000 × 525,600 × 40 ≈ 21 GB of data. That’s not impossibly large, but not small either, given the “low end” system I’m contemplating.
One thing that is clear is that I need to partition the table by timestamp, probably monthly, so I can deal with old data, whether that’s deleting or archiving.
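Assuming the data table is declared with PARTITION BY RANGE (time), monthly partitions might be managed like this (partition names are placeholders; an extension such as pg_partman could automate the creation):

```sql
-- Create next month's partition ahead of time, e.g. from a cron job.
CREATE TABLE raw_data_2024_01 PARTITION OF raw_data
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');

-- Retiring a month of old data is then a cheap metadata operation:
DROP TABLE raw_data_2024_01;
-- ...or detach it first if it should be archived rather than deleted:
-- ALTER TABLE raw_data DETACH PARTITION raw_data_2024_01;
```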
I’m pretty sure I will need to produce aggregated tables, e.g. hourly, daily or whatnot. Otherwise I expect that the time to produce the data for a dashboard full of yearly graphs will take forever. What is not clear to me is how to automatically switch between the raw and the various aggregated tables on the grafana end, I wonder whether a view can do that so I don’t have to have mile-long SQL in each graph’s definition.
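One sketch of the aggregation side, assuming the narrow raw_data(time, tags, value) layout (names are mine):

```sql
-- Hourly roll-up as a materialized view, refreshed periodically
-- (e.g. from cron).
CREATE MATERIALIZED VIEW data_hourly AS
SELECT date_trunc('hour', time) AS time,
       tags,
       avg(value) AS avg_value,
       min(value) AS min_value,
       max(value) AS max_value
FROM raw_data
GROUP BY 1, 2;

REFRESH MATERIALIZED VIEW data_hourly;
```

On the Grafana side, the PostgreSQL data source provides macros like $__timeFilter(column) and variables like $__interval, but as far as I can tell a plain view cannot see those, so each panel would either query the appropriate roll-up directly or branch on the interval inside its own SQL.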
An alternative design would be to use a wider table for the time-series. E.g., a row could hold a timestamp, a foreign key to the meta-data, and an array of values (all the values reported by a device in one message). Assuming an average of 10 values in the array, the naive estimate is 23 + 1 + 8 + 4 + 10 × 4 = 76 bytes per row, but that ignores the array overhead, which seems to be ~20 bytes. Measuring with pg_column_size, a row with 10 values actually uses 97 bytes, which sums to about 5 GB per year.
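For reference, the measurement can be reproduced along these lines (I use vals rather than the reserved word VALUES; names are placeholders):

```sql
-- Wide data table: one row per device message.
CREATE TABLE wide_data (
    time timestamp NOT NULL,
    tags integer NOT NULL,
    vals real[] NOT NULL
);

INSERT INTO wide_data
VALUES (now(), 1, ARRAY[1,2,3,4,5,6,7,8,9,10]::real[]);

-- Size of the whole stored row, including the array overhead:
SELECT pg_column_size(w.*) FROM wide_data w;
```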
The 4x size reduction over option 1 is certainly attractive. The downside is that the SQL becomes more complex because the meta-data has to specify which array element to extract the value from.
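The extraction could be hidden behind a join, assuming the meta-data table gains a column recording each metric’s slot in the array (elem here is my invention, as are the other names):

```sql
-- Pull one metric's values out of the per-message arrays.
SELECT w.time, w.vals[m.elem] AS value
FROM wide_data w
JOIN series m ON m.id = w.tags
WHERE m.metric = 'temperature'
  AND w.time >= now() - interval '7 days';
```

A view per metric, or one generic view exposing (time, metric, value), might keep this complexity out of the individual Grafana panels.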
Thoughts? I’m sure others have gone down this or a very similar path, but I haven’t found a whole lot of write-ups or experiences…