My colleagues and I have encountered a problem with one of our analytics setups.
We are open to any suggestion, even if it means changing the way we collect the data and the DB we are using.
Our users create hundreds of EC2 instances per day, across many regions, zones, instance types, etc.
We have an analytics task, written in Python, that collects the number of all EC2 instances every 15 minutes, broken down by many types/attributes.
Collected record example:
User | Instance type | Our app version | Zone | Region | Num of instances
Dave | t2.small | v3.2 | a | us-east-1 | 8
We store these records in InfluxDB, which, of course, attaches a timestamp to every record.
Grafana queries the data from InfluxDB and aggregates the values (number of instances) by attributes such as 'User', and by time interval (using the timestamp created by InfluxDB).
SELECT sum("count") FROM "system_usage_table" WHERE $timeFilter GROUP BY time($interval), "zone" fill(null)
The default aggregation by time can show wrong data. For example, with a 30-minute interval, two records like the one above (stored in InfluxDB with different timestamps) get summed, and the graph shows user Dave as having '16 machines' within that time range. That is not true: he never had more than 8.
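To make the problem concrete, here is a minimal Python sketch (standard library only, with made-up numbers) of what summing a gauge over a wider time bucket does to two 15-minute samples of the same 8 instances:

```python
# Two consecutive 15-minute samples for user Dave, each reporting
# the same 8 running instances (a gauge, not a delta).
samples = [
    {"user": "Dave", "minute": 0, "count": 8},
    {"user": "Dave", "minute": 15, "count": 8},
]

# What sum() over a 30-minute bucket reports: 8 + 8 = 16 "machines".
summed = sum(s["count"] for s in samples)
print(summed)  # 16 -- but Dave never had more than 8

# What we actually want per bucket: the last (or max) observed value.
latest = max(samples, key=lambda s: s["minute"])["count"]
print(latest)  # 8
```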
Moreover, the records sometimes do not arrive at fixed 15-minute intervals, so even if I aggregate by the usual 15 minutes (matching the scheduled task's cadence), I get wrong data.
My naive suggestion:
Move to a SQL DB (PostgreSQL or MySQL) and add a new field to each record, 'iteration', representing the analytics task's run number. Then, in Grafana, aggregate first by iteration and then by any other attribute in the query, and present (not aggregate) the data on a timeline.
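A minimal sketch of this suggestion, using Python's sqlite3 as a stand-in for PostgreSQL/MySQL (the table and column names here are invented for illustration): each task run tags its records with an 'iteration' number, and the dashboard query groups by iteration instead of by wall-clock bucket, so late-arriving records and wide time buckets can no longer double-count:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE system_usage (
        iteration INTEGER,        -- run number of the analytics task
        ts TEXT,                  -- actual collection time (may drift)
        user TEXT,
        zone TEXT,
        num_instances INTEGER
    )
""")

# Two runs, ~15 minutes apart; Dave's 8 instances appear in both.
# The second run's records arrived a couple of minutes late, but they
# still carry iteration=2, so they land in the right group.
rows = [
    (1, "2023-01-01T10:00", "Dave", "a", 8),
    (1, "2023-01-01T10:00", "Dave", "b", 3),
    (2, "2023-01-01T10:17", "Dave", "a", 8),
    (2, "2023-01-01T10:17", "Dave", "b", 3),
]
conn.executemany("INSERT INTO system_usage VALUES (?, ?, ?, ?, ?)", rows)

# Group by iteration (not by time bucket): each point on the graph is
# one task run, so totals are never summed across runs.
per_run = conn.execute("""
    SELECT iteration, user, SUM(num_instances)
    FROM system_usage
    GROUP BY iteration, user
    ORDER BY iteration
""").fetchall()
print(per_run)  # [(1, 'Dave', 11), (2, 'Dave', 11)]
```

Each run then shows Dave's true total (11 across both zones), regardless of how the runs align with clock-time intervals.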
Do you think a solution for this exists within our InfluxDB and Grafana setup?
If not, would my suggestion work?
Has anybody else encountered the same problem?
Thanks for your help,