InfluxDB Retention Policy following $__interval contents

twbert · June 12, 2017, 11:02am

Hi,

We are using InfluxDB’s Continuous Queries to automatically downsample metrics into Retention Policies with a longer lifespan.

These are the RPs:

create retention policy A on telegraf duration 1w replication 1 shard duration 168h
create retention policy B on telegraf duration 8w replication 1 shard duration 1344h
create retention policy C on telegraf duration 53w replication 1 shard duration 8904h
create retention policy D on telegraf duration INF replication 1 shard duration 520w

The metric collectors (Telegraf for this cpu metric) send their data to “A”.

Here is an example set of CQs for one measurement, cpu (we run hundreds of them, and they perform very well):

CREATE CONTINUOUS QUERY cpu_mean15m ON telegraf RESAMPLE EVERY 30m FOR 3h BEGIN SELECT mean(usage_guest) AS usage_guest, mean(usage_guest_nice) AS usage_guest_nice, mean(usage_idle) AS usage_idle, mean(usage_iowait) AS usage_iowait, mean(usage_irq) AS usage_irq, mean(usage_nice) AS usage_nice, mean(usage_softirq) AS usage_softirq, mean(usage_steal) AS usage_steal, mean(usage_system) AS usage_system, mean(usage_user) AS usage_user INTO telegraf.B.cpu FROM telegraf.A.cpu GROUP BY time(15m),* fill(none) END
CREATE CONTINUOUS QUERY cpu_mean2h ON telegraf RESAMPLE FOR 6h BEGIN SELECT mean(usage_guest) AS usage_guest, mean(usage_guest_nice) AS usage_guest_nice, mean(usage_idle) AS usage_idle, mean(usage_iowait) AS usage_iowait, mean(usage_irq) AS usage_irq, mean(usage_nice) AS usage_nice, mean(usage_softirq) AS usage_softirq, mean(usage_steal) AS usage_steal, mean(usage_system) AS usage_system, mean(usage_user) AS usage_user INTO telegraf.C.cpu FROM telegraf.A.cpu GROUP BY time(2h),* fill(none) END
CREATE CONTINUOUS QUERY cpu_mean1d ON telegraf RESAMPLE FOR 2d BEGIN SELECT mean(usage_guest) AS usage_guest, mean(usage_guest_nice) AS usage_guest_nice, mean(usage_idle) AS usage_idle, mean(usage_iowait) AS usage_iowait, mean(usage_irq) AS usage_irq, mean(usage_nice) AS usage_nice, mean(usage_softirq) AS usage_softirq, mean(usage_steal) AS usage_steal, mean(usage_system) AS usage_system, mean(usage_user) AS usage_user INTO telegraf.D.cpu FROM telegraf.A.cpu GROUP BY time(1d),* fill(none) END

We templated the RP, so the end user has to choose A, B, C or D.
All panels on our main dashboards use the templated RP.

The problem with this is that it is too easy for users to make mistakes.
When they choose A and select the last 7 days, the queries become far too slow (tens of seconds). To get normal performance, they have to manually select B or C from the templating dropdown we made for Retention Policies. And when zooming in on the data (for example by dragging a region), they have to remember to switch back at a certain point.

The bad performance is not related to disk I/O. It is purely the number crunching of going through every second (one measurement per second) within 7 days of data for several servers (‘tags’). Even with large ‘time buckets’, for example only 10 results per graph returned by the server, performance does not improve. We see a high CPU spike on the InfluxDB server, with low I/O wait and low I/O traffic.
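As a rough back-of-the-envelope check of the point counts involved, here is a small Python sketch. It assumes exactly one sample per second in A, as described above, and uses the 15m bucket width of the CQ that writes to B. The function name is hypothetical and only serves the illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def points_per_field(range_seconds: int, bucket_seconds: int) -> int:
    """Number of stored points one field of one series has in the given range."""
    return range_seconds // bucket_seconds


seven_days = 7 * SECONDS_PER_DAY

# Raw data in A: one point per second (assumption taken from the post).
raw_points = points_per_field(seven_days, 1)

# Downsampled data in B: one point per 15 minutes (GROUP BY time(15m)).
downsampled_points = points_per_field(seven_days, 15 * 60)

print(raw_points, downsampled_points, raw_points // downsampled_points)
```

Under these assumptions InfluxDB has to read 900 times more points per field and per series from A than from B for the same 7-day window, regardless of how few buckets the panel finally displays.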

Since all the downsampling by the CQs happens in near real-time, Grafana should select a different RP automatically when a bigger interval is selected.

Example: when $__interval is bigger than 15 minutes, we want Grafana to use retention policy B instead of A.
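To make the desired rule concrete, here is a minimal Python sketch of the selection logic. It is not an existing Grafana feature. The bucket widths are taken from the GROUP BY time(...) clauses of the CQs above and from the one-sample-per-second rate in A. The function name, the table layout, and the choice to treat an interval of exactly 15 minutes as already eligible for B are assumptions made for illustration.

```python
from datetime import timedelta

# Bucket width of the data stored in each retention policy, finest first.
# A holds raw per-second data; B, C and D are filled by the CQs shown earlier.
RP_RESOLUTION = [
    ("A", timedelta(seconds=1)),
    ("B", timedelta(minutes=15)),
    ("C", timedelta(hours=2)),
    ("D", timedelta(days=1)),
]


def pick_retention_policy(interval: timedelta) -> str:
    """Return the coarsest RP whose bucket width does not exceed the interval.

    Falls back to the finest RP (A) when the interval is smaller than
    every bucket width.
    """
    chosen = RP_RESOLUTION[0][0]
    for name, resolution in RP_RESOLUTION:
        if resolution <= interval:
            chosen = name
    return chosen


if __name__ == "__main__":
    for minutes in (1, 15, 30, 120, 3 * 24 * 60):
        print(minutes, pick_retention_policy(timedelta(minutes=minutes)))
```

The sketch only looks at the interval. It does not check whether the selected time range is still covered by the chosen RP's duration (for example, A only keeps one week of data), which a real implementation would also have to consider.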

Any ideas on how to accomplish this?

I’m not sure yet what to ask for as a feature request. A query postprocessor would probably be most flexible, but an addition to Grafana’s templating framework could also be an option.

Kind regards, TW

voiprodrigo · March 13, 2018, 12:40am

There’s an open issue for this, but it doesn’t seem to be active.

github.com/grafana/grafana

Automatically chose retention policy based on time range

opened 12:32PM - 05 Mar 16 UTC

closed 10:12AM - 30 Nov 21 UTC

dennisjac

type/feature-request datasource/InfluxDB prio/medium area/datasource

Hi, I'm currently looking at how to make Grafana use the aggregated retention policies in InfluxDB. Issue #3943 apparently adds a way to set the retention policy statically for a query but this only works if you have a handful of queries. The moment you have more dashboards this is not really a viable option.

What I'd rather like to propose is the ability to add a table in the settings that defines a retention policy based on time ranges. For example you would define the values like this:

1 month => retention_one_value_per_day
1 week => retention_one_value_per_hour
1 day => retention_one_value_per_10s

All queries in all dashboards would then use the retention policy closest to the selected time range as default (which could still be overridden on a per-query basis by explicitly selecting a retention policy there?). If the user doesn't define this table then the default retention policy would always be used.

With this approach most people would be able to set up this table once and then all dashboards would automatically always use the right retention policy for their data which is I think what 99% of the people out there expect. Also for anyone who doesn't specify this value Grafana will behave just as before.
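The table lookup proposed in the issue could be sketched in Python as follows. This is only an illustration of the idea, not code from Grafana. The retention policy names come from the issue text. The function name, counting "1 month" as 30 days, and reading "closest" as the smallest absolute difference in duration are assumptions.

```python
from datetime import timedelta
from typing import Dict, Optional

# Table from the issue: time range => retention policy.
# "1 month" is approximated as 30 days (assumption).
RANGE_TO_RP: Dict[timedelta, str] = {
    timedelta(days=30): "retention_one_value_per_day",
    timedelta(days=7): "retention_one_value_per_hour",
    timedelta(days=1): "retention_one_value_per_10s",
}


def rp_for_range(
    time_range: timedelta,
    table: Dict[timedelta, str] = RANGE_TO_RP,
) -> Optional[str]:
    """Return the RP whose table entry is closest to the selected time range.

    Returns None when no table is defined, meaning the caller should keep
    using the default retention policy.
    """
    if not table:
        return None
    closest = min(table, key=lambda span: abs(span - time_range))
    return table[closest]


if __name__ == "__main__":
    print(rp_for_range(timedelta(hours=6)))
    print(rp_for_range(timedelta(days=6)))
    print(rp_for_range(timedelta(days=90)))
```

An explicit per-query retention policy, as mentioned in the issue, would simply bypass this lookup.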