InfluxDB Retention Policy following $__interval contents

Hi,

We are using InfluxDB’s Continuous Queries to automatically downsample metrics to Retention Policies with a bigger lifespan.

These are the RPs:

create retention policy A on telegraf duration 1w replication 1 shard duration 168h
create retention policy B on telegraf duration 8w replication 1 shard duration 1344h
create retention policy C on telegraf duration 53w replication 1 shard duration 8904h
create retention policy D on telegraf duration INF replication 1 shard duration 520w

The metric collectors (Telegraf for this cpu metric) send their data to “A”.

Here is an example of a set of CQs on one metric measurement cpu (we run hundreds of them, performs excellent):

CREATE CONTINUOUS QUERY cpu_mean15m ON telegraf RESAMPLE EVERY 30m FOR 3h BEGIN SELECT mean(usage_guest) AS usage_guest, mean(usage_guest_nice) AS usage_guest_nice, mean(usage_idle) AS usage_idle, mean(usage_iowait) AS usage_iowait, mean(usage_irq) AS usage_irq, mean(usage_nice) AS usage_nice, mean(usage_softirq) AS usage_softirq, mean(usage_steal) AS usage_steal, mean(usage_system) AS usage_system, mean(usage_user) AS usage_user INTO telegraf.B.cpu FROM telegraf.A.cpu GROUP BY time(15m),* fill(none) END
CREATE CONTINUOUS QUERY cpu_mean2h ON telegraf RESAMPLE FOR 6h BEGIN SELECT mean(usage_guest) AS usage_guest, mean(usage_guest_nice) AS usage_guest_nice, mean(usage_idle) AS usage_idle, mean(usage_iowait) AS usage_iowait, mean(usage_irq) AS usage_irq, mean(usage_nice) AS usage_nice, mean(usage_softirq) AS usage_softirq, mean(usage_steal) AS usage_steal, mean(usage_system) AS usage_system, mean(usage_user) AS usage_user INTO telegraf.C.cpu FROM telegraf.A.cpu GROUP BY time(2h),* fill(none) END
CREATE CONTINUOUS QUERY cpu_mean1d ON telegraf RESAMPLE FOR 2d BEGIN SELECT mean(usage_guest) AS usage_guest, mean(usage_guest_nice) AS usage_guest_nice, mean(usage_idle) AS usage_idle, mean(usage_iowait) AS usage_iowait, mean(usage_irq) AS usage_irq, mean(usage_nice) AS usage_nice, mean(usage_softirq) AS usage_softirq, mean(usage_steal) AS usage_steal, mean(usage_system) AS usage_system, mean(usage_user) AS usage_user INTO telegraf.D.cpu FROM telegraf.A.cpu GROUP BY time(1d),* fill(none) END

We templated the RP, so the enduser has to choose A,B,C or D
All panels on our main dashboards use the templated RP.

The problem we have with this, is that it is too easy for users to make mistakes.
When they choose A , and select last 7 days , the queries become way too slow (tens of seconds). They’d have to manually select B or C from the Templating dropdown we made for Retention Policies, to get normal performance. And when zooming in on the data (for example by dragging a region), they’d have to remember to switch back at a certain point.

The bad performance is not related to disk io, it is purely the number crunching of going through each second (measurement per second) within 7 days of data of several servers (‘tags’). Even when using large ‘time buckets’, for example only 10 results per graph returned by the server, performance is not improved. We see a high CPU spike on the influxdb server, low io-wait, low io traffic.

Since all the downsampling by the CQ’s happens in near real-time, Grafana should select a different RP automatically when a bigger interval is selected.

Example: when $__interval is bigger than 15 minutes, we want Grafana to use retention policy B instead of A .

Any ideas on how to accomplish this?

I’m not sure yet what to ask for as a feature request. A query postprocessor would probably be most flexible, but an addition to Grafana’s templating framework could also be an option.

Kind regards, TW

1 Like

There’s an open issue for this, but doesn’t seem to be active.