Loki compactor retention batch fails from a single chunk error and leaves stale index references

ryanmin · June 8, 2026, 6:08pm

Recently, our distributed open-source Loki deployment (v3.4.2) has had trouble completing ad hoc log deletions initiated through the API. We use Azure as the backing storage for our observability data, and we occasionally see network throttling errors (503). According to the code here, a single one of these errors appears to stop the entire batch of retention requests: loki/pkg/compactor/retention/retention.go at v3.4.2 · grafana/loki · GitHub

level=error ts=2026-06-06T17:53:01.382158024Z caller=compactor.go:662 msg="failed to compact files" table=loki_index_20026 err="failed to rewrite chunk"

This also appears to leave stale index references for chunks that were already deleted as part of partially completed delete requests, since the indexes don’t get updated until retention has completed with no errors: loki/pkg/compactor/table.go at v3.4.2 · grafana/loki · GitHub

This is making me nervous that we have some indexes in a bad state. Is my interpretation of the source code here right? If so, anything we can do to avoid having a single one of these errors (which may be transient) kill the entire batch and have it start over? Or anything to do about the stale index references?
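To make the concern concrete, here is a toy model of the behaviour as I read it. This is not Loki's code: `apply_delete_batch`, `TransientStorageError`, and the chunk IDs are all made up for illustration, and it assumes chunks are removed as they are processed while the index is only updated once the whole batch succeeds.

```python
class TransientStorageError(Exception):
    """Stands in for a throttled (503) object-storage call."""


def apply_delete_batch(index, store, chunk_ids, failing_ids):
    """Toy all-or-nothing batch (hypothetical, not Loki's implementation).

    Chunks are removed from `store` as they are processed, but `index`
    is only updated if every chunk in the batch succeeded.
    """
    deleted = []
    try:
        for chunk_id in chunk_ids:
            if chunk_id in failing_ids:
                raise TransientStorageError(chunk_id)
            store.discard(chunk_id)  # chunk is gone from storage immediately
            deleted.append(chunk_id)
    except TransientStorageError:
        return False  # index untouched, so earlier deletions are now stale
    index.difference_update(deleted)  # reached only when the whole batch succeeded
    return True


index = {"c1", "c2", "c3"}
store = {"c1", "c2", "c3"}
ok = apply_delete_batch(index, store, ["c1", "c2", "c3"], failing_ids={"c3"})
stale = index - store  # index entries that point at chunks no longer in storage
print(ok, sorted(stale))
```

In this model, one failure on the last chunk leaves the index still referencing the two chunks that were already removed, which is the state I am worried about.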

tonyswumac · June 8, 2026, 8:31pm

I believe the compactor will retry if a batch fails. It may also be a good idea to set a maximum retention on your object storage.

For example, in our production Loki cluster our maximum retention for any org is 365 days, so we know any chunk file or index file older than that is of no use. So we have a retention policy to delete any file after 375 days, under the index directory path and any other path with the org name.
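Since the original poster is on Azure, a rule like this could be expressed as a Blob Storage lifecycle management policy. The sketch below only builds the policy JSON; the rule name, container name (`loki-data`), and prefixes are placeholders, not values from our cluster, and the age is counted from each blob's last modification time.

```python
import json


def lifecycle_rule(name, prefixes, max_age_days):
    """Build one Azure Blob lifecycle rule that deletes old block blobs.

    `prefixes` must start with the container name, e.g. "loki-data/index/".
    """
    return {
        "enabled": True,
        "name": name,
        "type": "Lifecycle",
        "definition": {
            "actions": {
                "baseBlob": {
                    "delete": {"daysAfterModificationGreaterThan": max_age_days}
                }
            },
            "filters": {
                "blobTypes": ["blockBlob"],
                "prefixMatch": list(prefixes),
            },
        },
    }


# Placeholder container and tenant paths; adjust to your own bucket layout.
policy = {
    "rules": [
        lifecycle_rule(
            "lokiExpireOldObjects",
            ["loki-data/index/", "loki-data/tenant-a/"],
            375,
        )
    ]
}
print(json.dumps(policy, indent=2))
```

The 10-day margin over the longest Loki retention (365 days) is there so the storage policy never removes anything Loki might still need.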

ryanmin · June 8, 2026, 9:05pm

My team is still discussing retention policies, but since we are in an enterprise setting it may be a while until we reach a consensus. Is there any other suggested way to deal with these stale index references without enforcing retention?

For example, any way to rebuild the indexes from the current chunk data?

tonyswumac · June 8, 2026, 9:45pm

I don’t believe so.

They shouldn’t cause any problems for Loki though, except potentially indicating that data exists when it doesn’t.

ryanmin · June 10, 2026, 6:04pm

Okay, understood. Thanks for the context.

Since a single error kills an entire batch, another idea I had was to reduce the retention batch size to a much smaller value (maybe even 1) so that each batch has a higher chance of completing. Since this is such a big drop from the default size of 70, will it cause any other problems?
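A rough back-of-the-envelope check of that intuition, assuming each item in a batch fails independently with the same probability. The 1% failure rate is an invented figure for illustration, not something measured on our cluster.

```python
def batch_success_probability(per_item_failure_rate, batch_size):
    """Probability that every item in a batch succeeds, assuming
    independent failures with the same rate for each item."""
    return (1.0 - per_item_failure_rate) ** batch_size


# Hypothetical 1% chance that any single item hits a throttling error.
for size in (70, 10, 1):
    p = batch_success_probability(0.01, size)
    print(f"batch_size={size:>2}: P(batch completes) = {p:.3f}")
```

Under those assumptions a batch of 70 completes only about half the time, while a batch of 1 completes 99% of the time. Real throttling errors tend to come in bursts rather than independently, so this is only a sketch of the trend, not a prediction.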
