Recently our Loki open source distributed deployment (v3.4.2) has had trouble completing API-initiated ad hoc log deletions. We use Azure as the backing storage for our observability data, and we occasionally see network throttling errors (503). Judging by the code, a single one of these appears to abort the entire batch of retention requests: loki/pkg/compactor/retention/retention.go at v3.4.2 · grafana/loki · GitHub
level=error ts=2026-06-06T17:53:01.382158024Z caller=compactor.go:662 msg="failed to compact files" table=loki_index_20026 err="failed to rewrite chunk"
This also appears to leave stale index references to chunks that were deleted as part of partially completed delete requests, since the indexes are not updated until retention has completed without errors: loki/pkg/compactor/table.go at v3.4.2 · grafana/loki · GitHub
This is making me nervous that we have some indexes in a bad state. Is my interpretation of the source code here right? If so, anything we can do to avoid having a single one of these errors (which may be transient) kill the entire batch and have it start over? Or anything to do about the stale index references?
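To make the failure mode I am describing concrete, here is a toy Python model of the behavior as I read it. It is not Loki's code: `run_batch`, `TransientStorageError`, and the `storage` dict are made up for illustration, and the assumption that the index is only updated after a fully successful batch is my interpretation of the source.

```python
class TransientStorageError(Exception):
    """Stands in for a throttling response such as HTTP 503."""


def run_batch(chunks, index, storage):
    """Delete chunks one by one and abort on the first error.

    Assumption: the index is only updated once every deletion in the
    batch has succeeded, so an early abort leaves it untouched.
    """
    for chunk in chunks:
        if chunk in storage["throttled"]:
            raise TransientStorageError(chunk)
        storage["chunks"].discard(chunk)
    # Only reached when no deletion raised.
    index.difference_update(chunks)


# Chunk "c" is throttled, so the batch aborts after "a" and "b" are gone.
storage = {"chunks": {"a", "b", "c"}, "throttled": {"c"}}
index = {"a", "b", "c"}
try:
    run_batch(["a", "b", "c"], index, storage)
except TransientStorageError:
    pass

# Index entries that point at chunks which no longer exist.
stale = index - storage["chunks"]
```

In this model `stale` ends up holding the two chunks that were removed before the error, which is the state I am worried about.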
I believe the compactor will retry if a batch fails. It may also be a good idea to set a maximum retention on your object storage.
For example, in our production Loki cluster our maximum retention for any org is 365 days, so we know any chunk file or index file older than that is of no use. So we have a retention policy to delete any file after 375 days, under the index directory path and any other path with the org name.
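As a rough sketch of that rule in Python: the function below only decides which objects would qualify and deletes nothing. The `expired_blobs` helper, the `(name, last_modified)` tuple layout, and the example prefixes are hypothetical. In practice this kind of rule would normally be configured in the storage provider's own lifecycle policy rather than in a script.

```python
from datetime import datetime, timedelta, timezone

# 365-day maximum retention plus a 10-day safety margin.
MAX_AGE = timedelta(days=375)


def expired_blobs(blobs, prefixes, now):
    """Return names of blobs under one of `prefixes` older than MAX_AGE.

    `blobs` is an iterable of (name, last_modified) pairs, where
    last_modified is a timezone-aware datetime.
    """
    cutoff = now - MAX_AGE
    wanted = tuple(prefixes)
    return [
        name
        for name, last_modified in blobs
        if name.startswith(wanted) and last_modified < cutoff
    ]
```

Calling it with prefixes such as `["index/", "tenant-a/"]` (made-up names) would return only the objects under those paths that were last modified more than 375 days before `now`.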
My team is still discussing retention policies, but since we are in an enterprise setting it may be a while until we reach a consensus. Is there any other suggested way to deal with these stale index references without enforcing retention?
For example, any way to rebuild the indexes from the current chunk data?
I don’t believe so.
They shouldn’t cause any problems for Loki, though, except potentially reporting data that is not actually there.
Okay, understood. Thanks for the context.
Since a single error kills entire batches, another idea I had was to reduce the retention batch size to a much smaller value (maybe even 1) so the batches will have a higher chance of completing. Since this is such a big drop from the default size of 70, will this cause any other problems?
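My back-of-the-envelope reasoning, as a small Python sketch. It assumes that every operation in a batch fails independently with the same probability and that the batch size counts those operations. Both are simplifications, and I am not sure the second matches what the setting actually controls. The function names are my own.

```python
def batch_success_probability(p_error, batch_size):
    """Chance that a batch completes when any single error aborts it.

    Assumes each of the `batch_size` operations fails independently
    with probability `p_error`.
    """
    return (1.0 - p_error) ** batch_size


def expected_attempts(p_error, batch_size):
    """Average number of tries until one batch completes without errors."""
    return 1.0 / batch_success_probability(p_error, batch_size)
```

With a 1% per-operation error rate, a batch of 70 would complete a little under half the time under these assumptions, while a batch of 1 would complete 99% of the time, at the cost of many more batches overall.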