Hey everyone,
We’re running into an issue with our on-prem Loki setup (not on Kubernetes) and wanted to see if anyone else has faced something similar or has ideas on how to fix it.
Our Setup
- Cluster: 2-node Loki setup (Ingester / Distributor / Querier roles)
- Promtail: Managed by our DevOps team (runs as a DaemonSet; we don't have admin access, so there's a bit of a dependency there)
- Storage: Everything is on local filesystem mounts: `/data/loki/wal`, `/data/loki/chunks`, `/data/loki/index`
- Retention: Configured in `loki-config.yaml` for 5 days. We've also tuned parameters like chunk block size (2 MB) to optimize space usage.
- Deletion API: We've also tried the `curl` deletion API, but our understanding is that it only marks data for deletion in the index; it doesn't immediately remove chunk files. (How we double-check that the delete requests are registered is shown right after this list.)
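For what it's worth, this is how we've been sanity-checking that our delete requests at least get registered. As far as we understand the docs, a GET on the same delete endpoint lists the requests for the tenant along with a status field (the placeholders are the same ones as in the curl call at the bottom of this post):

```bash
# List the delete requests registered for our tenant (same endpoint, but GET).
# Each entry should come back with its query, time range, and a status field.
curl -s -H "X-Scope-OrgID: <tenant_id>" \
  "http://<loki_host>:3100/loki/api/v1/delete"
```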
What We’re Observing
- Chunk file timestamps keep changing: Even old chunk files under `/data/loki/chunks` keep getting their "modified date" updated. When we check with `ls -ltrh /data/loki/chunks`, we can see the modification dates changing regularly, even for files that should be past retention.
- Retention logic not kicking in: We wrote a small custom cleanup script to delete chunks older than 5 days (a simplified sketch follows this list), but since the modified date keeps updating, nothing ever qualifies for deletion.
- Disk usage keeps going up: Despite retention being set to 5 days (and even after testing the deletion API), the `/data/loki/chunks` folder keeps growing in size. It doesn't seem like the old data is actually being removed.
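For reference, the cleanup script boils down to a `find` over the chunks directory; this is a simplified sketch using the paths from our config:

```bash
#!/usr/bin/env bash
# Simplified sketch of our cleanup script: delete chunk files whose
# modification time is more than 5 days old. This is exactly the check
# that never matches, because the files' mtime keeps getting refreshed.
find /data/loki/chunks -type f -mtime +5 -print -delete
```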
Questions
- Why would Loki be updating chunk file modification timestamps?
- How can we make sure chunks are actually deleted after retention (and not just marked)?
- Is there a better approach to enforcing retention-based cleanup when running Loki on filesystem storage (not on Kubernetes and not using object storage)?
- Could something like WAL replay or a background process (e.g. the compactor) be touching those files and resetting their timestamps?
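On that last question: we haven't pinned down yet what is actually touching the files. In case it helps frame an answer, this is roughly how we're planning to check, assuming inotify-tools and auditd are available on the hosts (nothing Loki-specific here):

```bash
# Watch the chunks tree for metadata changes; mtime updates show up as
# ATTRIB events, so this tells us how often the files really get touched.
inotifywait -m -r -e attrib,modify /data/loki/chunks

# To see which process is doing the touching, add an audit watch on the
# directory and query the audit log after a while.
auditctl -w /data/loki/chunks -p wa -k loki_chunks
ausearch -k loki_chunks --start recent
```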
Loki version: 2.9.3

Our `loki-config.yaml`:

```yaml
ingester:
  lifecycler:
    ring:
      kvstore:
        store: memberlist
      replication_factor: 1
    final_sleep: 0s
  chunk_idle_period: 5m
  chunk_retain_period: 30s
  max_transfer_retries: 0
  chunk_block_size: 2097152
  chunk_encoding: snappy
  max_chunk_age: 2h

distributor:
  ring:
    kvstore:
      store: memberlist

querier:
  engine:
    timeout: 1m
  query_ingesters_within: 2h

query_range:
  align_queries_with_step: true
  max_retries: 5
  parallelise_shardable_queries: true

frontend:
  log_queries_longer_than: 5s
  compress_responses: true

frontend_worker:
  frontend_address: X.X.X.X

schema_config:
  configs:
    - from: 2020-10-24
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h

storage_config:
  boltdb_shipper:
    active_index_directory: /data/loki/index
    cache_location: /data/loki/index_cache
    shared_store: filesystem
  filesystem:
    directory: /data/loki/chunks

compactor:
  working_directory: /data/loki/compactor
  shared_store: filesystem
  retention_enabled: true
  delete_request_cancel_period: 1m
  deletion_mode: filter-and-delete

limits_config:
  ingestion_burst_size_mb: 500
  ingestion_rate_mb: 1024
  ingestion_rate_strategy: global
  max_cache_freshness_per_query: 10m
  max_global_streams_per_user: 50000
  max_query_parallelism: 128
  per_stream_rate_limit: 500MB
  per_stream_rate_limit_burst: 500MB
  reject_old_samples: true
  reject_old_samples_max_age: 120h
  retention_period: 5d
  split_queries_by_interval: 15m
  enforce_metric_name: false
  max_entries_limit_per_query: 80000
  allow_deletes: true

chunk_store_config:
  max_look_back_period: 0s

table_manager:
  retention_deletes_enabled: true
  retention_period: 5d
```

The delete API call we've been using:
```bash
curl -X DELETE "http://<loki_host>:3100/loki/api/v1/delete" \
  -H "X-Scope-OrgID: <tenant_id>" \
  --data-urlencode 'query={job="varlogs"} |~ "sensitive_data"' \
  --data-urlencode 'start=<start_timestamp>' \
  --data-urlencode 'end=<end_timestamp>'
```

If anyone has seen this behavior or has tips on how to manage chunk cleanup and ensure retention works effectively in a non-K8s setup, please share your thoughts.
Thanks in advance!