Several times per day we see WAL checkpoint files accumulating and consuming storage in our system, even though Loki has already processed and deleted them. We then need to restart Loki to free the space.
Sep 27 02:25:09 loki: level=info ts=2023-09-27T00:25:09.155276006Z caller=checkpoint.go:340 msg="attempting checkpoint for" dir=/srv/loki_wal/wal/checkpoint.031132
Sep 27 02:25:09 loki: level=info ts=2023-09-27T00:25:09.173460786Z caller=checkpoint.go:340 msg="attempting checkpoint for" dir=/srv/loki_wal/wal/checkpoint.031132
Sep 27 02:26:00 loki: level=info ts=2023-09-27T00:26:00.411291663Z caller=checkpoint.go:502 msg="atomic checkpoint finished" old=/srv/loki_wal/wal/checkpoint.031132.tmp new=/srv/loki_wal/wal/checkpoint.031132
Sep 27 02:26:00 loki: level=error ts=2023-09-27T00:26:00.65447823Z caller=checkpoint.go:617 msg="error checkpointing series" err="create new segment file: open /srv/loki_wal/wal/checkpoint.031132.tmp/00000002: no such file or directory"
The last line appears for every single checkpoint processed by Loki. Not all checkpoint files get 'stuck', though.
We then observe that the file remains open (along with the current checkpoint, checkpoint.031135), even though Loki has already removed/deleted it:
[root@lokihost wal]# lsof | grep deleted | grep loki
loki  13580         root  43w  REG  253,10  536838144  25  /srv/loki_wal/wal/checkpoint.031132/00000001 (deleted)
loki  13580  13588  root  10w  REG  253,10  536870912  27  /srv/loki_wal/wal/checkpoint.031135/00000001 (deleted)   <--- current WAL checkpoint
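To quantify how much space these deleted-but-still-open segments hold, we use a minimal sketch like the following (it assumes a single loki process and the default lsof column layout, where SIZE/OFF is the seventh field; lines that carry an extra TID column, like the second one above, would need adjusting):

# Sum the size (in MiB) of deleted-but-open files still held by the loki process.
lsof -p "$(pidof loki)" | awk '/deleted/ { sum += $7 } END { printf "%.1f MiB held by deleted files\n", sum / 1024 / 1024 }'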
This is our configuration for the ingester:
ingester:
  lifecycler:
    address: 127.0.0.1
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1
    final_sleep: 0s
  chunk_encoding: snappy
  chunk_retain_period: 10m
  chunk_idle_period: 24h
  chunk_target_size: 1572864
  wal:
    enabled: true
    dir: /srv/loki_wal/wal
    checkpoint_duration: 1m
    replay_memory_ceiling: 7GB
  max_transfer_retries: 0
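As a cross-check against the configured wal.dir, this is a rough sketch of how we compare what is left on disk with what the loki process still holds open (it only relies on the path from the configuration above and the standard /proc layout on Linux; deleted targets show up with a "(deleted)" suffix):

# Checkpoint directories still present on disk under the configured WAL dir.
ls -d /srv/loki_wal/wal/checkpoint.* 2>/dev/null
# File descriptors the running loki process still holds on checkpoint segments.
ls -l /proc/"$(pidof loki)"/fd | grep checkpoint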
We are also running Loki with the same settings in other environments and do not see this issue there.