Several times per day we see WAL checkpoint files accumulating and consuming storage in our system, even though Loki has already processed and deleted them. We then need to restart Loki to free the space.
Sep 27 02:25:09 loki: level=info ts=2023-09-27T00:25:09.155276006Z caller=checkpoint.go:340 msg="attempting checkpoint for" dir=/srv/loki_wal/wal/checkpoint.031132
Sep 27 02:25:09 loki: level=info ts=2023-09-27T00:25:09.173460786Z caller=checkpoint.go:340 msg="attempting checkpoint for" dir=/srv/loki_wal/wal/checkpoint.031132
Sep 27 02:26:00 loki: level=info ts=2023-09-27T00:26:00.411291663Z caller=checkpoint.go:502 msg="atomic checkpoint finished" old=/srv/loki_wal/wal/checkpoint.031132.tmp new=/srv/loki_wal/wal/checkpoint.031132
Sep 27 02:26:00 loki: level=error ts=2023-09-27T00:26:00.65447823Z caller=checkpoint.go:617 msg="error checkpointing series" err="create new segment file: open /srv/loki_wal/wal/checkpoint.031132.tmp/00000002: no such file or directory"
The last line appears for every single checkpoint processed by Loki. Not all checkpoint files get 'stuck', though.
We then observe that the file remains open (along with the current checkpoint, checkpoint.031135), even though Loki has already removed/deleted it:
[root@lokihost wal]# lsof | grep deleted | grep loki
loki  13580         root  43w  REG  253,10  536838144  25  /srv/loki_wal/wal/checkpoint.031132/00000001 (deleted)
loki  13580  13588  root  10w  REG  253,10  536870912  27  /srv/loki_wal/wal/checkpoint.031135/00000001 (deleted)   <--- current WAL checkpoint
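To quantify how much space these deleted-but-still-open segments hold, we use a minimal sketch like the following (it assumes a single loki process and the default lsof column layout, where SIZE/OFF is the seventh field; lines that carry an extra TID column, like the second one above, would need adjusting):

# Sum the size (in MiB) of deleted-but-open files still held by the loki process.
lsof -p "$(pidof loki)" | awk '/deleted/ { sum += $7 } END { printf "%.1f MiB held by deleted files\n", sum / 1024 / 1024 }'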
This is our configuration for the ingester:
ingester:
  lifecycler:
    address: 127.0.0.1
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1
    final_sleep: 0s
  chunk_encoding: snappy
  chunk_retain_period: 10m
  chunk_idle_period: 24h
  chunk_target_size: 1572864
  wal:
    enabled: true
    dir: /srv/loki_wal/wal
    checkpoint_duration: 1m
    replay_memory_ceiling: 7GB
  max_transfer_retries: 0
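As a cross-check against the configured wal.dir, this is a rough sketch of how we compare what is left on disk with what the loki process still holds open (it only relies on the path from the configuration above and the standard /proc layout on Linux; deleted targets show up with a "(deleted)" suffix):

# Checkpoint directories still present on disk under the configured WAL dir.
ls -d /srv/loki_wal/wal/checkpoint.* 2>/dev/null
# File descriptors the running loki process still holds on checkpoint segments.
ls -l /proc/"$(pidof loki)"/fd | grep checkpoint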
We are also running Loki with the same settings in other environments and do not see this issue there.