WAL files deleted but staying open

Several times per day we see WAL files accumulating and consuming storage on our system, even though they have been processed and deleted by Loki. We then need to restart Loki to clear them.

For example:
File: checkpoint.031132.tmp

Processed:

Sep 27 02:25:09  loki: level=info ts=2023-09-27T00:25:09.155276006Z caller=checkpoint.go:340 msg="attempting checkpoint for" dir=/srv/loki_wal/wal/checkpoint.031132
Sep 27 02:25:09  loki: level=info ts=2023-09-27T00:25:09.173460786Z caller=checkpoint.go:340 msg="attempting checkpoint for" dir=/srv/loki_wal/wal/checkpoint.031132
Sep 27 02:26:00  loki: level=info ts=2023-09-27T00:26:00.411291663Z caller=checkpoint.go:502 msg="atomic checkpoint finished" old=/srv/loki_wal/wal/checkpoint.031132.tmp new=/srv/loki_wal/wal/checkpoint.031132
Sep 27 02:26:00  loki: level=error ts=2023-09-27T00:26:00.65447823Z caller=checkpoint.go:617 msg="error checkpointing series" err="create new segment file: open /srv/loki_wal/wal/checkpoint.031132.tmp/00000002: no such file or directory"

The last line (the error) is logged for every single checkpoint processed by Loki. Not all checkpoint files get ‘stuck’, though.

We then observe that the file remains open (along with the current checkpoint - checkpoint.031135) although it was removed/deleted by Loki:

[root@lokihost wal]# lsof | grep deleted | grep loki
loki      13580               root   43w      REG             253,10 536838144         25 /srv/loki_wal/wal/checkpoint.031132/00000001 (deleted)
loki      13580 13588         root   10w      REG             253,10 536870912         27 /srv/loki_wal/wal/checkpoint.031135/00000001 (deleted) <--- current WAL checkpoint
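
To put a number on how much space these deleted-but-open files still pin, the lsof output can be summed. This is only a rough sketch: the function name `deleted_open_bytes` is made up for illustration, and it assumes lsof's default column layout, where SIZE/OFF is the fourth field from the end on lines ending in `(deleted)`, and file names without spaces.

```shell
# Hypothetical helper: reads `lsof` output on stdin and prints the total
# size in bytes of all files marked "(deleted)".
# Counting fields from the end keeps it working whether or not the
# optional TID column is present.
deleted_open_bytes() {
  awk '$NF == "(deleted)" { sum += $(NF - 3) } END { printf "%.0f\n", sum }'
}

# Usage, mirroring the command above:
#   lsof | grep loki | deleted_open_bytes
```

Note that this also counts the current checkpoint, which is expected to be open.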

This is our configuration for the ingester:

ingester:
  lifecycler:
    address: 127.0.0.1
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1
    final_sleep: 0s
  chunk_encoding: snappy
  chunk_retain_period: 10m
  chunk_idle_period: 24h
  chunk_target_size: 1572864
  wal:
    enabled: true
    dir: /srv/loki_wal/wal
    checkpoint_duration: 1m
    replay_memory_ceiling: 7GB
  max_transfer_retries: 0

We also run Loki with the same settings in other environments and do not see this issue there.

Hi, I have the same problem and I am looking for solutions.

Hi,

I fixed it at my side.
I had 2 instances of Loki writing to the same WAL directory, and that was causing the issue. I created a separate WAL directory for each instance and the issue was fixed.
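
A quick way to check for this situation might be to count how many distinct Loki processes hold files open under the WAL directory. This is a minimal sketch, not an official procedure: the function name `loki_pid_count` is invented here, and it assumes the process shows up in lsof with the command name `loki` and the PID in the second column, as in the output from the first post.

```shell
# Hypothetical helper: reads `lsof` output on stdin and prints the number
# of distinct PIDs whose command name is "loki".
loki_pid_count() {
  awk '$1 == "loki" { pids[$2] = 1 }
       END { n = 0; for (p in pids) n++; print n }'
}

# Usage (the path is the WAL dir from the config in the first post):
#   lsof +D /srv/loki_wal/wal | loki_pid_count
```

A result greater than 1 would suggest more than one instance is using the same directory.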

We turned off the WAL. It happened in only one instance and we are not sure why it happened.

Also check the checkpoint duration. Maybe you are writing new checkpoints before the previous one has finished. I am not sure whether that can cause issues…
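
One way to check this could be to measure how long each checkpoint takes from the Loki log and compare it with `checkpoint_duration`. The sketch below is an assumption-laden illustration: the function name `checkpoint_elapsed` is made up, it relies on the `ts=`, `dir=` and `new=` fields looking exactly as in the log lines from the first post, it uses the first "attempting checkpoint" line per directory, and it does not handle checkpoints that cross midnight.

```shell
# Hypothetical helper: reads Loki log lines on stdin and prints, for each
# finished checkpoint, its directory and the elapsed time in seconds.
checkpoint_elapsed() {
  awk '
    # Convert the ts=YYYY-MM-DDTHH:MM:SS.fffZ field to seconds since midnight.
    function secs(line,    t, p) {
      if (!match(line, /ts=[0-9-]+T[0-9:.]+Z/)) return -1
      t = substr(line, RSTART, RLENGTH)
      sub(/^ts=[0-9-]+T/, "", t); sub(/Z$/, "", t)
      split(t, p, ":")
      return p[1] * 3600 + p[2] * 60 + p[3]
    }
    /msg="attempting checkpoint for"/ {
      if (match($0, /dir=[^ ]+/)) {
        d = substr($0, RSTART + 4, RLENGTH - 4)
        if (!(d in start)) start[d] = secs($0)   # keep the first attempt only
      }
    }
    /msg="atomic checkpoint finished"/ {
      if (match($0, /new=[^ ]+/)) {
        d = substr($0, RSTART + 4, RLENGTH - 4)
        if (d in start) printf "%s %.3f\n", d, secs($0) - start[d]
      }
    }
  '
}

# Usage (log source is an assumption, adjust to where Loki logs on your host):
#   journalctl -u loki | checkpoint_elapsed
```

For the log excerpt in the first post this gives about 51 seconds for checkpoint.031132, which is close to the configured `checkpoint_duration: 1m`.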