I noticed an interesting issue with our Loki ingesters: the WAL directory is filling up the filesystem.
The relevant config:
wal:
  enabled: true
  dir: /var/loki/wal
  checkpoint_duration: 5m0s
  flush_on_shutdown: true
  replay_memory_ceiling: 4GB
And yet:
/dev/nvme2n1 19.5G 19.5G 0 100% /var/loki
I did not find anything obvious. Assuming replay_memory_ceiling does what I think it should (limit the WAL to 4GB), we should not be filling up a 20GB filesystem like this.
Is there a config change I need to make to cap the WAL at a certain size? Or do I just need to figure out a larger WAL size per instance somehow (based on ingestion rate)?
What could cause the WAL to fill up like this?
I think you may have misunderstood the replay_memory_ceiling configuration.
According to the documentation:
# Maximum memory size the WAL may use during replay. After hitting this, it
# will flush data to storage before continuing. A unit suffix (KB, MB, GB) may
# be applied.
# CLI flag: -ingester.wal-replay-memory-ceiling
[replay_memory_ceiling: <int> | default = 4GB]
Replay is what happens when ingesters exit unexpectedly and, on restart, try to replay what’s in the WAL directory. This setting has nothing to do with how big your WAL directory gets.
If you are running out of disk space on the WAL volume, I think the only option is to make it bigger.
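As I understand it, what actually drives the on-disk WAL size is how much data the ingester has accepted but not yet flushed to storage, so the flush-related ingester settings are the ones that indirectly bound it. A rough sketch of where they live (values are illustrative, not recommendations, and defaults depend on your Loki version):

ingester:
  chunk_idle_period: 30m        # flush chunks for streams that have stopped receiving logs
  max_chunk_age: 1h             # force-flush chunks after this age, even for active streams
  wal:
    enabled: true
    dir: /var/loki/wal
    checkpoint_duration: 5m     # how often the WAL is checkpointed so older segments can be dropped
    replay_memory_ceiling: 4GB  # only limits memory during replay after a crash, not disk usage

Shorter flush intervals mean less unflushed data sitting in the WAL at any moment, at the cost of more, smaller chunks landing in object storage.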
Does anyone know of a way to limit the size of the WAL stored per pod? Just throwing more storage at it feels wrong; there has to be some way to limit it, or at least a way to calculate the storage needed.
The load test I’m running has 300 random application names with the same log content. We are using JMeter to post it over and over at different numbers of virtual users.
I can see the WAL directory grow and shrink in size. It’s just not clear to me how to go about sizing it.
So based on the load generation, I ended up with a 60GB data volume for the ingester pod. I’m sure we can still fill it up, but the ingest rate of the pod itself is the limiting factor now, not the WAL space.
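In case it helps anyone doing the same, here is roughly what that looks like as a standalone PVC for the ingester data volume (a sketch with illustrative names; if you deploy via a Helm chart or StatefulSet, the volumeClaimTemplates handle this for you):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: loki-ingester-data      # illustrative name
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 60Gi             # sized empirically from the load test, with headroom above the peak WAL usage observed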