Grafana Loki S3 cost spike

Hello,
I recently had an incident with a Grafana Loki installation on Kubernetes (EKS) using S3 storage, where the cost of S3 transfers (PUT/GET requests) went up 250x. The issue was mainly caused by the loki-write pod being unable to flush data after sending it, so it kept sending the same objects over and over again. Here is what I discovered after investigating:

  • CPU/RAM usage was very high for loki-write (0.8 CPU and 17 GB RAM)
  • The Kubernetes volume for loki-write was full (no space left)

Here is the error that was logged: err="store put chunk: open /var/loki/boltdb-shipper-active/loki_index_ no space left on device

This was resolved by manually deleting files from the loki-write disk and restarting its pod.
Now the question is: how can I configure Loki so that this does not happen again, since I do not know what caused the high CPU, RAM, and disk usage of loki-write? And why could Loki not flush old data?

Thanks in advance.
Best regards.


I’ve never run into this. Two things I think you should at least consider:

  1. Have monitoring on your loki-write pod’s storage, especially the WAL storage (if you are using the community Helm chart this should be a persistent volume).
  2. Loki emits metrics; you should scrape them and forward them to a Prometheus instance if you have one, then create alerts based on those metrics, especially the ones that measure S3 operation failures (see the alert rule sketch after this list).
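For the disk side, here is a minimal sketch of an alert rule using the standard kubelet_volume_stats_* metrics. The PVC name pattern and namespace are assumptions; adjust them to whatever your Helm release actually creates:

```yaml
# PrometheusRule sketch: alert when the loki-write PVC is almost full.
# The PVC name pattern (data-loki-write-.*) and namespace (loki) are
# assumptions; match them to your own install.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: loki-write-disk
  namespace: loki
spec:
  groups:
    - name: loki-write-disk
      rules:
        - alert: LokiWritePVCAlmostFull
          expr: |
            kubelet_volume_stats_available_bytes{namespace="loki", persistentvolumeclaim=~"data-loki-write-.*"}
              / kubelet_volume_stats_capacity_bytes{namespace="loki", persistentvolumeclaim=~"data-loki-write-.*"}
              < 0.10
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "loki-write volume has less than 10% free space"
```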

Hello,

Thanks for the response. Right now Loki is behaving normally, sending data to S3 and flushing it from its persistent volume. However, we’ll make sure to monitor it so we’re alerted to any unusual activity.

If the incident were to happen again, is there a way to stop Loki from sending old data that’s present on its persistent volume to S3 (if it doesn’t get flushed)? I saw that there are parameters in the limits_config block like “reject_old_samples_max_age” and others; will these prevent Loki from resending old data to S3 if it’s not flushed, assuming the data has already been ingested and is present on the persistent volume?
Any ideas on this would help us a lot.
Thanks.

The first potential solution that comes to mind is to just clear out the persistent volume. Presumably Loki would then start up as if there were no WAL logs. I’ve not tried this and I don’t know if it’s a good idea (especially if you set replication higher than 1).

Looking at the configuration, it looks like you can set a memory ceiling for WAL replay, but that won’t necessarily limit your S3 request count.
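If you want to try that, here is a minimal sketch of what I mean, assuming the option is ingester.wal.replay_memory_ceiling (check the docs for your Loki version to confirm the exact key and default):

```yaml
# Loki config sketch (assumes ingester.wal.replay_memory_ceiling exists
# in your version; verify against the docs for your release).
ingester:
  wal:
    enabled: true
    dir: /var/loki/wal
    # Cap how much memory WAL replay may use on startup; above this,
    # Loki flushes replayed data to storage instead of buffering it all.
    replay_memory_ceiling: 2GB
```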

As of now, Loki is up and running normally. But what I really want to know is whether the “reject_old_samples_max_age” parameter set in limits_config will have any impact on requests for old data.

Loki will reject old messages if reject_old_samples is set to true (with reject_old_samples_max_age controlling how old is too old). Loki will also reject old messages, regardless of that setting, if a log stream already has more recent logs.
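For reference, here is a minimal limits_config sketch with those two settings; the 168h value is just an example, pick whatever age cutoff fits your retention:

```yaml
# limits_config sketch: reject incoming log lines older than the max age.
# 168h (7 days) is only an example value.
limits_config:
  reject_old_samples: true
  reject_old_samples_max_age: 168h
```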

I will set it anyway just to be sure. I guess there’s not really a 100% sure fix or a config to prevent this from happening in the future.
Thanks for the assistance.