Grafana Loki S3 cost spike

Hello,
I recently had an incident with a Grafana Loki installation on Kubernetes (EKS) using S3 storage, where the cost of S3 transfers (PUT/GET requests) went up 250x. The issue was mainly caused by the loki-write pod being unable to flush data after sending it, so it kept sending the same objects over and over again. Here is what I discovered after investigating:

  • CPU/RAM usage was very high for loki-write (0.8 CPU and 17 GB RAM)
  • The Kubernetes volume for loki-write was full (no space left)

Here is the error that was logged: err="store put chunk: open /var/loki/boltdb-shipper-active/loki_index_ no space left on device

This was resolved by manually deleting files from the loki-write disk and restarting its pod.
Now the question is: how can I configure Loki so that this does not happen again, since I do not know what caused the high CPU, RAM, and disk usage of loki-write? And why could Loki not flush old data?

Thanks in advance.
Best regards.


I’ve never run into this. Two things I think you should at least consider:

  1. Have monitoring on your loki-write pod’s storage, especially the WAL storage (if you are using the community Helm chart this should be a persistent volume).
  2. Loki emits metrics; you should scrape them and forward them to a Prometheus instance if you have one, then create alerts based on those metrics, especially the ones that measure S3 operation failures (see the alert rule sketch after this list).
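For the disk side, here is a minimal sketch of an alert rule using the standard kubelet_volume_stats_* metrics. The PVC name pattern and namespace are assumptions; adjust them to whatever your Helm release actually creates:

```yaml
# PrometheusRule sketch: alert when the loki-write PVC is almost full.
# The PVC name pattern (data-loki-write-.*) and namespace (loki) are
# assumptions; match them to your own install.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: loki-write-disk
  namespace: loki
spec:
  groups:
    - name: loki-write-disk
      rules:
        - alert: LokiWritePVCAlmostFull
          expr: |
            kubelet_volume_stats_available_bytes{namespace="loki", persistentvolumeclaim=~"data-loki-write-.*"}
              / kubelet_volume_stats_capacity_bytes{namespace="loki", persistentvolumeclaim=~"data-loki-write-.*"}
              < 0.10
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "loki-write volume has less than 10% free space"
```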

Hello,

Thanks for the response. Right now Loki is behaving normally, sending data to S3 and flushing it from its persistent volume. However, we’ll make sure to monitor it so we’re alerted to any unusual activity.

If the incident were to happen again, is there a way to stop Loki from sending old data that’s present on its persistent volume to S3 (if it doesn’t get flushed)? I saw that there are parameters in the limits_config block like “reject_old_samples_max_age” and others; will these prevent Loki from resending old data to S3 if it’s not flushed, assuming the data has already been ingested and is present on the persistent volume?
Any ideas on this would help us a lot.
Thanks.

The first potential solution that comes to mind is to just clear out the persistent volume. Presumably Loki would then start up as if there were no WAL logs. I’ve not tried this and I don’t know if it’s a good idea (especially if you set replication higher than 1).

Looking at the configuration, it looks like you can set a memory ceiling for WAL replay, but that won’t necessarily limit your S3 request count.
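If you want to try that, here is a minimal sketch of what I mean, assuming the option is ingester.wal.replay_memory_ceiling (check the docs for your Loki version to confirm the exact key and default):

```yaml
# Loki config sketch (assumes ingester.wal.replay_memory_ceiling exists
# in your version; verify against the docs for your release).
ingester:
  wal:
    enabled: true
    dir: /var/loki/wal
    # Cap how much memory WAL replay may use on startup; above this,
    # Loki flushes replayed data to storage instead of buffering it all.
    replay_memory_ceiling: 2GB
```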

As of now, Loki is up and running normally. But what I really want to know is whether the “reject_old_samples_max_age” parameter set in limits_config will have any impact on requests for old data.

Loki will reject old messages if reject_old_samples is set to true (with reject_old_samples_max_age controlling how old is too old). Loki will also reject old messages, regardless of that setting, if a log stream already has more recent logs.
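For reference, here is a minimal limits_config sketch with those two settings; the 168h value is just an example, pick whatever age cutoff fits your retention:

```yaml
# limits_config sketch: reject incoming log lines older than the max age.
# 168h (7 days) is only an example value.
limits_config:
  reject_old_samples: true
  reject_old_samples_max_age: 168h
```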

I will set it anyway just to be sure. I guess there’s not really a 100% sure fix or a config to prevent this from happening in the future.
Thanks for the assistance.