Loki In Kubernetes - Write goes 100% Error Rate

In two different clusters, our Loki write pods periodically reach a 100% error rate, and the pods' memory usage appears to climb along with it. We are running on EKS with S3 backends.

The pods themselves log this:

level=warn ts=2023-06-09T01:54:15.351501857Z caller=logging.go:86 traceID=7e4ab3ddf089a6e7 orgID=fake msg="POST /loki/api/v1/push (500) 5.000473365s Response: \"rpc error: code = DeadlineExceeded desc = context deadline exceeded\\n\" ws: false; Content-Length: 1757; Content-Type: application/x-protobuf; L5d-Client-Id: loki-sa.loki.serviceaccount.identity.linkerd.cluster.local; User-Agent: promtail/2.8.2; "
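To put a number on the error rate from the pod logs, a minimal Python sketch along these lines could tally push requests by status code. It assumes every request is logged in the same `msg="METHOD path (status) duration ..."` shape as the line above (successful requests may only appear at a more verbose log level). The names `parse_request` and `push_error_rate` are made up for illustration and are not part of Loki.

```python
import re

# Request summary as it appears inside msg="..." in the log line above,
# e.g. POST /loki/api/v1/push (500) 5.000473365s
REQUEST_RE = re.compile(
    r'msg="(?P<method>[A-Z]+) (?P<path>\S+) '
    r'\((?P<status>\d{3})\) (?P<value>\d+(?:\.\d+)?)(?P<unit>ns|µs|ms|s)'
)

# Duration suffixes handled by this sketch, converted to seconds.
UNIT_SECONDS = {"ns": 1e-9, "µs": 1e-6, "ms": 1e-3, "s": 1.0}


def parse_request(line):
    """Return method, path, status and duration in seconds, or None if no match."""
    m = REQUEST_RE.search(line)
    if m is None:
        return None
    return {
        "method": m["method"],
        "path": m["path"],
        "status": int(m["status"]),
        "seconds": float(m["value"]) * UNIT_SECONDS[m["unit"]],
    }


def push_error_rate(lines):
    """Fraction of parsed push requests that returned a 5xx status."""
    pushes = [
        r for r in map(parse_request, lines)
        if r is not None and r["path"] == "/loki/api/v1/push"
    ]
    if not pushes:
        return None  # no push requests found in the input
    return sum(r["status"] >= 500 for r in pushes) / len(pushes)
```

Feeding the write pod's log lines to `push_error_rate` gives the fraction of failed pushes, and the `seconds` field shows whether the failures cluster around a fixed timeout, as the roughly 5 s duration in the line above suggests.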

These are small, very low-usage clusters, so I am confused about why this happens roughly once a week. The only fix so far has been to kill the pods.

Do you have any debug logs from your writers?

The only logs being generated are the ones I posted.

If I enable debug logs, it will probably run for a week before it errors. I'm not sure whether running in debug mode for that long will cause more issues.

Your log simply states that pushing logs to Loki failed, which is expected if your writers aren’t operational. I’d enable debug logging. If that isn't an option, hopefully someone else who has seen something similar can comment.