In two different clusters, our Loki write pods go to a 100% error rate, and the pods' memory usage appears to climb along with it. We are running on EKS with S3 backends.
The pods themselves log this:
level=warn ts=2023-06-09T01:54:15.351501857Z caller=logging.go:86 traceID=7e4ab3ddf089a6e7 orgID=fake msg="POST /loki/api/v1/push (500) 5.000473365s Response: \"rpc error: code = DeadlineExceeded desc = context deadline exceeded\\n\" ws: false; Content-Length: 1757; Content-Type: application/x-protobuf; L5d-Client-Id: loki-sa.loki.serviceaccount.identity.linkerd.cluster.local; User-Agent: promtail/2.8.2; "
These are small, very low-usage clusters, so I am confused about why this happens once a week. The only fix so far has been to kill the pods.
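To pin down when the errors start, and whether that lines up with the memory growth, the failed pushes can be pulled out of the write pod logs. The following is only a minimal sketch: it assumes every request line has the same shape as the one quoted above (`msg="METHOD PATH (STATUS) DURATIONs ..."`), and the function names `parse_push_line` and `failed_pushes` are made up for illustration, not part of Loki or Promtail.

```python
import re

# Request summary inside msg="..." as it appears in the quoted log line:
#   METHOD PATH (STATUS) DURATIONs
# Assumption: the duration is printed in plain seconds (e.g. 5.000473365s).
# Lines with other units (e.g. 12.3ms) do not match and are skipped.
REQUEST_RE = re.compile(
    r'msg="(?P<method>[A-Z]+) (?P<path>\S+) '
    r'\((?P<status>\d{3})\) (?P<seconds>\d+(?:\.\d+)?)s'
)
TS_RE = re.compile(r'\bts=(?P<ts>\S+)')


def parse_push_line(line):
    """Return ts, method, path, status and seconds for one log line, or None."""
    req = REQUEST_RE.search(line)
    if req is None:
        return None
    ts = TS_RE.search(line)
    return {
        "ts": ts.group("ts") if ts else None,
        "method": req.group("method"),
        "path": req.group("path"),
        "status": int(req.group("status")),
        "seconds": float(req.group("seconds")),
    }


def failed_pushes(lines):
    """Yield parsed entries for push requests that returned a 5xx status."""
    for line in lines:
        entry = parse_push_line(line)
        if entry and entry["path"] == "/loki/api/v1/push" and entry["status"] >= 500:
            yield entry
```

Feeding the output of `kubectl logs` for a write pod through `failed_pushes` gives the timestamp and duration of each failed push. In the line quoted above the request ran for almost exactly 5 seconds before returning `DeadlineExceeded`, which may mean a timeout is being hit rather than the request being rejected immediately.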