Tips for troubleshooting Ingester pod memory imbalance when running Loki in distributed mode in Kubernetes

I’m running into an issue where one replica in my Loki Ingester replica set uses far more memory than the others. The per-pod memory imbalance can be seen in the attached screenshot.

I am running Loki in distributed mode in an EKS Kubernetes cluster, with a dynamically scaling number of Distributor pods and a fixed number of Ingester pods, using the default memberlist key-value store. My suspicion is that one (or several) of my streams is producing orders of magnitude more data than the rest, and that those streams happen to be mapped to the problematic replica (loki-ingester-21 in this case). The problematic replica holds an average number of streams (around 700, with most other replicas in the 500-1000 range), but its number of chunks in memory and total bytes held in chunks are far higher than the others'.
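For reference, this is roughly how I have been pulling those per-pod numbers out of Prometheus instead of eyeballing the dashboard. It is only a sketch: the Prometheus URL is a placeholder, and the metric names (loki_ingester_memory_streams / loki_ingester_memory_chunks; older Loki builds exported these under a cortex_ prefix) should be checked against what your ingesters' /metrics endpoint actually exposes.

```python
# Sketch: compare per-pod ingester metrics from Prometheus to spot the outlier.
# Assumptions: PROM_URL points at your Prometheus, and your Loki version exports
# loki_ingester_memory_streams / loki_ingester_memory_chunks (older releases used
# a cortex_ prefix); adjust names to whatever your ingester's /metrics page shows.
import json
import urllib.parse
import urllib.request

PROM_URL = "http://prometheus.monitoring.svc:9090"  # placeholder in-cluster service name

QUERIES = {
    "streams": "loki_ingester_memory_streams",
    "chunks": "loki_ingester_memory_chunks",
}

def instant_query(expr):
    """Run a PromQL instant query and return the result vector."""
    url = f"{PROM_URL}/api/v1/query?" + urllib.parse.urlencode({"query": expr})
    with urllib.request.urlopen(url) as resp:
        body = json.load(resp)
    return body["data"]["result"]

for name, expr in QUERIES.items():
    print(f"--- {name} per pod ---")
    results = instant_query(expr)
    # Sort descending so the misbehaving ingester shows up first.
    results.sort(key=lambda r: float(r["value"][1]), reverse=True)
    for r in results:
        pod = r["metric"].get("pod") or r["metric"].get("instance", "?")
        print(f"{pod:30s} {float(r['value'][1]):>15,.0f}")
```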

What are some ways I can troubleshoot memory/chunk imbalance issues?
Is it possible to view which label-sets are going to which pods, so that we can determine whether one type of log has a much higher volume than the others? (One way I have tried to approximate this is sketched after these questions.)
Is there a way I can rebalance the pods so that these chunks can be shared amongst all the other pods? If so, how?
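To expand on the second question, the closest I have gotten to "which label-sets are heaviest" is a LogQL bytes_over_time query against the query API, then mentally mapping the heavy streams to the pod from the dashboard. The URL, tenant header, base selector, and grouping labels below are placeholders for my setup; newer Loki releases also have a volume endpoint and `logcli volume`, if your version is recent enough to include them.

```python
# Sketch: rank label-sets by ingested bytes using Loki's query API and the LogQL
# bytes_over_time function. LOKI_URL, TENANT, the {cluster="prod"} base selector,
# and the grouping labels are all assumptions; adjust them to your deployment.
import json
import urllib.parse
import urllib.request

LOKI_URL = "http://loki-query-frontend.loki.svc:3100"   # placeholder
TENANT = "fake"                                          # placeholder single-tenant org id

# Top 10 label-sets by ingested bytes over the last hour, grouped by a few labels.
LOGQL = 'topk(10, sum by (namespace, app) (bytes_over_time({cluster="prod"}[1h])))'

url = f"{LOKI_URL}/loki/api/v1/query?" + urllib.parse.urlencode({"query": LOGQL})
req = urllib.request.Request(url, headers={"X-Scope-OrgID": TENANT})
with urllib.request.urlopen(req) as resp:
    data = json.load(resp)["data"]["result"]

for sample in sorted(data, key=lambda s: float(s["value"][1]), reverse=True):
    mb = float(sample["value"][1]) / 1e6
    print(f"{sample['metric']}  ->  {mb:,.1f} MB over the last hour")
```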

I am facing a similar issue. Have you made any progress on this?

I did run into this once as well. My understanding is that, once the ring membership is formed, the writers will try to send log streams with the same labels to the same writer to reduce the number of chunks written, and I do not know whether there is any re-balancing logic for this (I need to read through the code eventually). Perhaps someone can comment on this.
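To make that idea concrete, here is a highly simplified consistent-hashing sketch. It is an illustration of the general technique, not Loki's actual ring or hash implementation: the point is only that a given label-set always hashes to the same ingester, so one very chatty stream can inflate a single pod's chunk memory.

```python
# Simplified sketch of consistent hashing: hash each stream's (tenant, sorted
# label-set) to a token and walk the ring to pick an ingester, so a given
# label-set always lands on the same pod. Not Loki's actual implementation.
import bisect
import hashlib

INGESTERS = [f"loki-ingester-{i}" for i in range(24)]
TOKENS_PER_INGESTER = 128

def token(value):
    """Map an arbitrary string to a 32-bit ring position."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:4], "big")

# Build the ring: each ingester owns many pseudo-random tokens.
ring = sorted(
    (token(f"{name}/{i}"), name)
    for name in INGESTERS
    for i in range(TOKENS_PER_INGESTER)
)
ring_tokens = [t for t, _ in ring]

def ingester_for_stream(tenant, labels):
    """Hash tenant + sorted labels and return the owning ingester."""
    key = tenant + "".join(f"{k}={v}" for k, v in sorted(labels.items()))
    idx = bisect.bisect(ring_tokens, token(key)) % len(ring)
    return ring[idx][1]

# Every push for this exact label-set goes to the same ingester, so one very
# high-volume stream pins its chunks to that single pod.
print(ingester_for_stream("tenant-a", {"app": "api", "namespace": "prod", "level": "info"}))
```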

With that in mind, I killed that one container manually, and the new one that replaced it seemed much more in line with the rest in terms of resource consumption.
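For anyone else landing here, this is roughly what that looked like expressed with the official kubernetes Python client; a plain `kubectl delete pod loki-ingester-21 -n <namespace>` does the same thing, and the pod and namespace names are specific to my cluster. If your Loki version exposes a flush endpoint on the ingester (check the HTTP API docs for your version), it may be worth hitting that first so in-memory chunks are flushed before the pod goes away.

```python
# Sketch, assuming the official `kubernetes` Python client and a working kubeconfig:
# delete the hot ingester pod so the StatefulSet controller recreates it and the
# replacement rejoins the ring. POD and NAMESPACE are placeholders for my cluster.
from kubernetes import client, config

NAMESPACE = "loki"              # placeholder
POD = "loki-ingester-21"        # the outlier pod from the screenshot

config.load_kube_config()       # or config.load_incluster_config() inside the cluster
v1 = client.CoreV1Api()

# Graceful delete; the ingester gets its termination grace period to flush chunks.
v1.delete_namespaced_pod(name=POD, namespace=NAMESPACE)
print(f"deleted {POD}; the StatefulSet controller will recreate it")
```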
