I’m experiencing an issue where one replica in my Loki Ingester replica set uses far more memory than the others. The per-pod memory imbalance can be seen in the attached screenshot.
I am running Loki in distributed mode in an EKS Kubernetes cluster, with a dynamically scaling number of Distributor pods and a fixed number of Ingester pods, using the default memberlist key-value store. My suspicion is that one (or several) of my streams is producing orders of magnitude more data than the rest, and that those streams happen to hash to the problematic replica (loki-ingester-21 in this case). The problematic replica holds an unremarkable number of streams (~700, with most replicas falling in the 500–1000 range), but its in-memory chunk count and total bytes stored in chunks are far higher than any other replica's (the queries I used to compare these are shown below).
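For reference, these are the Prometheus queries I used to compare the pods. I'm assuming the default Loki metric names (`loki_ingester_memory_streams`, `loki_ingester_memory_chunks`) and the usual kube-prometheus relabeling that attaches a `pod` label; the `container="ingester"` selector just reflects how my pods are named:

```promql
# Streams currently held per ingester pod (~700 on the hot pod, 500-1000 elsewhere)
sum by (pod) (loki_ingester_memory_streams)

# Chunks currently held in memory per ingester pod
sum by (pod) (loki_ingester_memory_chunks)

# Actual working-set memory per ingester pod, from cAdvisor
container_memory_working_set_bytes{pod=~"loki-ingester-.*", container="ingester"}
```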
What are some ways I can troubleshoot memory/chunk imbalance issues?
Is it possible to view which label sets are going to which pods, so that we can determine whether one type of log has a much higher volume than the others?
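The closest thing I've found so far is a LogQL volume query, which shows per-label volume but not which pod a stream hashes to. Here's a sketch of what I've been running; the `app` label and `cluster` selector are placeholders for my own labels, so treat it as an approximation:

```logql
# Top 10 apps by bytes ingested over the last 5 minutes
topk(10, sum by (app) (bytes_over_time({cluster="my-cluster"}[5m])))
```

As far as I can tell, the `/ring` page on the distributor only shows token ownership per ingester, not which streams hash where.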
Is there a way I can rebalance the pods so that these chunks are spread across all the other pods instead of concentrating on one? If so, how?
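One option I came across is Loki's automatic stream sharding, which (if I understand it correctly) splits a high-rate stream across multiple ingesters by adding a `__stream_shard__` label. Here's a sketch of what I think the config would look like, assuming Loki 2.6+ where this feature exists; the `desired_rate` value is a guess on my part:

```yaml
limits_config:
  shard_streams:
    enabled: true
    logging_enabled: true  # log whenever a stream gets sharded
    desired_rate: 1536KB   # target per-shard ingest rate; not sure what's sensible here
```

Would this actually help spread an existing hot stream across ingesters, or does it only apply to newly ingested data?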