We’re using Loki for log aggregation across multiple EKS and AKS clusters with highly dynamic workloads, deployed in distributed/microservices mode with 6 distributors and 12 ingesters.
One of our workloads scales out massively (from 10-200 pods to ~2400 pods) every night for roughly an hour. During this scale-out we consistently see a single distributor (often the same one, when Loki pods haven’t been rotated during the day) with much higher memory usage than the others, and a portion of requests to that distributor fail.
Requests are evenly distributed to all distributors (both in terms of lines per sec and MB per sec).
Ingesters’ CPU & memory is pretty consistent with no significant outliers.
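(The evenness claim is based on per-pod rate queries along these lines — metric names are from Loki’s distributor instrumentation, and we’re assuming your scrape config attaches a `pod` label:)

```promql
sum by (pod) (rate(loki_distributor_lines_received_total[5m]))
sum by (pod) (rate(loki_distributor_bytes_received_total[5m]))
```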
Our metrics paint the picture:
Top level volume:
Write latency spikes:
Distributor metrics (note the single distributor with high memory):
Ingester metrics:
Relevant portions from our helm values:
```yaml
loki:
  limits_config:
    allow_structured_metadata: true
    ingestion_rate_mb: 128
    ingestion_burst_size_mb: 256
    max_label_names_per_series: 20
    max_line_size_truncate: true      # truncate lines that exceed max_line_size (256KB)
    max_streams_per_user: 0           # maximum number of active streams per user, per ingester; 0 to disable
    max_global_streams_per_user: 10000  # maximum number of active streams per user, across the cluster (default 5000)
    max_query_series: 2000
    retention_period: 365d
    shard_streams:
      enabled: true
      # logging_enabled: true
      desired_rate: 1536KB            # default is 1536KB
ingester:
  replicas: 12
  zoneAwareReplication:
    enabled: false
  extraEnv:
    - name: GOMEMLIMIT
      value: 5530MiB
  resources:
    requests:
      cpu: 1000m
      memory: 1024Mi
    limits:
      memory: 6Gi
distributor:
  replicas: 6
  maxUnavailable: 0
  extraEnv:
    - name: GOMEMLIMIT
      value: 3686MiB
  resources:
    requests:
      cpu: 500m
      memory: 1024Mi
    limits:
      memory: 4Gi
```
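For context on the shard_streams setting above: with desired_rate: 1536KB, Loki splits a hot stream into roughly rate/desired_rate shards (a deliberate simplification of the actual calculation), so a single stream pushing ~10 MB/s during the nightly spike would land as around 7 shards:

```shell
# Approximate shard count for a hot stream (simplified as ceil(rate / desired_rate))
rate_kb=10240     # a stream pushing ~10 MB/s
desired_kb=1536   # shard_streams.desired_rate from the config above
echo $(( (rate_kb + desired_kb - 1) / desired_kb ))   # prints 7
```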
During yesterday evening’s spike, I captured memory profiles:
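(For reference, the profiles were grabbed roughly like this — assuming Loki’s default HTTP listen port 3100, where it exposes the standard Go pprof endpoints, and kubectl port-forward access; namespace and pod names are from our environment:)

```shell
# Port-forward the suspect distributor and snapshot its in-use heap
kubectl -n loki port-forward pod/loki-distributor-859d589cb6-d9r4s 3100:3100 &
curl -s http://localhost:3100/debug/pprof/heap -o heap-loki-distributor-859d589cb6-d9r4s
```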
Pod with high memory usage:
$ go tool pprof heap-loki-distributor-859d589cb6-d9r4s
File: loki
Build ID: 31b687fab4421fbfcda4d40cf228489ad2cac73d
Type: inuse_space
Time: 2025-12-01 19:14:05 EST
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) text
Showing nodes accounting for 673MB, 89.98% of 747.97MB total
Dropped 186 nodes (cum <= 3.74MB)
Showing top 10 nodes out of 95
flat flat% sum% cum cum%
220.75MB 29.51% 29.51% 220.75MB 29.51% google.golang.org/grpc/mem.NewTieredBufferPool.newSizedBufferPool.func1
168.12MB 22.48% 51.99% 168.12MB 22.48% github.com/grafana/loki/v3/pkg/distributor.(*Distributor).PushWithResolver.func3.2
109.12MB 14.59% 66.58% 109.12MB 14.59% github.com/grafana/loki/pkg/push.(*PushRequest).Marshal
64.95MB 8.68% 75.26% 64.95MB 8.68% go.opentelemetry.io/collector/pdata/internal/data/protogen/common/v1.(*AnyValue).Unmarshal
37.47MB 5.01% 80.27% 37.47MB 5.01% io.ReadAll
23.58MB 3.15% 83.43% 68.59MB 9.17% github.com/grafana/loki/v3/pkg/loghttp/push.otlpToLokiPushRequest
14MB 1.87% 85.30% 20.50MB 2.74% github.com/grafana/regexp.(*Regexp).ReplaceAllString
13.50MB 1.81% 87.10% 13.50MB 1.81% bytes.(*Buffer).String (inline)
11.96MB 1.60% 88.70% 11.96MB 1.60% github.com/grafana/loki/v3/pkg/distributor.(*Distributor).createShard
9.54MB 1.28% 89.98% 9.54MB 1.28% github.com/grafana/loki/v3/pkg/util/log.newPrometheusLogger.WithPrellocatedBuffer.func3
Pod with low memory usage:
File: loki
Build ID: 31b687fab4421fbfcda4d40cf228489ad2cac73d
Type: inuse_space
Time: 2025-12-01 19:15:21 EST
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) text
Showing nodes accounting for 56.73MB, 77.68% of 73.03MB total
Showing top 10 nodes out of 227
flat flat% sum% cum cum%
18.51MB 25.35% 25.35% 18.51MB 25.35% github.com/grafana/loki/v3/pkg/distributor.(*Distributor).PushWithResolver.func3.2
9.54MB 13.06% 38.41% 9.54MB 13.06% github.com/grafana/loki/v3/pkg/util/log.newPrometheusLogger.WithPrellocatedBuffer.func3
8.10MB 11.09% 49.50% 8.10MB 11.09% google.golang.org/grpc/mem.NewTieredBufferPool.newSizedBufferPool.func1
6.50MB 8.90% 58.40% 6.50MB 8.90% go.opentelemetry.io/collector/pdata/internal/data/protogen/common/v1.(*AnyValue).Unmarshal
3.50MB 4.79% 63.19% 3.50MB 4.79% bytes.(*Buffer).String
2.58MB 3.53% 66.72% 2.58MB 3.53% github.com/grafana/loki/v3/pkg/distributor.(*Distributor).createShard
2.50MB 3.42% 70.15% 2.50MB 3.42% github.com/aws/aws-sdk-go/aws/endpoints.init
2MB 2.74% 72.89% 2MB 2.74% github.com/prometheus/prometheus/model/labels.New
2MB 2.74% 75.63% 2MB 2.74% github.com/hashicorp/golang-lru/v2/internal.(*LruList[go.shape.string,go.shape.struct { github.com/grafana/loki/v3/pkg/distributor.ls github.com/prometheus/prometheus/model/labels.Labels; github.com/grafana/loki/v3/pkg/distributor.hash uint64 }]).insertValue
1.50MB 2.05% 77.68% 1.50MB 2.05% github.com/IBM/ibm-cos-sdk-go/aws/endpoints.init
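Since both pods run the same binary (identical build ID), the two captures can also be diffed directly with pprof’s -base flag, which subtracts the healthy pod’s profile and leaves only the delta (the healthy pod’s filename here is a placeholder):

```shell
# Subtract the healthy pod's heap from the hot pod's; then `text` at the
# (pprof) prompt shows only the allocations unique to the hot distributor
go tool pprof -base heap-<healthy-distributor-pod> heap-loki-distributor-859d589cb6-d9r4s
```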
We’re using the opentelemetry-collector to ship container logs to Loki’s OTLP endpoint, though we also saw this behavior when we were shipping with Promtail to the push endpoint.
Does anyone have advice on further troubleshooting? The fact that only a single distributor sees high memory usage and high latency is a mystery to us.