High memory usage in a single distributor

We’re using Loki for log aggregation across multiple EKS and AKS clusters, with highly dynamic workloads. We’re deployed in distributed/microservices mode with 6 distributors and 12 ingesters.

One of our workloads scales out massively (from 10–200 pods to roughly 2,400 pods) every night for about an hour. During this scale-out period, we consistently see a single distributor (often the same one, when Loki pods haven’t been rotated during the day) with much higher memory usage than the others. A portion of requests to that distributor fail.

Requests are evenly distributed to all distributors (both in terms of lines per sec and MB per sec).
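We verified this with per-pod rate queries along these lines (metric names are Loki’s standard distributor metrics; the label used for grouping may differ depending on your relabeling):

```promql
# Bytes per second received, per distributor pod
sum by (pod) (rate(loki_distributor_bytes_received_total[5m]))

# Lines per second received, per distributor pod
sum by (pod) (rate(loki_distributor_lines_received_total[5m]))
```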

Ingesters’ CPU & memory is pretty consistent with no significant outliers.

Our metrics paint the picture:

Top-level volume:

Write latency spikes:

Distributor metrics (note the single distributor with high memory):

Ingester metrics:

Relevant portions from our helm values:

loki:
  limits_config:
    allow_structured_metadata: true
    ingestion_rate_mb: 128
    ingestion_burst_size_mb: 256
    max_label_names_per_series: 20
    max_line_size_truncate: true  # truncate lines that exceed max_line_size (256KB)
    max_streams_per_user: 0 # Maximum number of active streams per user, per ingester. 0 to disable.
    max_global_streams_per_user: 10000 # Maximum number of active streams per user, across the cluster (default 5000)
    max_query_series: 2000
    retention_period: 365d
    shard_streams:
      enabled: true
      #logging_enabled: true
      desired_rate: 1536KB  # default is 1536KB

ingester:
  replicas: 12
  zoneAwareReplication:
    enabled: false
  extraEnv:
    - name: GOMEMLIMIT
      value: 5530MiB
  resources:
    requests:
      cpu: 1000m
      memory: 1024Mi
    limits:
      memory: 6Gi

distributor:
  replicas: 6
  maxUnavailable: 0
  extraEnv:
    - name: GOMEMLIMIT
      value: 3686MiB
  resources:
    requests:
      cpu: 500m
      memory: 1024Mi
    limits:
      memory: 4Gi

During yesterday evening’s spike, I captured memory profiles:

Pod with high memory usage:

$ go tool pprof heap-loki-distributor-859d589cb6-d9r4s
File: loki
Build ID: 31b687fab4421fbfcda4d40cf228489ad2cac73d
Type: inuse_space
Time: 2025-12-01 19:14:05 EST
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) text
Showing nodes accounting for 673MB, 89.98% of 747.97MB total
Dropped 186 nodes (cum <= 3.74MB)
Showing top 10 nodes out of 95
      flat  flat%   sum%        cum   cum%
  220.75MB 29.51% 29.51%   220.75MB 29.51%  google.golang.org/grpc/mem.NewTieredBufferPool.newSizedBufferPool.func1
  168.12MB 22.48% 51.99%   168.12MB 22.48%  github.com/grafana/loki/v3/pkg/distributor.(*Distributor).PushWithResolver.func3.2
  109.12MB 14.59% 66.58%   109.12MB 14.59%  github.com/grafana/loki/pkg/push.(*PushRequest).Marshal
   64.95MB  8.68% 75.26%    64.95MB  8.68%  go.opentelemetry.io/collector/pdata/internal/data/protogen/common/v1.(*AnyValue).Unmarshal
   37.47MB  5.01% 80.27%    37.47MB  5.01%  io.ReadAll
   23.58MB  3.15% 83.43%    68.59MB  9.17%  github.com/grafana/loki/v3/pkg/loghttp/push.otlpToLokiPushRequest
      14MB  1.87% 85.30%    20.50MB  2.74%  github.com/grafana/regexp.(*Regexp).ReplaceAllString
   13.50MB  1.81% 87.10%    13.50MB  1.81%  bytes.(*Buffer).String (inline)
   11.96MB  1.60% 88.70%    11.96MB  1.60%  github.com/grafana/loki/v3/pkg/distributor.(*Distributor).createShard
    9.54MB  1.28% 89.98%     9.54MB  1.28%  github.com/grafana/loki/v3/pkg/util/log.newPrometheusLogger.WithPrellocatedBuffer.func3

Pod with low memory usage:

File: loki
Build ID: 31b687fab4421fbfcda4d40cf228489ad2cac73d
Type: inuse_space
Time: 2025-12-01 19:15:21 EST
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) text
Showing nodes accounting for 56.73MB, 77.68% of 73.03MB total
Showing top 10 nodes out of 227
      flat  flat%   sum%        cum   cum%
   18.51MB 25.35% 25.35%    18.51MB 25.35%  github.com/grafana/loki/v3/pkg/distributor.(*Distributor).PushWithResolver.func3.2
    9.54MB 13.06% 38.41%     9.54MB 13.06%  github.com/grafana/loki/v3/pkg/util/log.newPrometheusLogger.WithPrellocatedBuffer.func3
    8.10MB 11.09% 49.50%     8.10MB 11.09%  google.golang.org/grpc/mem.NewTieredBufferPool.newSizedBufferPool.func1
    6.50MB  8.90% 58.40%     6.50MB  8.90%  go.opentelemetry.io/collector/pdata/internal/data/protogen/common/v1.(*AnyValue).Unmarshal
    3.50MB  4.79% 63.19%     3.50MB  4.79%  bytes.(*Buffer).String
    2.58MB  3.53% 66.72%     2.58MB  3.53%  github.com/grafana/loki/v3/pkg/distributor.(*Distributor).createShard
    2.50MB  3.42% 70.15%     2.50MB  3.42%  github.com/aws/aws-sdk-go/aws/endpoints.init
       2MB  2.74% 72.89%        2MB  2.74%  github.com/prometheus/prometheus/model/labels.New
       2MB  2.74% 75.63%        2MB  2.74%  github.com/hashicorp/golang-lru/v2/internal.(*LruList[go.shape.string,go.shape.struct { github.com/grafana/loki/v3/pkg/distributor.ls github.com/prometheus/prometheus/model/labels.Labels; github.com/grafana/loki/v3/pkg/distributor.hash uint64 }]).insertValue
    1.50MB  2.05% 77.68%     1.50MB  2.05%  github.com/IBM/ibm-cos-sdk-go/aws/endpoints.init

We’re using the opentelemetry-collector to ship container logs to Loki’s OTLP endpoint (though we saw the same behavior previously, when we shipped with promtail to the push endpoint).
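For context, the collector pipeline is essentially the stock `otlphttp` exporter pointed at the Loki gateway; the sketch below uses a placeholder endpoint, not our actual service name (the exporter appends `/v1/logs` to the `/otlp` base path):

```yaml
# Sketch only — endpoint is a placeholder for your Loki gateway service.
exporters:
  otlphttp:
    endpoint: http://loki-gateway.loki.svc.cluster.local/otlp

service:
  pipelines:
    logs:
      exporters: [otlphttp]
```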

Does anyone have advice on further troubleshooting? The fact that only a single distributor shows high memory usage and high latency is a mystery to us.

  1. Can you hit the /ring endpoint on your distributor and see if all distributors are listed?
  2. Looks like you have shard_streams enabled already. It’s possible that sharding happens after the distributor, but I’m not 100% certain. I’d try raising desired_rate to a larger value and see if that helps.
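For item 2, that would be a small change to the limits_config you already posted — for example, doubling desired_rate (the value below is illustrative, not a recommendation):

```yaml
loki:
  limits_config:
    shard_streams:
      enabled: true
      desired_rate: 3072KB  # e.g. double the 1536KB default
```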