High memory usage in a single distributor

AnomaloVince · December 2, 2025, 6:06pm

We’re using Loki for log aggregation across multiple EKS and AKS clusters, with highly dynamic workloads. We’re deployed in distributed/microservices mode with 6 distributors and 12 ingesters.

One of our workloads scales massively out (from 10-200 pods to 2400 pods) every night for roughly 1 hour. During this scale out period, we consistently see a single distributor (and often the same distributor when Loki pods haven’t been rotated during the day) with much higher memory usage than the other. A portion of requests to that distributor fail.

Requests are evenly distributed to all distributors (both in terms of lines per sec and MB per sec).

Ingesters’ CPU & memory is pretty consistent with no significant outliers.

Our metrics paint the picture:

Top level volume:

Write latency spikes:

Distributor metrics (note the single distributor with high memory):

Ingester metrics:

Relevant portions from our helm values:

loki:
  limits_config:
    allow_structured_metadata: true
    ingestion_rate_mb: 128
    ingestion_burst_size_mb: 256
    max_label_names_per_series: 20
    max_line_size_truncate: true  # truncate lines that exceed max_line_size (256KB)
    max_streams_per_user: 0 # Maximum number of active streams per user, per ingester. 0 to disable.
    max_global_streams_per_user: 10000 # Maximum number of active streams per user, across the cluster (default 5000)
    max_query_series: 2000
    retention_period: 365d
    shard_streams:
      enabled: true
      #logging_enabled: true
      desired_rate: 1536KB  # default is 1536KB

ingester:
  replicas: 12
  zoneAwareReplication:
    enabled: false
  extraEnv:
    - name: GOMEMLIMIT
      value: 5530MiB
  resources:
    requests:
      cpu: 1000m
      memory: 1024Mi
    limits:
      memory: 6Gi

distributor:
  replicas: 6
  maxUnavailable: 0
  extraEnv:
    - name: GOMEMLIMIT
      value: 3686MiB
  resources:
    requests:
      cpu: 500m
      memory: 1024Mi
    limits:
      memory: 4Gi

During yesterday evening’s spike, I captured memory profiles:

Pod with high memory usage:

$ go tool pprof heap-loki-distributor-859d589cb6-d9r4s
File: loki
Build ID: 31b687fab4421fbfcda4d40cf228489ad2cac73d
Type: inuse_space
Time: 2025-12-01 19:14:05 EST
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) text
Showing nodes accounting for 673MB, 89.98% of 747.97MB total
Dropped 186 nodes (cum <= 3.74MB)
Showing top 10 nodes out of 95
      flat  flat%   sum%        cum   cum%
  220.75MB 29.51% 29.51%   220.75MB 29.51%  google.golang.org/grpc/mem.NewTieredBufferPool.newSizedBufferPool.func1
  168.12MB 22.48% 51.99%   168.12MB 22.48%  github.com/grafana/loki/v3/pkg/distributor.(*Distributor).PushWithResolver.func3.2
  109.12MB 14.59% 66.58%   109.12MB 14.59%  github.com/grafana/loki/pkg/push.(*PushRequest).Marshal
   64.95MB  8.68% 75.26%    64.95MB  8.68%  go.opentelemetry.io/collector/pdata/internal/data/protogen/common/v1.(*AnyValue).Unmarshal
   37.47MB  5.01% 80.27%    37.47MB  5.01%  io.ReadAll
   23.58MB  3.15% 83.43%    68.59MB  9.17%  github.com/grafana/loki/v3/pkg/loghttp/push.otlpToLokiPushRequest
      14MB  1.87% 85.30%    20.50MB  2.74%  github.com/grafana/regexp.(*Regexp).ReplaceAllString
   13.50MB  1.81% 87.10%    13.50MB  1.81%  bytes.(*Buffer).String (inline)
   11.96MB  1.60% 88.70%    11.96MB  1.60%  github.com/grafana/loki/v3/pkg/distributor.(*Distributor).createShard
    9.54MB  1.28% 89.98%     9.54MB  1.28%  github.com/grafana/loki/v3/pkg/util/log.newPrometheusLogger.WithPrellocatedBuffer.func3

Pod with low memory usage:

File: loki
Build ID: 31b687fab4421fbfcda4d40cf228489ad2cac73d
Type: inuse_space
Time: 2025-12-01 19:15:21 EST
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) text
Showing nodes accounting for 56.73MB, 77.68% of 73.03MB total
Showing top 10 nodes out of 227
      flat  flat%   sum%        cum   cum%
   18.51MB 25.35% 25.35%    18.51MB 25.35%  github.com/grafana/loki/v3/pkg/distributor.(*Distributor).PushWithResolver.func3.2
    9.54MB 13.06% 38.41%     9.54MB 13.06%  github.com/grafana/loki/v3/pkg/util/log.newPrometheusLogger.WithPrellocatedBuffer.func3
    8.10MB 11.09% 49.50%     8.10MB 11.09%  google.golang.org/grpc/mem.NewTieredBufferPool.newSizedBufferPool.func1
    6.50MB  8.90% 58.40%     6.50MB  8.90%  go.opentelemetry.io/collector/pdata/internal/data/protogen/common/v1.(*AnyValue).Unmarshal
    3.50MB  4.79% 63.19%     3.50MB  4.79%  bytes.(*Buffer).String
    2.58MB  3.53% 66.72%     2.58MB  3.53%  github.com/grafana/loki/v3/pkg/distributor.(*Distributor).createShard
    2.50MB  3.42% 70.15%     2.50MB  3.42%  github.com/aws/aws-sdk-go/aws/endpoints.init
       2MB  2.74% 72.89%        2MB  2.74%  github.com/prometheus/prometheus/model/labels.New
       2MB  2.74% 75.63%        2MB  2.74%  github.com/hashicorp/golang-lru/v2/internal.(*LruList[go.shape.string,go.shape.struct { github.com/grafana/loki/v3/pkg/distributor.ls github.com/prometheus/prometheus/model/labels.Labels; github.com/grafana/loki/v3/pkg/distributor.hash uint64 }]).insertValue
    1.50MB  2.05% 77.68%     1.50MB  2.05%  github.com/IBM/ibm-cos-sdk-go/aws/endpoints.init

We’re using opentelemetry-collector to ship container logs to the otlp endpoint (though we have seen this behavior when we were shipping data with promtail to the push endpoint).

Does anyone have any advice on further troubleshooting? The fact the only a single distributor is dealing with high memory usage & high latency is a mystery to us.

tonyswumac · December 4, 2025, 8:57pm

Can you hit the /ring endpoint on your distributor and see if all distributors are listed?
Looks like you have shard_streams enabled already. It’s possible that sharding happens after distributor, but I am not 100% certain. I’d try changing desired_rate to a bigger value and see if that helps.

AnomaloVince · December 8, 2025, 12:41pm

Sharding indeed happens on the distributors.

We’re actually trying a slightly lower desired_rate. (And we’ve removed k8s node name as a stream label. We had temporarily added that, it was there was when I originally posted this.)

Unfortunately (or fortunately, depending on how you look at it), the workload that experiences these spikes has stopped having these scaling events. (They’re driven by our customers’ usage of our platform.) So, now we’re not seeing the single distributor with high memory usage.

Topic		Replies	Views
Loki Ingester continously crashing with high memory spikes Grafana Loki	2	233	April 1, 2026
Tips for troubleshooting Ingester pod memory imbalance when running Loki in distributed mode in Kubernetes Grafana Loki loki , performance	3	884	April 5, 2024
Pod "ingester" consumes too much memory Grafana Loki loki	2	162	April 30, 2026
Loki Distributor outbound traffic to ingester drop Grafana Loki loki	3	611	April 18, 2025
Ingester high memory usage Grafana Loki loki , configuration	2	2381	December 12, 2023

High memory usage in a single distributor

Related topics