Loki Ingester continuously crashing with high memory spikes

Hello,
I have deployed Loki in the distributed deployment mode.
I am running 9 Loki ingester replicas and sending load to them through an OTEL Collector. When ingester memory spikes, one of the ingesters (ingester-8) cannot handle the spike, exceeds its memory limit, and crashes. I have stopped sending load now, but that ingester is not able to recover and keeps throwing errors related to one of the index tables.

Loki Helm chart version: 6.23.0
Kubernetes version: 1.29.4

Ingester-8 logs:

level=info ts=2025-03-28T10:44:49.588270249Z caller=table_manager.go:136 index-store=tsdb-2024-12-01 msg="uploading tables"
level=info ts=2025-03-28T10:44:49.588407347Z caller=index_set.go:86 msg="uploading table loki_index_20175"
level=info ts=2025-03-28T10:48:23.298888098Z caller=index_set.go:107 msg="finished uploading table loki_index_20175"
level=info ts=2025-03-28T10:48:23.298916594Z caller=index_set.go:186 msg="cleaning up unwanted indexes from table loki_index_20175"

These log lines keep repeating and building up.

I have 2 compactors, which are producing these logs:

level=info ts=2025-03-28T07:46:30.961093011Z caller=compactor.go:774 msg="finished compacting table" table-name=loki_index_20175
level=info ts=2025-03-28T07:47:26.686529263Z caller=marker.go:202 msg="no marks file found"

This is the memory usage graph of Loki:

I do not see issues in any other component of the pipeline; only the ingester-8 pod is crashing.
I have followed the best-practices documentation for configuring Loki.
Please suggest how we can recover from this situation.

Loki Helm Values.yaml:

backend:
  replicas: 0
compactor:
  enabled: true
  extraArgs:
    - '-config.expand-env=true'
  kind: StatefulSet
  persistence:
    claims:
      - name: data
        size: 10Gi
        storageClass: loki-storage
    enabled: true
    enableStatefulSetAutoDeletePVC: false
    size: 10Gi
    storageClass: loki-storage
    whenDeleted: Retain
    whenScaled: Retain
  podAnnotations: null
  podLabels: null
  replicas: 2
  resources:
    limits:
      cpu: 4
      memory: 16Gi
    requests:
      cpu: 500m
      memory: 1Gi
  serviceAccount:
    automountServiceAccountToken: true
    create: false
    name: lokisa
  terminationGracePeriodSeconds: 120
chunksCache:
  enabled: false
distributor:
  autoscaling:
    enabled: false
  extraArgs:
    - '-config.expand-env=true'
  maxUnavailable: 2
  replicas: 2
  resources:
    limits:
      cpu: 8
      memory: 16Gi
    requests:
      cpu: 100m
      memory: 200Mi
  terminationGracePeriodSeconds: 30
deploymentMode: Distributed
enterprise:
  enabled: false
extraObjects:
  - apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      name: loki-storage
    provisioner: ebs.csi.aws.com
    allowVolumeExpansion: true
    parameters:
      allowAutoIOPSPerGBIncrease: 'true'
      encrypted: 'true'
      iops: '16000'
      throughput: '500'
      type: 'gp3'
    reclaimPolicy: Retain
    volumeBindingMode: WaitForFirstConsumer
fullnameOverride: loki
gateway:
  enabled: false
global:
  image:
    registry: null
indexGateway:
  enabled: true
  extraArgs:
    - '-config.expand-env=true'
  joinMemberlist: true
  maxUnavailable: 2
  persistence:
    enabled: true
    enableStatefulSetAutoDeletePVC: false
    inMemory: false
    size: 10Gi
    storageClass: loki-storage
    whenDeleted: Retain
    whenScaled: Retain
  podManagementPolicy: Parallel
  replicas: 2
  resources:
    limits:
      cpu: 4
      memory: 16Gi
    requests:
      cpu: 500m
      memory: 1Gi
  terminationGracePeriodSeconds: 300
ingester:
  autoscaling:
    enabled: false
  extraArgs:
    - '-config.expand-env=true'
  kind: StatefulSet
  maxUnavailable: 2
  persistence:
    claims:
      - name: data
        size: 100Gi
        storageClass: loki-storage
    enabled: true
    enableStatefulSetAutoDeletePVC: false
    inMemory: false
    whenDeleted: Retain
    whenScaled: Retain
  replicas: 3
  resources:
    limits:
      cpu: 4
      memory: 16Gi
    requests:
      cpu: 500m
      memory: 1Gi
  terminationGracePeriodSeconds: 300
  zoneAwareReplication:
    enabled: false
ingress:
  enabled: false
loki:
  auth_enabled: false
  commonConfig:
    path_prefix: /var/loki
    replication_factor: 3
    compactor_address: '{{ include "loki.compactorAddress" . }}'
  compactor:
    compaction_interval: 10m
    delete_batch_size: 100
    delete_request_store: s3
    max_compaction_parallelism: 10
    retention_enabled: true
    retention_delete_delay: 1h
    retention_delete_worker_count: 150
    retention_table_timeout: 30m
    working_directory: /var/loki/retention
  configStorageType: ConfigMap
  ingester:
    chunk_block_size: 262144
    chunk_encoding: snappy
    chunk_idle_period: 30m
    chunk_target_size: 1572864
    concurrent_flushes: 128
    flush_check_period: 10s
    flush_op_backoff:
      min_period: 10s
      max_period: 1m
      max_retries: 10
    max_chunk_age: 2h
    wal:
      enabled: true
      flush_on_shutdown: true
      replay_memory_ceiling: 8GB
  limits_config:
    ingestion_rate_mb: 1024
    ingestion_burst_size_mb: 2048
    retention_period: 24h
  livenessProbe:
    httpGet:
      path: /ready
      port: http-metrics
    initialDelaySeconds: 900
  memcached:
    chunk_cache:
      enabled: false
    results_cache:
      enabled: false
  podAnnotations:
    prometheus.io.scrape: "true"
    prometheus.io.port: "3100"
    prometheus.io.path: "/metrics"
  podLabels: null
  readinessProbe:
    httpGet:
      path: /ready
      port: http-metrics
    initialDelaySeconds: 600
    timeoutSeconds: 5
  revisionHistoryLimit: 2
  schemaConfig:
    configs:
      - from: "2024-12-01"
        store: tsdb
        object_store: s3
        schema: v13
        index:
          prefix: loki_index_
          period: 24h
        chunks:
          prefix: loki_chunks_
          period: 24h
  storage:
    bucketNames:
    s3:
      signature_version: v4
    type: s3
  storage_config:
    tsdb_shipper:
      active_index_directory: /var/loki/index
      cache_location: /var/loki/index_cache
      cache_ttl: 24h
      index_gateway_client:
        server_address: '{{ include "loki.indexGatewayAddress" . }}'
  structuredConfig: { }
lokiCanary:
  enabled: false
memcached:
memcachedExporter:
  enabled: false
minio:
  enabled: false
networkPolicy:
  enabled: false
prometheusRule:
  enabled: false
querier:
  autoscaling:
    enabled: false
  extraArgs:
    - '-config.expand-env=true'
  maxUnavailable: 2
  persistence:
    enabled: true
    size: 10Gi
    storageClass: loki-storage
  replicas: 2
  resources:
    limits:
      cpu: 8
      memory: 16Gi
    requests:
      cpu: 500m
      memory: 1Gi
  terminationGracePeriodSeconds: 120
queryFrontend:
  autoscaling:
    enabled: false
  extraArgs:
    - '-config.expand-env=true'
  maxUnavailable: 2
  replicas: 2
  resources:
    limits:
      cpu: 8
      memory: 16Gi
    requests:
      cpu: 100m
      memory: 200Mi
  terminationGracePeriodSeconds: 30
queryScheduler:
  enabled: true
  extraArgs:
    - '-config.expand-env=true'
  replicas: 2
  resources:
    limits:
      cpu: 8
      memory: 16Gi
    requests:
      cpu: 100m
      memory: 200Mi
  terminationGracePeriodSeconds: 30
read:
  replicas: 0
resultsCache:
  enabled: false
rollout_operator:
  enabled: false
ruler:
  enabled: false
serviceAccount:
  automountServiceAccountToken: true
  create: false
  name: lokisa
serviceMonitor:
  enabled: false
sidecar:
  rules:
    enabled: false
tableManager:
  enabled: false
test:
  enabled: false
write:
  replicas: 0

The following screenshot shows other Loki ingester metrics from around the time of the crash:

First, I would confirm your ingesters are actually forming a cluster. You can check this by hitting the /ring API and seeing whether all members are returned.

Second, if you have uneven log streams (some large, some small), you may want to consider enabling automatic stream sharding; see the sketch below. See "Manage large volume log streams with automatic stream sharding" in the Grafana Loki documentation.
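
For reference, a minimal sketch of what that could look like in your Helm values, assuming the chart passes loki.limits_config through to the Loki config; the desired_rate value here is illustrative, not a recommendation:

loki:
  limits_config:
    # existing limits (ingestion_rate_mb, retention_period, ...) stay as-is
    shard_streams:
      enabled: true
      # target per-shard ingest rate; streams pushing more than this
      # get split into multiple shards instead of staying one hot stream
      desired_rate: 1536KB
      # log whenever a stream is sharded, useful while tuning
      logging_enabled: true

With sharding enabled, a single very large stream gets spread across several ingesters rather than pinning all of its memory on one pod, which is the pattern you seem to be seeing with ingester-8.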
