Hello
I have deployed Loki in distributed deployment mode.
I am running 9 Loki ingester replicas and sending load to them through an OTEL Collector. When ingester memory spikes, one of the ingesters cannot handle the spike: it exceeds its memory limit and the pod (ingester-8) crashes. I have stopped sending any load now, but that ingester is unable to recover and keeps logging errors related to one of the index tables.
Loki Helm Version: 6.23.0
Kubernetes Version: 1.29.4
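For context, the OTEL Collector mentioned above pushes logs to Loki over OTLP. The exporter side looks roughly like the sketch below (the endpoint and service name are illustrative; the real pipeline has more receivers and processors):

receivers:
  otlp:
    protocols:
      grpc:
exporters:
  otlphttp:
    # illustrative endpoint; points at the Loki distributor's OTLP ingestion path
    endpoint: http://loki-distributor:3100/otlp
service:
  pipelines:
    logs:
      receivers: [otlp]
      exporters: [otlphttp]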
Ingester-8 logs:
level=info ts=2025-03-28T10:44:49.588270249Z caller=table_manager.go:136 index-store=tsdb-2024-12-01 msg="uploading tables"
level=info ts=2025-03-28T10:44:49.588407347Z caller=index_set.go:86 msg="uploading table loki_index_20175"
level=info ts=2025-03-28T10:48:23.298888098Z caller=index_set.go:107 msg="finished uploading table loki_index_20175"
level=info ts=2025-03-28T10:48:23.298916594Z caller=index_set.go:186 msg="cleaning up unwanted indexes from table loki_index_20175"
These log lines keep repeating for that table.
I have 2 compactors, which are producing these logs:
level=info ts=2025-03-28T07:46:30.961093011Z caller=compactor.go:774 msg="finished compacting table" table-name=loki_index_20175
level=info ts=2025-03-28T07:47:26.686529263Z caller=marker.go:202 msg="no marks file found"
This is the memory usage graph of Loki:
I do not see any issues in any other component of the pipeline; only the ingester-8 pod is crashing.
I have followed the best practices documentation for configuring Loki.
Please suggest how we can recover from this situation.
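For quick reference, these are the memory-related ingester settings, copied verbatim from the full values.yaml below:

ingester:
  resources:
    limits:
      memory: 16Gi
    requests:
      memory: 1Gi
loki:
  ingester:
    chunk_idle_period: 30m
    max_chunk_age: 2h
    concurrent_flushes: 128
    wal:
      enabled: true
      flush_on_shutdown: true
      replay_memory_ceiling: 8GB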
Loki Helm Values.yaml:
backend:
  replicas: 0
compactor:
  enabled: true
  extraArgs:
    - '-config.expand-env=true'
  kind: StatefulSet
  persistence:
    claims:
      - name: data
        size: 10Gi
        storageClass: loki-storage
    enabled: true
    enableStatefulSetAutoDeletePVC: false
    size: 10Gi
    storageClass: loki-storage
    whenDeleted: Retain
    whenScaled: Retain
  podAnnotations: null
  podLabels: null
  replicas: 2
  resources:
    limits:
      cpu: 4
      memory: 16Gi
    requests:
      cpu: 500m
      memory: 1Gi
  serviceAccount:
    automountServiceAccountToken: true
    create: false
    name: lokisa
  terminationGracePeriodSeconds: 120
chunksCache:
  enabled: false
distributor:
  autoscaling:
    enabled: false
  extraArgs:
    - '-config.expand-env=true'
  maxUnavailable: 2
  replicas: 2
  resources:
    limits:
      cpu: 8
      memory: 16Gi
    requests:
      cpu: 100m
      memory: 200Mi
  terminationGracePeriodSeconds: 30
deploymentMode: Distributed
enterprise:
  enabled: false
extraObjects:
  - apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      name: loki-storage
    provisioner: ebs.csi.aws.com
    allowVolumeExpansion: true
    parameters:
      allowAutoIOPSPerGBIncrease: 'true'
      encrypted: 'true'
      iops: '16000'
      throughput: '500'
      type: 'gp3'
    reclaimPolicy: Retain
    volumeBindingMode: WaitForFirstConsumer
fullnameOverride: loki
gateway:
  enabled: false
global:
  image:
    registry: null
indexGateway:
  enabled: true
  extraArgs:
    - '-config.expand-env=true'
  joinMemberlist: true
  maxUnavailable: 2
  persistence:
    enabled: true
    enableStatefulSetAutoDeletePVC: false
    inMemory: false
    size: 10Gi
    storageClass: loki-storage
    whenDeleted: Retain
    whenScaled: Retain
  podManagementPolicy: Parallel
  replicas: 2
  resources:
    limits:
      cpu: 4
      memory: 16Gi
    requests:
      cpu: 500m
      memory: 1Gi
  terminationGracePeriodSeconds: 300
ingester:
  autoscaling:
    enabled: false
  extraArgs:
    - '-config.expand-env=true'
  kind: StatefulSet
  maxUnavailable: 2
  persistence:
    claims:
      - name: data
        size: 100Gi
        storageClass: loki-storage
    enabled: true
    enableStatefulSetAutoDeletePVC: false
    inMemory: false
    whenDeleted: Retain
    whenScaled: Retain
  replicas: 3
  resources:
    limits:
      cpu: 4
      memory: 16Gi
    requests:
      cpu: 500m
      memory: 1Gi
  terminationGracePeriodSeconds: 300
  zoneAwareReplication:
    enabled: false
ingress:
  enabled: false
loki:
  auth_enabled: false
  commonConfig:
    path_prefix: /var/loki
    replication_factor: 3
    compactor_address: '{{ include "loki.compactorAddress" . }}'
  compactor:
    compaction_interval: 10m
    delete_batch_size: 100
    delete_request_store: s3
    max_compaction_parallelism: 10
    retention_enabled: true
    retention_delete_delay: 1h
    retention_delete_worker_count: 150
    retention_table_timeout: 30m
    working_directory: /var/loki/retention
  configStorageType: ConfigMap
  ingester:
    chunk_block_size: 262144
    chunk_encoding: snappy
    chunk_idle_period: 30m
    chunk_target_size: 1572864
    concurrent_flushes: 128
    flush_check_period: 10s
    flush_op_backoff:
      min_period: 10s
      max_period: 1m
      max_retries: 10
    max_chunk_age: 2h
    wal:
      enabled: true
      flush_on_shutdown: true
      replay_memory_ceiling: 8GB
  limits_config:
    ingestion_rate_mb: 1024
    ingestion_burst_size_mb: 2048
    retention_period: 24h
  livenessProbe:
    httpGet:
      path: /ready
      port: http-metrics
    initialDelaySeconds: 900
  memcached:
    chunk_cache:
      enabled: false
    results_cache:
      enabled: false
  podAnnotations:
    prometheus.io.scrape: "true"
    prometheus.io.port: "3100"
    prometheus.io.path: "/metrics"
  podLabels: null
  readinessProbe:
    httpGet:
      path: /ready
      port: http-metrics
    initialDelaySeconds: 600
    timeoutSeconds: 5
  revisionHistoryLimit: 2
  schemaConfig:
    configs:
      - from: "2024-12-01"
        store: tsdb
        object_store: s3
        schema: v13
        index:
          prefix: loki_index_
          period: 24h
        chunks:
          prefix: loki_chunks_
          period: 24h
  storage:
    bucketNames:
    s3:
      signature_version: v4
    type: s3
  storage_config:
    tsdb_shipper:
      active_index_directory: /var/loki/index
      cache_location: /var/loki/index_cache
      cache_ttl: 24h
      index_gateway_client:
        server_address: '{{ include "loki.indexGatewayAddress" . }}'
  structuredConfig: { }
lokiCanary:
  enabled: false
memcached:
  memcachedExporter:
    enabled: false
minio:
  enabled: false
networkPolicy:
  enabled: false
prometheusRule:
  enabled: false
querier:
  autoscaling:
    enabled: false
  extraArgs:
    - '-config.expand-env=true'
  maxUnavailable: 2
  persistence:
    enabled: true
    size: 10Gi
    storageClass: loki-storage
  replicas: 2
  resources:
    limits:
      cpu: 8
      memory: 16Gi
    requests:
      cpu: 500m
      memory: 1Gi
  terminationGracePeriodSeconds: 120
queryFrontend:
  autoscaling:
    enabled: false
  extraArgs:
    - '-config.expand-env=true'
  maxUnavailable: 2
  replicas: 2
  resources:
    limits:
      cpu: 8
      memory: 16Gi
    requests:
      cpu: 100m
      memory: 200Mi
  terminationGracePeriodSeconds: 30
queryScheduler:
  enabled: true
  extraArgs:
    - '-config.expand-env=true'
  replicas: 2
  resources:
    limits:
      cpu: 8
      memory: 16Gi
    requests:
      cpu: 100m
      memory: 200Mi
  terminationGracePeriodSeconds: 30
read:
  replicas: 0
resultsCache:
  enabled: false
rollout_operator:
  enabled: false
ruler:
  enabled: false
serviceAccount:
  automountServiceAccountToken: true
  create: false
  name: lokisa
serviceMonitor:
  enabled: false
sidecar:
  rules:
    enabled: false
tableManager:
  enabled: false
test:
  enabled: false
write:
  replicas: 0
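One idea I am considering as a temporary workaround (not something I have confirmed anywhere) is overriding the ingester memory limit so that ingester-8 has more headroom while it replays its WAL, roughly like this; the 24Gi figure is only a placeholder:

ingester:
  resources:
    limits:
      cpu: 4
      # placeholder value, only meant as temporary headroom for WAL replay on restart
      memory: 24Gi
    requests:
      cpu: 500m
      memory: 1Gi

I am not sure whether this is the right approach or whether the index/WAL state for ingester-8 needs to be cleaned up instead, so any guidance would be appreciated.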
The following screenshot shows other Loki ingester metrics from around the time of the crash: