Hello
I’m running into a critical issue with my Grafana Loki cluster and would appreciate the community’s help in resolving it. Here are the details of my setup and the problem I’m facing:
Configuration:
I have a Grafana Loki cluster consisting of multiple nodes with varying hardware specifications:
Read Nodes:
- RAM: 8GB
- Cores: 4
- Disk: 150GB
- Quantity: 5 nodes
Write Nodes:
- RAM: 8GB
- Cores: 4
- Disk: 150GB
- Quantity: 4 nodes
Promtail Nodes:
- RAM: 4GB
- Cores: 4
- Disk: 200GB
- Quantity: 2 nodes
Issue:
The problem appears when Promtail ships logs from 3 Kafka nodes at a rate of approximately 800 GB per second. At this ingestion rate the Loki write nodes run into trouble, primarily:
- RAM usage: memory consumption on one of the write nodes climbs sharply.
- Node crashes: as the memory pressure grows, that write node eventually crashes.
Loki Configuration:
Here’s my Loki configuration (the same file is shared by all nodes) to provide context:
auth_enabled: true

server:
  http_listen_address: 0.0.0.0
  grpc_listen_address: 0.0.0.0
  http_listen_port: 3100
  grpc_listen_port: 9095
  log_level: info

common:
  ring:
    kvstore:
      store: memberlist
  path_prefix: /etc/loki/
  storage:
    s3:
      access_key_id: access_key
      bucketnames: loki-main
      endpoint: https://s3.node.net
      insecure: false
      region: null
      s3: null
      s3forcepathstyle: true
      secret_access_key: secret_key
  compactor_address: http://node-read1:3100
  replication_factor: 3

memberlist:
  join_members:
    - node-read1:7946
    - node-read2:7946
    - node-read3:7946
    - node-read4:7946
    - node-read5:7946
    - node-write1:7946
    - node-write2:7946
    - node-write3:7946
    - node-write4:7946
  dead_node_reclaim_time: 30s
  gossip_to_dead_nodes_time: 15s
  left_ingesters_timeout: 30s
  gossip_interval: 2s

storage_config:
  boltdb_shipper:
    active_index_directory: /etc/loki/index
    build_per_tenant_index: true
    cache_location: /etc/loki/index_cache
    shared_store: s3
  tsdb_shipper:
    active_index_directory: /etc/loki/tsdb-index
    cache_location: /etc/loki/tsdb-cache
    shared_store: s3
  aws:
    access_key_id: access_key
    bucketnames: loki-main
    endpoint: https://s3.node.net
    insecure: false
    region: null
    s3: null
    s3forcepathstyle: true
    secret_access_key: secret_key

ingester:
  lifecycler:
    join_after: 10s
    observe_period: 5s
    ring:
      replication_factor: 1
      kvstore:
        store: memberlist
    final_sleep: 0s
  chunk_idle_period: 1m
  wal:
    enabled: true
    dir: /loki/wal
  max_chunk_age: 1m
  chunk_retain_period: 30s
  chunk_encoding: snappy
  chunk_target_size: 1.572864e+06
  chunk_block_size: 262144
  flush_op_timeout: 10s

ruler:
  enable_api: true
  enable_sharding: true
  wal:
    dir: /loki/ruler-wal
  storage:
    s3:
      bucketnames: loki-data

distributor:
  ring:
    kvstore:
      store: memberlist

schema_config:
  configs:
    - from: 2020-08-01
      store: boltdb-shipper
      object_store: s3
      schema: v11
      index:
        prefix: index_
        period: 24h
    - from: 2023-07-11
      store: tsdb
      object_store: s3
      schema: v12
      index:
        prefix: index_
        period: 24h
      chunks:
        prefix: chunk_
        period: 24h

limits_config:
  max_cache_freshness_per_query: '10m'
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 48h
  ingestion_rate_mb: 10
  ingestion_burst_size_mb: 20
  # parallelize queries in 15min intervals
  split_queries_by_interval: 15m
  volume_enabled: true

chunk_store_config:
  max_look_back_period: 336h

table_manager:
  retention_deletes_enabled: true
  retention_period: 336h

query_range:
  # make queries more cache-able by aligning them with their step intervals
  align_queries_with_step: true
  max_retries: 5
  parallelise_shardable_queries: true
  cache_results: true

frontend:
  log_queries_longer_than: 5s
  compress_responses: true
  max_outstanding_per_tenant: 2048
  grpc_client_config:
    max_send_msg_size: 104857600
  parallelism: 9

query_scheduler:
  max_outstanding_requests_per_tenant: 32768

querier:
  query_ingesters_within: 2h

compactor:
  working_directory: /etc/loki/
  shared_store: s3
  compaction_interval: 5m
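For what it’s worth, I suspect the very short chunk lifetimes (chunk_idle_period and max_chunk_age of 1m) and the per-tenant limits play a role, since one-minute chunks mean constant flushing and many small chunks held in memory. Below is a sketch of the values I’m thinking of trying; the numbers are my own guesses, not something I’ve validated:

ingester:
  chunk_idle_period: 30m   # let chunks fill up instead of flushing every minute (guess)
  max_chunk_age: 2h        # closer to the documented default (guess)
limits_config:
  ingestion_rate_mb: 10             # unchanged, shown for context
  ingestion_burst_size_mb: 20       # unchanged, shown for context
  per_stream_rate_limit: 5MB        # cap a single hot stream (guess)
  per_stream_rate_limit_burst: 15MB # (guess)

Would raising these actually help, or just shift the memory pressure elsewhere?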
Docker Compose:
I’ve used Docker Compose to manage my Loki containers. Below are snippets of my Docker Compose configuration for both read and write nodes:
Read Nodes:
version: "3"
services:
  loki:
    image: grafana/loki:2.9.0
    network_mode: "host"
    volumes:
      - ./loki-config.yaml:/etc/loki/local-config.yaml
    command: -config.file=/etc/loki/local-config.yaml --target=read
Write Nodes:
version: "3"
services:
  loki:
    image: grafana/loki:2.9.0
    network_mode: "host"
    volumes:
      - ./loki-config.yaml:/etc/loki/local-config.yaml
    command: -config.file=/etc/loki/local-config.yaml --target=write
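I haven’t set any container resource limits, so a single misbehaving write container can consume all 8 GB of RAM on its host. Here is a sketch of what I’m considering adding to the write-node service (my understanding is that Docker Compose v2 honors deploy.resources limits outside Swarm, while older docker-compose needs --compatibility):

services:
  loki:
    image: grafana/loki:2.9.0
    network_mode: "host"
    volumes:
      - ./loki-config.yaml:/etc/loki/local-config.yaml
    command: -config.file=/etc/loki/local-config.yaml --target=write
    deploy:
      resources:
        limits:
          memory: 6G   # leave headroom for the OS; the value is a guess

This wouldn’t fix the underlying memory growth, but it should at least keep the host itself responsive if the container gets OOM-killed.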
Promtail Configuration:
Here’s the relevant snippet of my Promtail configuration for sending logs to the Loki write nodes:
clients:
  - url: http://write-node1:3100/loki/api/v1/push
  - url: http://write-node2:3100/loki/api/v1/push
  - url: http://write-node3:3100/loki/api/v1/push
  - url: http://write-node4:3100/loki/api/v1/push
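The Promtail clients above rely entirely on the default batching and backoff behaviour. Here is a sketch of the client options I’d consider tuning explicitly (these are standard Promtail clients fields; the values shown are close to the documented defaults and would just be my starting point):

clients:
  - url: http://write-node1:3100/loki/api/v1/push
    batchwait: 1s        # wait up to 1s to accumulate a batch before pushing
    batchsize: 1048576   # ~1 MiB per push request
    backoff_config:
      min_period: 500ms  # back off and retry when the write node returns errors
      max_period: 5m
      max_retries: 10
  # ... the same options would be repeated for write-node2 through write-node4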
Request for Assistance:
I’m looking for guidance on how to address the high RAM usage and crashes on the Loki write nodes at this ingestion rate. Any advice on configuration tuning, resource allocation, or other best practices for keeping the cluster stable and performant would be greatly appreciated.
Thank you for your help!