I’m setting up a self-hosted Loki deployment on AWS EC2 (m4.xlarge) using the simple scalable deployment mode, with AWS S3 as the object store. Here’s what my setup looks like:
- 6 read pods
- 3 write pods
- 3 backend pods
- 1 read-cache and 1 write-cache pod (using Memcached)
- CPU usage is under 10%, and I have around 8 GiB of free RAM.
Despite this, query performance is very poor. Even a basic query over the last 30 minutes (~2.1 GB of data) times out and takes 2–3 tries to complete. In many cases queries time out, and I haven’t found any helpful errors in the logs.

I suspect the issue might be related to parallelization settings or chunk-related configs (like chunk size or the maximum age before flushing), but I’m having a hard time figuring out an ideal configuration.

My goal is to fully utilize the available AWS resources and bring query times down to a few seconds for small queries, and ideally no more than ~30 seconds for large queries over tens of GBs.

I would really appreciate any insights, tuning tips, or configuration advice from anyone who has had success optimizing Loki performance in a similar setup.
My current Loki configuration:
server:
  http_listen_port: 3100
  grpc_listen_port: 9095
memberlist:
  join_members:
    - loki-backend:7946
  bind_port: 7946
common:
  replication_factor: 3
  compactor_address: http://loki-backend:3100
  path_prefix: /var/loki
  storage:
    s3:
      bucketnames: stage-loki-chunks
      region: ap-south-1
  ring:
    kvstore:
      store: memberlist
compactor:
  working_directory: /var/loki/retention
  compaction_interval: 10m
  retention_enabled: false # Disabled retention deletion
ingester:
  chunk_idle_period: 1h
  wal:
    enabled: true
    dir: /var/loki/wal
  max_chunk_age: 1h
  chunk_retain_period: 3h
  chunk_encoding: snappy
  chunk_target_size: 5242880
  chunk_block_size: 262144
limits_config:
  allow_structured_metadata: true
  ingestion_rate_mb: 20
  ingestion_burst_size_mb: 40
  split_queries_by_interval: 15m
  max_query_parallelism: 32
  max_query_series: 10000
  query_timeout: 5m
  tsdb_max_query_parallelism: 32
# Write path caching (for chunks)
chunk_store_config:
  chunk_cache_config:
    memcached:
      batch_size: 64
      parallelism: 8
    memcached_client:
      addresses: write-cache:11211
      max_idle_conns: 16
      timeout: 200ms
# Read path caching (for query results)
query_range:
  align_queries_with_step: true
  cache_results: true
  results_cache:
    cache:
      default_validity: 24h
      memcached:
        expiration: 24h
        batch_size: 64
        parallelism: 32
      memcached_client:
        addresses: read-cache:11211
        max_idle_conns: 32
        timeout: 200ms
pattern_ingester:
  enabled: true
querier:
  max_concurrent: 20
frontend:
  log_queries_longer_than: 5s
  compress_responses: true
ruler:
  storage:
    type: s3
    s3:
      bucketnames: stage-loki-ruler
      region: ap-south-1
      s3forcepathstyle: false
schema_config:
  configs:
    - from: "2024-04-01"
      store: tsdb
      object_store: s3
      schema: v13
      index:
        prefix: loki_index_
        period: 24h
storage_config:
  aws:
    s3forcepathstyle: false
    s3: https://s3.region-name.amazonaws.com
  tsdb_shipper:
    query_ready_num_days: 1
    active_index_directory: /var/loki/tsdb-index
    cache_location: /var/loki/tsdb-cache
    cache_ttl: 24h
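
To sanity-check the parallelism settings above, here is a rough back-of-the-envelope sketch in Python. It is only an approximation under stated assumptions: it counts sub-queries produced by time splitting alone (TSDB sharding can add more, which this ignores), it assumes the query range is aligned to the split interval, and it assumes each read pod runs one querier with `max_concurrent` slots. The variable and function names are illustrative, not Loki identifiers.

```python
import math

# Values copied from the post and the configuration above.
read_pods = 6
querier_max_concurrent = 20    # querier.max_concurrent
split_interval_min = 15        # limits_config.split_queries_by_interval
max_query_parallelism = 32     # limits_config.max_query_parallelism
query_range_min = 30           # the "last 30 minutes" query


def time_splits(range_min: int, interval_min: int) -> int:
    """Sub-queries from time splitting alone.

    Assumes the range is aligned to the interval; an unaligned
    range can produce one extra sub-query.
    """
    return math.ceil(range_min / interval_min)


splits = time_splits(query_range_min, split_interval_min)

# Assumption: one querier per read pod, each with max_concurrent slots.
total_slots = read_pods * querier_max_concurrent

# Sub-queries that could run at once, ignoring sharding.
in_flight = min(splits, max_query_parallelism, total_slots)

print(f"time splits: {splits}")
print(f"querier slots across read pods: {total_slots}")
print(f"sub-queries in flight (time splitting only): {in_flight}")
```

Under these assumptions, the 30-minute query is divided into only 2 time splits while the read pods offer 120 slots, so if sharding is not adding much parallelism, a smaller `split_queries_by_interval` might be worth testing.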