Hi everyone!
We are evaluating a migration from Fluent Bit + OpenSearch to Fluent Bit + Loki for several reasons.
Right now, we are testing Loki with a subset of our logs, around 50 GB/day, which is about 20% of what we store in OpenSearch.
Ingestion performance is very good, as expected.
The main issue is query latency, even with this limited dataset.
Example queries and response times (24-hour range)
Queries such as:
sum(count_over_time({index_name="NAME"} | json [$__auto]))
{index_name="NAME"} | json
Response time: > 1 minute
Unstructured query example (aggregating on a parsed log field):
topk(10, sum by(log_request_path) (count_over_time({index_name="NAME"} |= `Optional Filter` | json [$__range])))
Response time: ~5 minutes
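For clarity on what actually hits Loki: with the 24 h range selected, Grafana expands those variables before sending the query (as far as I understand, $__range becomes the selected range and $__auto becomes the step), so the executed queries look roughly like this (the 1m step is just an example):
sum(count_over_time({index_name="NAME"} | json [1m]))
topk(10, sum by(log_request_path) (count_over_time({index_name="NAME"} |= `Optional Filter` | json [24h])))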
On OpenSearch, similar queries are almost instantaneous.
I understand Loki has different trade-offs, but I was expecting more acceptable query times for a dataset of this size.
Right now, I am the only one running queries. Once we build dashboards and let the development teams use Loki, query volume will grow quickly, and performance could become a serious issue.
Setup and sizing
- Data volume: ~50 GB/day
- Storage: S3
- Scaling: even with maximum scaling, Loki uses more than 2× the resources of our OpenSearch cluster
- Performance: query speed does not improve with more resources
Component sizing:
- Ingester: 2 replicas × 1 CPU / 2 Gi (total 2 CPU / 4 Gi)
- Distributor: 2 replicas × 0.5 CPU / 0.25 Gi (total 1 CPU / 0.5 Gi)
- IndexGateway: 4 replicas × 0.1 CPU / 0.25 Gi (total 0.4 CPU / 1 Gi)
- Querier: 2 replicas × 4 CPU / 2 Gi (total 8 CPU / 4 Gi), autoscaling up to 50 replicas via KEDA (see the sketch after this list)
- QueryFrontend: 2 replicas × 0.5 CPU / 1 Gi (total 1 CPU / 2 Gi)
- QueryScheduler: 2 replicas × 0.1 CPU / 0.12 Gi (total 0.2 CPU / 0.25 Gi)
- Compactor: 1 replica × 0.5 CPU / 0.5 Gi
- ChunkCache: 8 replicas × 0.5 CPU / 16 Gi (total 4 CPU / 128 Gi, EBS 8 × 256 Gi)
- ResultCache: 1 replica × 0.5 CPU / 1 Gi
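For reference, the querier autoscaling is a plain KEDA ScaledObject, roughly like the sketch below (simplified; the Prometheus address and the scaling query are illustrative placeholders, and the exact metric name depends on the Loki version):

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: loki-querier
  namespace: logging
spec:
  scaleTargetRef:
    name: loki-querier            # the querier Deployment
  minReplicaCount: 2
  maxReplicaCount: 50
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090   # placeholder
        # placeholder query: scale out on scheduler queue depth / in-flight sub-queries
        query: sum(loki_query_scheduler_inflight_requests{namespace="logging", quantile="0.75"})
        threshold: "4"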
Note: even with this configuration, Loki uses more than double the resources of our OpenSearch cluster, yet query performance does not improve.
Configuration:
analytics:
  reporting_enabled: false
auth_enabled: true
bloom_build:
  enabled: false
bloom_gateway:
  enabled: false
chunk_store_config:
  chunk_cache_config:
    background:
      writeback_buffer: 1000
      writeback_goroutines: 1
      writeback_size_limit: 500MB
    default_validity: 1h
    memcached:
      batch_size: 16
      parallelism: 24
    memcached_client:
      addresses: dnssrvnoa+_memcached-client._tcp.loki-chunks-cache.logging.svc
      consistent_hash: true
      max_idle_conns: 72
      timeout: 200ms
common:
  compactor_grpc_address: 'loki-compactor.logging.svc.cluster.local:9095'
  path_prefix: /var/loki
  replication_factor: 3
  storage:
    s3:
      bucketnames: bucket-logging-chunks
      insecure: false
      region: us-east-1
      s3forcepathstyle: false
compactor:
  delete_request_store: s3
  retention_enabled: true
distributor:
  ring:
    kvstore:
      store: memberlist
frontend:
  compress_responses: true
  log_queries_longer_than: 5s
  max_outstanding_per_tenant: 4096
  scheduler_address: loki-query-scheduler.logging.svc.cluster.local:9095
  tail_proxy_url: http://loki-querier.logging.svc.cluster.local:3100
frontend_worker:
  scheduler_address: loki-query-scheduler.logging.svc.cluster.local:9095
index_gateway:
  mode: simple
ingester:
  autoforget_unhealthy: true
  lifecycler:
    final_sleep: 0s
    ring:
      heartbeat_timeout: 1m
      kvstore:
        store: memberlist
    unregister_on_shutdown: true
limits_config:
  discover_log_levels: false
  max_cache_freshness_per_query: 10m
  max_entries_limit_per_query: 100000
  query_timeout: 300s
  reject_old_samples: true
  reject_old_samples_max_age: 168h
  retention_stream:
    - period: 2160h
      priority: 100
      selector: '{retention="90d"}'
    - period: 1440h
      priority: 90
      selector: '{retention="60d"}'
    - period: 720h
      priority: 80
      selector: '{retention="30d"}'
    - period: 336h
      priority: 70
      selector: '{retention="14d"}'
    - period: 168h
      priority: 60
      selector: '{retention="7d"}'
    - period: 168h
      priority: 60
      selector: '{retention="default"}'
    - period: 72h
      priority: 50
      selector: '{retention="3d"}'
    - period: 24h
      priority: 41
      selector: '{retention="2d"}'
    - period: 24h
      priority: 40
      selector: '{retention="1d"}'
  split_queries_by_interval: 1h
  volume_enabled: true
memberlist:
  dead_node_reclaim_time: 30s
  gossip_to_dead_nodes_time: 15s
  join_members:
    - loki-memberlist
  left_ingesters_timeout: 30s
pattern_ingester:
  enabled: false
querier:
  max_concurrent: 4
  multi_tenant_queries_enabled: true
  query_ingesters_within: 3h
query_range:
  align_queries_with_step: true
  cache_results: true
  results_cache:
    cache:
      background:
        writeback_buffer: 500000
        writeback_goroutines: 1
        writeback_size_limit: 1GB
      default_validity: 12h
      memcached_client:
        addresses: dnssrvnoa+_memcached-client._tcp.loki-results-cache.logging.svc
        consistent_hash: true
        timeout: 2000ms
        update_interval: 1m
ruler:
  storage:
    s3:
      bucketnames: bucket-logging-ruler
      insecure: false
      region: us-east-1
      s3forcepathstyle: false
    type: s3
  wal:
    dir: /var/loki/ruler-wal
runtime_config:
  file: /etc/loki/runtime-config/runtime-config.yaml
schema_config:
  configs:
    - from: "2025-05-22"
      index:
        period: 24h
        prefix: loki_index_
      object_store: s3
      schema: v13
      store: tsdb
server:
  grpc_listen_port: 9095
  grpc_server_max_recv_msg_size: 104857600
  grpc_server_max_send_msg_size: 104857600
  http_listen_port: 3100
  http_server_read_timeout: 600s
  http_server_write_timeout: 600s
storage_config:
  bloom_shipper:
    working_directory: /var/loki/data/bloomshipper
  boltdb_shipper:
    index_gateway_client:
      server_address: dns+loki-index-gateway-headless.logging.svc.cluster.local:9095
  hedging:
    at: 250ms
    max_per_second: 20
    up_to: 3
  tsdb_shipper:
    index_gateway_client:
      server_address: dns+loki-index-gateway-headless.logging.svc.cluster.local:9095
  use_thanos_objstore: false
tracing:
  enabled: false
ui:
  enabled: true
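One thing worth spelling out about this config (back-of-the-envelope): with split_queries_by_interval: 1h, a 24 h query is split into roughly 24 sub-queries, and with querier max_concurrent: 4 each querier processes at most 4 of them at a time, i.e. about 8 in flight across the 2 base replicas until KEDA scales the queriers out.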
Fluent Bit output:
labels:
  - service_name=fluent-operator
  - cluster=eks-production
  - index_name=$index_name
  - namespace=$kubernetes['namespace_name']
  - retention=$retention
structuredMetadata:
  level: $level
  kube_pod_name: $kubernetes['pod_name']
  kube_pod_ip: $kubernetes['pod_ip']
  kube_host: $kubernetes['host']
  kube_container: $kubernetes['container_name']
  kube_app_kubernetes_io_name: $kubernetes['labels']['app.kubernetes.io/name']
  kube_app_kubernetes_io_instance: $kubernetes['labels']['app.kubernetes.io/instance']
  kube_app_kubernetes_io_component: $kubernetes['labels']['app.kubernetes.io/component']
  kube_app_kubernetes_io_part_of: $kubernetes['labels']['app.kubernetes.io/part-of']
removeKeys:
  - kubernetes
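The idea behind this split is to keep the indexed labels low-cardinality while the structured metadata fields can still be filtered in LogQL without | json, e.g. (hypothetical values):
{index_name="NAME", cluster="eks-production"} | kube_container="my-app" | level="error"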
Has anyone faced similar performance issues with Loki?
Do you have suggestions on how to optimize query latency?
Thanks a lot for any advice!