Title:
Loki Queries Taking Too Long When Using S3 Storage Backend (Loki v3.x)
Body:
Hi all,
We’re currently facing performance issues with Grafana Loki when running log queries through Grafana. Below are the details of our setup and the problem we are encountering:
Environment Details:
- Grafana Loki Version: v3.x (recent)
- Deployment Mode: Distributed
- Storage Backend: Amazon S3 (TSDB schema v13)
- Log Volume: Moderate to high (structured logs from EKS workloads)
- Querier Setup: Querier + Query-Frontend + Distributor with autoscaling
- S3 Configuration: Using IRSA for S3 access
- Index Gateway: Enabled
- Auth Mode: Single-tenant, auth disabled
- Compactor: Enabled and running periodically
- Chunk Compression: Snappy
Issue:
When running log queries in Grafana, even for relatively short time ranges such as 5 days, response times are very slow, often 3–5 minutes or more. For longer ranges (10+ days), we sometimes hit 504 Gateway Timeout errors.
Even with tsdb_max_query_parallelism and querier.max_concurrent set to fairly high values, performance does not improve significantly.
Tuning Parameters Already Applied:
limits_config:
tsdb_max_query_parallelism: 16
max_query_lookback: 30d
querier:
max_concurrent: 4
query_range:
split_queries_by_interval: 1h
parallelise_shardable_queries: true
Loki Configuration:
loki:
auth_enabled: false # Disable basic auth; assumes secured access via other means
schemaConfig:
configs:
- from: "2025-06-01"
store: tsdb # Use TSDB for storage format
object_store: s3 # Store logs in AWS S3
schema: v13 # Loki schema version
index:
prefix: index_ # Prefix for index files in object store
period: 24h # Index period (rollover every 24h)
storage:
bucketNames:
chunks: monitoring-loki # S3 bucket for storing chunks
storage_config:
aws:
bucketnames: monitoring-loki # S3 bucket name
region: us-east-1
s3forcepathstyle: true # Required for some S3-compatible stores
sse:
type: SSE-S3 # Enable server-side encryption
tsdb_shipper:
active_index_directory: /var/loki/index
cache_location: /var/loki/index_cache # Local index cache
server:
http_server_read_timeout: 600s
http_server_write_timeout: 600s
http_server_idle_timeout: 1200s
grpc_server_max_recv_msg_size: 16777216 # Max gRPC receive size (16MB)
grpc_server_max_send_msg_size: 16777216 # Max gRPC send size (16MB)
ingester:
autoforget_unhealthy: true
chunk_encoding: snappy # Compress chunks with snappy
# Controls how long to keep a chunk open if no new logs arrive
chunk_idle_period: 2m # faster flushing helps during pod shutdowns
# Maximum age of a chunk before it's flushed
max_chunk_age: 1h # balance between memory usage and flush frequency
# How long to retain flushed chunks in memory
chunk_retain_period: 1h # Retain flushed chunks in memory
wal:
enabled: true
dir: /var/loki/wal
flush_on_shutdown: true
checkpoint_duration: 2m # More frequent checkpoints = faster recovery after crash
pattern_ingester:
enabled: false # Disabled by default
limits_config:
max_query_length: 30d
max_query_parallelism: 16
max_chunks_per_query: 2000000
split_queries_by_interval: 1h # Split queries into 1-hour intervals
query_timeout: 5m
max_entries_limit_per_query: 10000
allow_structured_metadata: true
volume_enabled: true
retention_period: 2160h # Global retention (90 days)
retention_stream:
- selector: '{type="archive"}'
priority: 1
period: 438300h # Retention for archive logs (50 years)
ingestion_rate_mb: 20
ingestion_burst_size_mb: 40
per_stream_rate_limit: 20M
per_stream_rate_limit_burst: 40M
max_line_size: 4194304 # Max line size for logs (4MB)
increment_duplicate_timestamp: true # Increment duplicate timestamps to avoid issues with log ingestion
compactor:
retention_enabled: true
apply_retention_interval: 10m
delete_request_store: s3 # Delete data directly from S3
querier:
max_concurrent: 4 # Max concurrent queries
query_store_only: true # Only query the store, not the ingesters
query_range:
align_queries_with_step: true
cache_results: true # Enable query result caching
cache_config:
default_validity: 2h
embedded_cache:
enabled: false # Disable embedded cache
max_size_mb: 512 # Max size for embedded cache
ttl: 2h
serviceAccount:
create: true
name: loki
annotations:
"eks.amazonaws.com/role-arn": "arn:aws:iam::xxxxxxxxx:role/monitoring-loki"
deploymentMode: Distributed # Enable distributed mode
# -------------------------------
# Ingesters
# -------------------------------
ingester:
lifecycle:
preStop:
exec:
command: ["/bin/sh", "-c", "wget -qO- localhost:3100/ingester/shutdown && sleep 300"]
zoneAwareReplication:
# Assign nodepool per zone
zoneA:
nodeSelector:
karpenter.sh/nodepool: monitoring-write
zoneB:
nodeSelector:
karpenter.sh/nodepool: monitoring-write
zoneC:
nodeSelector:
karpenter.sh/nodepool: monitoring-write
autoscaling:
enabled: true
minReplicas: 3
maxReplicas: 21
targetCPUUtilizationPercentage: 80
targetMemoryUtilizationPercentage: 80
persistence:
enabled: true
size: 10Gi
storageClass: gp3
accessModes:
- ReadWriteOnce
resources:
requests:
cpu: "200m"
memory: "2Gi"
limits:
cpu: "300m"
memory: "3Gi"
# -------------------------------
# Compactor
# -------------------------------
compactor:
replicas: 1
nodeSelector:
karpenter.sh/nodepool: monitoring-write
persistence:
enabled: true
size: 10Gi
storageClass: gp3
resources:
requests:
cpu: "100m"
memory: "256Mi"
limits:
cpu: "250m"
memory: "512Mi"
# -------------------------------
# Distributor
# -------------------------------
distributor:
lifecycle:
preStop:
exec:
command: ["/bin/sh", "-c", "sleep 20"]
autoscaling:
enabled: true
minReplicas: 2
maxReplicas: 20
targetCPUUtilizationPercentage: 70
targetMemoryUtilizationPercentage: 70
nodeSelector:
karpenter.sh/nodepool: monitoring-write
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: app.kubernetes.io/component
operator: In
values:
- distributor
topologyKey: "kubernetes.io/hostname"
resources:
requests:
cpu: "200m"
memory: "256Mi"
limits:
cpu: "300m"
memory: "512Mi"
# -------------------------------
# Index Gateway
# -------------------------------
indexGateway:
replicas: 1
maxUnavailable: 1
nodeSelector:
karpenter.sh/nodepool: monitoring-read
persistence:
enabled: true
size: 10Gi
storageClass: gp3
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: app.kubernetes.io/component
operator: In
values:
- index-gateway
topologyKey: "kubernetes.io/hostname"
resources:
requests:
cpu: "50m"
memory: "256Mi"
limits:
cpu: "100m"
memory: "512Mi"
# -------------------------------
# Querier
# -------------------------------
querier:
autoscaling:
enabled: true
minReplicas: 2
maxReplicas: 20
targetCPUUtilizationPercentage: 70
targetMemoryUtilizationPercentage: 70
affinity: null
nodeSelector:
karpenter.sh/nodepool: monitoring-read
resources:
requests:
cpu: "1"
memory: "1Gi"
limits:
cpu: "2"
memory: "2Gi"
# -------------------------------
# Query Frontend
# -------------------------------
queryFrontend:
autoscaling:
enabled: true
minReplicas: 1
maxReplicas: 10
targetCPUUtilizationPercentage: 70
targetMemoryUtilizationPercentage: 70
nodeSelector:
karpenter.sh/nodepool: monitoring-read
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: app.kubernetes.io/component
operator: In
values:
- query-frontend
topologyKey: "kubernetes.io/hostname"
resources:
requests:
cpu: "50m"
memory: "256Mi"
limits:
cpu: "100m"
memory: "512Mi"
# -------------------------------
# Query Scheduler
# -------------------------------
queryScheduler:
replicas: 1
maxUnavailable: 1
nodeSelector:
karpenter.sh/nodepool: monitoring-read
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: app.kubernetes.io/component
operator: In
values:
- query-scheduler
topologyKey: "kubernetes.io/hostname"
resources:
requests:
cpu: "50m"
memory: "128Mi"
limits:
cpu: "100m"
memory: "256Mi"
# -------------------------------
# Monitoring
# -------------------------------
monitoring:
serviceMonitor:
enabled: true # Enable Prometheus monitoring
interval: 15s
scrapeTimeout: 15s # optional
labels:
release: prometheus
# -------------------------------
# Disabled Components
# -------------------------------
gateway:
enabled: false
minio:
enabled: false
chunksCache:
enabled: false
lokiCanary:
enabled: false
test:
enabled: false
ruler:
enabled: false
resultsCache:
enabled: false
backend:
replicas: 0
read:
replicas: 0
write:
replicas: 0
singleBinary:
replicas: 0
Observations:
- CPU usage spikes heavily in ingesters if query_store_only: false is set.
- Setting query_store_only: true increases query duration significantly, but avoids CPU spikes.
- We suspect slow S3 read operations or inefficient index lookups as a bottleneck.
- Loki components (queriers, query-frontends) scale up, but latency does not improve noticeably.
- count_over_time() and rate() style metric queries suffer the most; a representative example is shown below.
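For context, this is the kind of metric query that performs worst for us (a detected-level histogram over a filtered stream):
# count_over_time over a filtered stream, aggregated by log level; over
# multi-day ranges this is the query type that takes minutes against S3-backed chunks.
sum by (level, detected_level) (
  count_over_time({service_name="ssai"} |= `` | drop __error__[1h])
)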
What We’re Looking For:
- Any tuning recommendations for faster queries from object storage (S3)
- Advice on optimizing index/compactor/query layer for long-range queries
- Suggested S3 configuration or alternatives to improve query fetch speed
- Best practices for query_range tuning in high log volume environments
- Monitoring tips to trace which component is the bottleneck (querier vs S3 vs compactor)
Any insights or similar experiences would be greatly appreciated. Please let me know if further logs or config snippets are required.
Thanks!
Dhaval
For query splitting to work in parallel you’ll need the query frontend. See Query frontend example | Grafana Loki documentation
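For reference, a minimal sketch of the splitting-related settings, condensed from the config already shared in this thread (key placement can differ by Loki/Helm chart version, so treat this as illustrative rather than exact):
limits_config:
  split_queries_by_interval: 1h        # each 1h slice becomes a sub-query the frontend fans out
  max_query_parallelism: 32            # cap on how many sub-queries of one query run concurrently
query_range:
  parallelise_shardable_queries: true  # additionally shard sub-queries where possible
frontend_worker:
  scheduler_address: loki-query-scheduler.loki.svc.cluster.local:9095  # queriers pull the split work from the scheduler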
Thank you for the prompt response! However, I am already using the query frontend (queryFrontend).
queryFrontend Configuration
queryFrontend:
autoscaling:
enabled: true
minReplicas: 1
maxReplicas: 10
targetCPUUtilizationPercentage: 70
targetMemoryUtilizationPercentage: 70
nodeSelector:
karpenter.sh/nodepool: monitoring-read
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: app.kubernetes.io/component
operator: In
values:
- query-frontend
topologyKey: "kubernetes.io/hostname"
resources:
requests:
cpu: "50m"
memory: "256Mi"
limits:
cpu: "100m"
memory: "512Mi"
loki configMap.yaml
auth_enabled: false
bloom_build:
builder:
planner_address: loki-bloom-planner-headless.loki.svc.cluster.local:9095
enabled: false
bloom_gateway:
client:
addresses: dnssrvnoa+_grpc._tcp.loki-bloom-gateway-headless.loki.svc.cluster.local
enabled: false
common:
compactor_address: 'http://loki-compactor:3100'
path_prefix: /var/loki
replication_factor: 3
storage:
s3:
bucketnames: monitoring-loki
insecure: false
s3forcepathstyle: false
compactor:
apply_retention_interval: 10m
delete_request_store: s3
retention_enabled: true
frontend:
scheduler_address: loki-query-scheduler.loki.svc.cluster.local:9095
tail_proxy_url: http://loki-querier.loki.svc.cluster.local:3100
frontend_worker:
scheduler_address: loki-query-scheduler.loki.svc.cluster.local:9095
index_gateway:
mode: simple
ingester:
autoforget_unhealthy: true
chunk_encoding: snappy
chunk_idle_period: 30m
chunk_retain_period: 1h
chunk_target_size: 20971520
max_chunk_age: 1h
wal:
checkpoint_duration: 2m
dir: /var/loki/wal
enabled: true
flush_on_shutdown: true
limits_config:
allow_structured_metadata: true
increment_duplicate_timestamp: true
ingestion_burst_size_mb: 40
ingestion_rate_mb: 20
max_cache_freshness_per_query: 10m
max_chunks_per_query: 2000000
max_entries_limit_per_query: 10000
max_line_size: 4194304
max_query_length: 30d
max_query_parallelism: 32
per_stream_rate_limit: 20M
per_stream_rate_limit_burst: 40M
query_timeout: 5m
reject_old_samples: true
reject_old_samples_max_age: 168h
retention_period: 2160h
retention_stream:
- period: 438300h
priority: 1
selector: '{type="archive"}'
split_queries_by_interval: 1h
tsdb_max_query_parallelism: 40
volume_enabled: true
memberlist:
join_members:
- loki-memberlist
pattern_ingester:
enabled: false
querier:
max_concurrent: 4
query_store_only: false
query_range:
align_queries_with_step: true
cache_results: true
runtime_config:
file: /etc/loki/runtime-config/runtime-config.yaml
schema_config:
configs:
- from: "2025-06-01"
index:
period: 24h
prefix: index_
object_store: s3
schema: v13
store: tsdb
server:
grpc_listen_port: 9095
grpc_server_max_recv_msg_size: 16777216
grpc_server_max_send_msg_size: 16777216
http_listen_port: 3100
http_server_idle_timeout: 1200s
http_server_read_timeout: 600s
http_server_write_timeout: 600s
storage_config:
aws:
bucketnames: monitoring-loki
region: us-east-1
s3forcepathstyle: true
sse:
type: SSE-S3
bloom_shipper:
working_directory: /var/loki/data/bloomshipper
boltdb_shipper:
index_gateway_client:
server_address: dns+loki-index-gateway-headless.loki.svc.cluster.local:9095
hedging:
at: 250ms
max_per_second: 20
up_to: 3
tsdb_shipper:
active_index_directory: /var/loki/index
cache_location: /var/loki/index_cache
index_gateway_client:
server_address: dns+loki-index-gateway-headless.loki.svc.cluster.local:9095
tracing:
enabled: false
If there is anything I have missed, please let me know!
Thanks in advance!
Ah ok, I didn’t quite catch that in your original post.
I’d say check a couple of things:
- Check query frontend logs and make sure queries are actually being split.
- Check Loki metrics and see if there are any errors or delays on S3 operations (see the example query after this list).
- How big are your logs? A ballpark size per day is fine.
- How many queriers do you usually run? What’s the time frame you want to be able to query from?
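For the metrics check, a rough PromQL starting point, assuming Loki’s request-duration histogram is scraped by your Prometheus (the pod selector is an assumption based on typical kube-prometheus relabeling; adjust to your labels):
# p99 request latency per route on the queriers; run the same query against
# the query-frontend pods to compare where time is being spent.
histogram_quantile(
  0.99,
  sum by (le, route) (
    rate(loki_request_duration_seconds_bucket{pod=~"loki-querier.*"}[5m])
  )
)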
Hello,
Thank you for your response!
Please find the answers below:
- Check query frontend logs and make sure queries are actually being split.
Ans: Confirmed from logs that queries are being split:
2025-08-06T10:39:35.429+05:30 ts=2025-08-06T05:09:35.32858555Z caller=spanlogger.go:111 middleware=QueryShard.astMapperware org_id=fake user=fake caller=log.go:168 level=warn msg="failed mapping AST" err="context canceled" query="{service_name=\"ssai\"} |= ``"
2025-08-06T10:39:36.628+05:30 level=info ts=2025-08-06T05:09:36.544193337Z caller=metrics.go:237 component=frontend org_id=fake latency=fast query="{service_name=\"ssai\"} |= ``" query_hash=2015187428 query_type=limited range_type=range length=24h0m0s start_delta=82h39m37.544164144s end_delta=58h39m37.544164292s step=2m0s duration=6.223712488s status=200 limit=10000 returned_lines=0 throughput=12MB total_bytes=74MB total_bytes_structured_metadata=8.8MB lines_per_second=47106 total_lines=293176 post_filter_lines=293176 total_entries=10000 store_chunks_download_time=5.119186253s queue_time=7.985999ms splits=1 shards=8 query_referenced_structured_metadata=false pipeline_wrapper_filtered_lines=0 chunk_refs_fetch_time=67.754584ms cache_chunk_req=800 cache_chunk_hit=0 cache_chunk_bytes_stored=598086552 cache_chunk_bytes_fetched=0 cache_chunk_download_time=368.375µs cache_index_req=0 cache_index_hit=0 cache_index_download_time=0s cache_stats_results_req=1 cache_stats_results_hit=1 cache_stats_results_download_time=8.41µs cache_volume_results_req=0 cache_volume_results_hit=0 cache_volume_results_download_time=0s cache_result_req=0 cache_result_hit=0 cache_result_download_time=0s cache_result_query_length_served=0s cardinality_estimate=0 ingester_chunk_refs=0 ingester_chunk_downloaded=0 ingester_chunk_matches=0 ingester_requests=0 ingester_chunk_head_bytes=0B ingester_chunk_compressed_bytes=0B ingester_chunk_decompressed_bytes=0B ingester_post_filter_lines=0 congestion_control_latency=0s index_total_chunks=1122 index_post_bloom_filter_chunks=1122 index_bloom_filter_ratio=0.00 index_used_bloom_filters=false index_shard_resolver_duration=0s disable_pipeline_wrappers=false has_labelfilter_before_parser=false
2025-08-06T10:39:39.531+05:30 level=info ts=2025-08-06T05:09:39.487871005Z caller=metrics.go:237 component=frontend org_id=fake latency=slow query="sum by (level, detected_level) (count_over_time({service_name=\"ssai\"} |= `` | drop __error__[1h]))" query_hash=141503800 query_type=metric range_type=range length=29m59s start_delta=59h9m39.487852995s end_delta=58h39m40.487853118s step=1h0m0s duration=10.377923499s status=200 limit=100 returned_lines=0 throughput=607MB total_bytes=6.3GB total_bytes_structured_metadata=848MB lines_per_second=2722216 total_lines=28250954 post_filter_lines=28250954 total_entries=4 store_chunks_download_time=16.478401679s queue_time=5.50771ms splits=0 shards=8 query_referenced_structured_metadata=true pipeline_wrapper_filtered_lines=0 chunk_refs_fetch_time=76.719841ms cache_chunk_req=1899 cache_chunk_hit=0 cache_chunk_bytes_stored=1683178169 cache_chunk_bytes_fetched=0 cache_chunk_download_time=1.039788ms cache_index_req=0 cache_index_hit=0 cache_index_download_time=0s cache_stats_results_req=1 cache_stats_results_hit=1 cache_stats_results_download_time=5.104µs cache_volume_results_req=0 cache_volume_results_hit=0 cache_volume_results_download_time=0s cache_result_req=1 cache_result_hit=0 cache_result_download_time=8.033µs cache_result_query_length_served=0s cardinality_estimate=0 ingester_chunk_refs=0 ingester_chunk_downloaded=0 ingester_chunk_matches=0 ingester_requests=0 ingester_chunk_head_bytes=0B ingester_chunk_compressed_bytes=0B ingester_chunk_decompressed_bytes=0B ingester_post_filter_lines=0 congestion_control_latency=0s index_total_chunks=1899 index_post_bloom_filter_chunks=1899 index_bloom_filter_ratio=0.00 index_used_bloom_filters=false index_shard_resolver_duration=0s source=logvolhist disable_pipeline_wrappers=false has_labelfilter_before_parser=false
2025-08-06T10:39:40.231+05:30 level=info ts=2025-08-06T05:09:40.180580397Z caller=roundtrip.go:359 org_id=fake msg="executing query" type=range query="sum by (level, detected_level) (count_over_time({service_name=\"ssai\"} |= `` | drop __error__[1h]))" start=2025-08-02T18:00:00Z end=2025-08-03T17:00:00Z start_delta=83h9m40.180575384s end_delta=60h9m40.180575663s length=23h0m0s step=3600000 query_hash=141503800
2025-08-06T10:40:45.431+05:30 level=info ts=2025-08-06T05:10:45.393488513Z caller=metrics.go:237 component=frontend org_id=fake latency=slow query="sum by (level, detected_level) (count_over_time({service_name=\"ssai\"} |= `` | drop __error__[1h]))" query_hash=141503800 query_type=metric range_type=range length=23h0m0s start_delta=83h10m45.393467073s end_delta=60h10m45.393467204s step=1h0m0s duration=1m5.212606515s status=200 limit=100 returned_lines=0 throughput=4.4GB total_bytes=284GB total_bytes_structured_metadata=37GB lines_per_second=19066872 total_lines=1243400452 post_filter_lines=1243400452 total_entries=4 store_chunks_download_time=12m0.288678055s queue_time=175.751401ms splits=23 shards=368 query_referenced_structured_metadata=true pipeline_wrapper_filtered_lines=0 chunk_refs_fetch_time=4.302694283s cache_chunk_req=85364 cache_chunk_hit=32 cache_chunk_bytes_stored=73082618051 cache_chunk_bytes_fetched=5983215 cache_chunk_download_time=49.557595ms cache_index_req=0 cache_index_hit=0 cache_index_download_time=0s cache_stats_results_req=24 cache_stats_results_hit=24 cache_stats_results_download_time=129.781µs cache_volume_results_req=0 cache_volume_results_hit=0 cache_volume_results_download_time=0s cache_result_req=23 cache_result_hit=0 cache_result_download_time=95.517µs cache_result_query_length_served=0s cardinality_estimate=0 ingester_chunk_refs=0 ingester_chunk_downloaded=0 ingester_chunk_matches=0 ingester_requests=0 ingester_chunk_head_bytes=0B ingester_chunk_compressed_bytes=0B ingester_chunk_decompressed_bytes=0B ingester_post_filter_lines=0 congestion_control_latency=0s index_total_chunks=85364 index_post_bloom_filter_chunks=85364 index_bloom_filter_ratio=0.00 index_used_bloom_filters=false index_shard_resolver_duration=0s source=logvolhist disable_pipeline_wrappers=false has_labelfilter_before_parser=false
2025-08-06T10:40:45.831+05:30 level=info ts=2025-08-06T05:10:45.760505145Z caller=roundtrip.go:359 org_id=fake msg="executing query" type=range query="sum by (level, detected_level) (count_over_time({service_name=\"ssai\"} |= `` | drop __error__[1h]))" start=2025-08-01T18:00:00Z end=2025-08-02T17:00:00Z start_delta=107h10m45.76050037s end_delta=84h10m45.760500837s length=23h0m0s step=3600000 query_hash=141503800
2025-08-06T10:41:46.231+05:30 level=info ts=2025-08-06T05:11:46.137238441Z caller=metrics.go:237 component=frontend org_id=fake latency=slow query="sum by (level, detected_level) (count_over_time({service_name=\"ssai\"} |= `` | drop __error__[1h]))" query_hash=141503800 query_type=metric range_type=range length=23h0m0s start_delta=107h11m46.137215557s end_delta=84h11m46.13721568s step=1h0m0s duration=1m0.3764261s status=200 limit=100 returned_lines=0 throughput=4.4GB total_bytes=263GB total_bytes_structured_metadata=35GB lines_per_second=19505412 total_lines=1177667120 post_filter_lines=1177667120 total_entries=4 store_chunks_download_time=10m58.123142771s queue_time=175.835447ms splits=23 shards=368 query_referenced_structured_metadata=true pipeline_wrapper_filtered_lines=0 chunk_refs_fetch_time=3.45767226s cache_chunk_req=87328 cache_chunk_hit=129 cache_chunk_bytes_stored=68205918315 cache_chunk_bytes_fetched=58677775 cache_chunk_download_time=58.424978ms cache_index_req=0 cache_index_hit=0 cache_index_download_time=0s cache_stats_results_req=24 cache_stats_results_hit=24 cache_stats_results_download_time=104.218µs cache_volume_results_req=0 cache_volume_results_hit=0 cache_volume_results_download_time=0s cache_result_req=23 cache_result_hit=0 cache_result_download_time=94.588µs cache_result_query_length_served=0s cardinality_estimate=0 ingester_chunk_refs=0 ingester_chunk_downloaded=0 ingester_chunk_matches=0 ingester_requests=0 ingester_chunk_head_bytes=0B ingester_chunk_compressed_bytes=0B ingester_chunk_decompressed_bytes=0B ingester_post_filter_lines=0 congestion_control_latency=0s index_total_chunks=87328 index_post_bloom_filter_chunks=87328 index_bloom_filter_ratio=0.00 index_used_bloom_filters=false index_shard_resolver_duration=0s source=logvolhist disable_pipeline_wrappers=false has_labelfilter_before_parser=false
2025-08-06T10:41:46.931+05:30 level=info ts=2025-08-06T05:11:46.926211075Z caller=roundtrip.go:359 org_id=fake msg="executing query" type=range query="sum by (level, detected_level) (count_over_time({service_name=\"ssai\"} |= `` | drop __error__[1h]))" start=2025-07-31T18:00:00Z end=2025-08-01T17:00:00Z start_delta=131h11m46.926207448s end_delta=108h11m46.926207965s length=23h0m0s step=3600000 query_hash=141503800
2025-08-06T10:42:49.831+05:30 level=info ts=2025-08-06T05:12:49.811139295Z caller=metrics.go:237 component=frontend org_id=fake latency=slow query="sum by (level, detected_level) (count_over_time({service_name=\"ssai\"} |= `` | drop __error__[1h]))" query_hash=141503800 query_type=metric range_type=range length=23h0m0s start_delta=131h12m49.811110068s end_delta=108h12m49.811110191s step=1h0m0s duration=1m2.884631985s status=200 limit=100 returned_lines=0 throughput=4.4GB total_bytes=278GB total_bytes_structured_metadata=38GB lines_per_second=20133710 total_lines=1266100972 post_filter_lines=1266100972 total_entries=4 store_chunks_download_time=11m26.897681013s queue_time=503.662505ms splits=23 shards=368 query_referenced_structured_metadata=true pipeline_wrapper_filtered_lines=0 chunk_refs_fetch_time=5.605731355s cache_chunk_req=96229 cache_chunk_hit=1 cache_chunk_bytes_stored=72613362090 cache_chunk_bytes_fetched=608191 cache_chunk_download_time=57.692767ms cache_index_req=0 cache_index_hit=0 cache_index_download_time=0s cache_stats_results_req=24 cache_stats_results_hit=24 cache_stats_results_download_time=134.333µs cache_volume_results_req=0 cache_volume_results_hit=0 cache_volume_results_download_time=0s cache_result_req=23 cache_result_hit=0 cache_result_download_time=98.597µs cache_result_query_length_served=0s cardinality_estimate=0 ingester_chunk_refs=0 ingester_chunk_downloaded=0 ingester_chunk_matches=0 ingester_requests=0 ingester_chunk_head_bytes=0B ingester_chunk_compressed_bytes=0B ingester_chunk_decompressed_bytes=0B ingester_post_filter_lines=0 congestion_control_latency=0s index_total_chunks=96229 index_post_bloom_filter_chunks=96229 index_bloom_filter_ratio=0.00 index_used_bloom_filters=false index_shard_resolver_duration=0s source=logvolhist disable_pipeline_wrappers=false has_labelfilter_before_parser=false
2025-08-06T11:19:49.531+05:30 level=info ts=2025-08-06T05:49:49.482089139Z caller=roundtrip.go:414 org_id=fake msg="executing query" type=labels label= length=6h0m0s query=
2025-08-06T11:19:50.231+05:30 level=info ts=2025-08-06T05:49:50.146402913Z caller=metrics.go:292 component=frontend org_id=fake latency=fast query_type=labels splits=2 start=2025-08-05T23:49:41.519Z end=2025-08-06T05:49:41.519Z start_delta=6h0m8.627397884s end_delta=8.6273984s length=6h0m0s duration=664.111405ms status=200 label= query= query_hash=2166136261 total_entries=7 cache_label_results_req=0 cache_label_results_hit=0 cache_label_results_stored=0 cache_label_results_download_time=0s cache_label_results_query_length_served=0s
- Check Loki metrics and see if you see any error or delay on S3 operations.
Ans: From the querier logs, some queries are marked with latency=slow:
level=info ts=2025-08-07T03:50:20.016511512Z caller=metrics.go:237 component=querier org_id=fake latency=slow query="sum by (level,detected_level)(count_over_time({service_name=\"app\"} | drop __error__[1m]))" query_hash=3585720562 query_type=metric range_type=range length=59m0s start_delta=4h50m20.01646929s end_delta=3h51m20.01646954s step=1m0s duration=12.41129697s status=200 limit=100 returned_lines=0 throughput=54MB total_bytes=677MB total_bytes_structured_metadata=77MB lines_per_second=207124 total_lines=2570680 post_filter_lines=2570680 total_entries=191 store_chunks_download_time=4.354657942s queue_time=91.774µs splits=0 shards=0 query_referenced_structured_metadata=true pipeline_wrapper_filtered_lines=0 chunk_refs_fetch_time=5.570947ms cache_chunk_req=203 cache_chunk_hit=0 cache_chunk_bytes_stored=193827985 cache_chunk_bytes_fetched=0 cache_chunk_download_time=118.976µs cache_index_req=0 cache_index_hit=0 cache_index_download_time=0s cache_stats_results_req=0 cache_stats_results_hit=0 cache_stats_results_download_time=0s cache_volume_results_req=0 cache_volume_results_hit=0 cache_volume_results_download_time=0s cache_result_req=0 cache_result_hit=0 cache_result_download_time=0s cache_result_query_length_served=0s cardinality_estimate=0 ingester_chunk_refs=0 ingester_chunk_downloaded=0 ingester_chunk_matches=0 ingester_requests=0 ingester_chunk_head_bytes=0B ingester_chunk_compressed_bytes=0B ingester_chunk_decompressed_bytes=0B ingester_post_filter_lines=0 congestion_control_latency=0s index_total_chunks=203 index_post_bloom_filter_chunks=203 index_bloom_filter_ratio=0.00 index_used_bloom_filters=false index_shard_resolver_duration=0s source=logvolhist disable_pipeline_wrappers=false has_labelfilter_before_parser=false
- How big are your logs? Let’s say ballpark size per day.
Ans: Approximate daily volume:
- Data Volume: 40–45 GiB/day
- Line Count: ~466.7 million lines/day
- How many queriers do you usually run? What’s the time frame you want to be able to query from?
Ans:
- Querier Instances: ~100 queriers during performance testing
- Query windows tested:
  - 1 day: ~1m 10s
  - 3, 5, 7 days: progressively slower (still testing to understand the impact)
Additionally, we increased chunk_target_size from 1.5 MB to 20 MB to reduce the number of PUT requests to the S3 bucket and the number of chunk objects fetched during queries, since we suspect S3 read latency is contributing to the slow performance. However, after applying this change the querier pods became unresponsive and started crashing (BackOff, restarting), making it impossible to retrieve logs from the containers.
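For reference, the relevant ingester snippet after the change (values as deployed; the 1.5 MB figure is the default we moved away from):
ingester:
  chunk_target_size: 20971520   # bytes (~20 MiB); larger chunks mean fewer S3 objects per query
  chunk_encoding: snappy
  max_chunk_age: 1h             # chunks are flushed at this age even if below the target size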
We have already referred to the official Loki query performance blog (link), but haven’t seen significant improvements.
Thanks!