Loki Queries Taking Too Long When Using S3 Storage Backend (Loki v3.x)

Hi all,

We’re currently facing performance issues with Grafana Loki when running log queries through Grafana. Below are the details of our setup and the problem we are encountering:


Environment Details:

  • Grafana Loki Version: v3.x (recent)
  • Deployment Mode: Distributed
  • Storage Backend: Amazon S3 (TSDB schema v13)
  • Log Volume: Moderate to high (structured logs from EKS workloads)
  • Querier Setup: Querier + Query-Frontend + Distributor with autoscaling
  • S3 Configuration: Using IRSA for S3 access
  • Index Gateway: Enabled
  • Auth Mode: Single-tenant, auth disabled
  • Compactor: Enabled and running periodically
  • Chunk Compression: Snappy

Issue:

When running log queries in Grafana (even for relatively short ranges such as 5 days), responses are very slow, often taking 3–5 minutes or more. For longer ranges (10+ days), we sometimes hit 504 Gateway Timeout errors.

Even with tsdb_max_query_parallelism and querier.max_concurrent set to fairly high values, performance does not improve significantly.


Tuning Parameters Already Applied:

limits_config:
  tsdb_max_query_parallelism: 16
  max_query_lookback: 30d

querier:
  max_concurrent: 4

query_range:
  split_queries_by_interval: 1h
  parallelise_shardable_queries: true

Loki Configuration:

loki:
  auth_enabled: false  # Single tenant: disable multi-tenancy auth (X-Scope-OrgID); access is secured via other means

  schemaConfig:
    configs:
      - from: "2025-06-01"
        store: tsdb  # Use TSDB for storage format
        object_store: s3  # Store logs in AWS S3
        schema: v13  # Loki schema version
        index:
          prefix: index_  # Prefix for index files in object store
          period: 24h  # Index period (rollover every 24h)

  storage:
    bucketNames:
      chunks: monitoring-loki # S3 bucket for storing chunks

  storage_config:
    aws:
      bucketnames: monitoring-loki  # S3 bucket name
      region: us-east-1
      s3forcepathstyle: true  # Required for some S3-compatible stores
      sse:
        type: SSE-S3  # Enable server-side encryption
    tsdb_shipper:
      active_index_directory: /var/loki/index
      cache_location: /var/loki/index_cache  # Local index cache

  server:
    http_server_read_timeout: 600s
    http_server_write_timeout: 600s
    http_server_idle_timeout: 1200s
    grpc_server_max_recv_msg_size: 16777216  # Max gRPC receive size (16MB)
    grpc_server_max_send_msg_size: 16777216  # Max gRPC send size (16MB)

  ingester:
    autoforget_unhealthy: true
    chunk_encoding: snappy  # Compress chunks with snappy
    # Controls how long to keep a chunk open if no new logs arrive
    chunk_idle_period: 2m  # faster flushing helps during pod shutdowns
    # Maximum age of a chunk before it's flushed
    max_chunk_age: 1h  # balance between memory usage and flush frequency
    # How long to retain flushed chunks in memory
    chunk_retain_period: 1h  # Retain flushed chunks in memory
    wal:
      enabled: true
      dir: /var/loki/wal
      flush_on_shutdown: true
      checkpoint_duration: 2m  # More frequent checkpoints = faster recovery after crash

  pattern_ingester:
    enabled: false  # Disabled by default

  limits_config:
    max_query_length: 30d
    max_query_parallelism: 16
    max_chunks_per_query: 2000000
    split_queries_by_interval: 1h  # Split queries into 1-hour intervals
    query_timeout: 5m
    max_entries_limit_per_query: 10000
    allow_structured_metadata: true
    volume_enabled: true
    retention_period: 2160h  # Global retention (90 days)
    retention_stream:
       - selector: '{type="archive"}'
         priority: 1
         period: 438300h  # Retention for archive logs (50 years)
    ingestion_rate_mb: 20
    ingestion_burst_size_mb: 40
    per_stream_rate_limit: 20M
    per_stream_rate_limit_burst: 40M
    max_line_size: 4194304 # Max line size for logs (4MB)
    increment_duplicate_timestamp: true  # Increment duplicate timestamps to avoid issues with log ingestion

  compactor:
    retention_enabled: true
    apply_retention_interval: 10m
    delete_request_store: s3  # Store delete requests in S3

  querier:
    max_concurrent: 4  # Max concurrent queries
    query_store_only: true # Only query the store, not the ingesters

  query_range:
    align_queries_with_step: true
    cache_results: true  # Enable query result caching

  cache_config:
    default_validity: 2h
    embedded_cache:
      enabled: false  # Disable embedded cache
      max_size_mb: 512  # Max size for embedded cache
      ttl: 2h

serviceAccount:
  create: true
  name: loki
  annotations:
    "eks.amazonaws.com/role-arn": "arn:aws:iam::xxxxxxxxx:role/monitoring-loki"

deploymentMode: Distributed  # Enable distributed mode

# -------------------------------
# Ingesters
# -------------------------------
ingester:
  lifecycle:
    preStop:
      exec:
        command: ["/bin/sh", "-c", "wget -qO- localhost:3100/ingester/shutdown && sleep 300"]
  zoneAwareReplication:
    # Assign nodepool per zone
    zoneA:
      nodeSelector:
        karpenter.sh/nodepool: monitoring-write
    zoneB:
      nodeSelector:
        karpenter.sh/nodepool: monitoring-write
    zoneC:
      nodeSelector:
        karpenter.sh/nodepool: monitoring-write
  autoscaling:
    enabled: true
    minReplicas: 3
    maxReplicas: 21
    targetCPUUtilizationPercentage: 80
    targetMemoryUtilizationPercentage: 80
  persistence:
    enabled: true
    size: 10Gi
    storageClass: gp3
    accessModes:
      - ReadWriteOnce
  resources:
    requests:
      cpu: "200m"
      memory: "2Gi"
    limits:
      cpu: "300m"
      memory: "3Gi"

# -------------------------------
# Compactor
# -------------------------------
compactor:
  replicas: 1
  nodeSelector:
    karpenter.sh/nodepool: monitoring-write
  persistence:
    enabled: true
    size: 10Gi
    storageClass: gp3
  resources:
    requests:
      cpu: "100m"
      memory: "256Mi"
    limits:
      cpu: "250m"
      memory: "512Mi"

# -------------------------------
# Distributor
# -------------------------------
distributor:
  lifecycle:
    preStop:
      exec:
        command: ["/bin/sh", "-c", "sleep 20"]
  autoscaling:
    enabled: true
    minReplicas: 2
    maxReplicas: 20
    targetCPUUtilizationPercentage: 70
    targetMemoryUtilizationPercentage: 70
  nodeSelector:
    karpenter.sh/nodepool: monitoring-write
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
              - key: app.kubernetes.io/component
                operator: In
                values:
                  - distributor
          topologyKey: "kubernetes.io/hostname"
  resources:
    requests:
      cpu: "200m"
      memory: "256Mi"
    limits:
      cpu: "300m"
      memory: "512Mi"

# -------------------------------
# Index Gateway
# -------------------------------
indexGateway:
  replicas: 1
  maxUnavailable: 1
  nodeSelector:
    karpenter.sh/nodepool: monitoring-read
  persistence:
    enabled: true
    size: 10Gi
    storageClass: gp3
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
              - key: app.kubernetes.io/component
                operator: In
                values:
                  - index-gateway
          topologyKey: "kubernetes.io/hostname"
  resources:
    requests:
      cpu: "50m"
      memory: "256Mi"
    limits:
      cpu: "100m"
      memory: "512Mi"

# -------------------------------
# Querier
# -------------------------------
querier:
  autoscaling:
    enabled: true
    minReplicas: 2
    maxReplicas: 20
    targetCPUUtilizationPercentage: 70
    targetMemoryUtilizationPercentage: 70
  affinity: null  
  nodeSelector:
    karpenter.sh/nodepool: monitoring-read
  resources:
    requests:
      cpu: "1"
      memory: "1Gi"
    limits:
      cpu: "2"
      memory: "2Gi"

# -------------------------------
# Query Frontend
# -------------------------------
queryFrontend:
  autoscaling:
    enabled: true
    minReplicas: 1
    maxReplicas: 10
    targetCPUUtilizationPercentage: 70
    targetMemoryUtilizationPercentage: 70
  nodeSelector:
    karpenter.sh/nodepool: monitoring-read
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
              - key: app.kubernetes.io/component
                operator: In
                values:
                  - query-frontend
          topologyKey: "kubernetes.io/hostname"
  resources:
    requests:
      cpu: "50m"
      memory: "256Mi"
    limits:
      cpu: "100m"
      memory: "512Mi"

# -------------------------------
# Query Scheduler
# -------------------------------
queryScheduler:
  replicas: 1
  maxUnavailable: 1
  nodeSelector:
    karpenter.sh/nodepool: monitoring-read
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
              - key: app.kubernetes.io/component
                operator: In
                values:
                  - query-scheduler
          topologyKey: "kubernetes.io/hostname"
  resources:
    requests:
      cpu: "50m"
      memory: "128Mi"
    limits:
      cpu: "100m"
      memory: "256Mi"

# -------------------------------
# Monitoring
# -------------------------------
monitoring:
  serviceMonitor:
    enabled: true  # Enable Prometheus monitoring
    interval: 15s
    scrapeTimeout: 15s  # optional
    labels:
      release: prometheus

# -------------------------------
# Disabled Components
# -------------------------------
gateway:
  enabled: false
minio:
  enabled: false
chunksCache:
  enabled: false
lokiCanary:
  enabled: false
test:
  enabled: false
ruler:
  enabled: false
resultsCache:
  enabled: false
backend:
  replicas: 0
read:
  replicas: 0
write:
  replicas: 0
singleBinary:
  replicas: 0

Observations:

  • CPU usage spikes heavily in ingesters if query_store_only: false is set.
  • Setting query_store_only: true increases query duration significantly, but avoids CPU spikes.
  • We suspect slow S3 read operations or inefficient index lookups as the bottleneck.
  • Queriers and query-frontends scale up as expected, but the extra replicas don’t seem to improve latency.
  • count_over_time() and rate() style queries suffer the most (a representative example is shown right after this list).
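
A representative example of a query that is slow for us (this is the log-volume histogram that Grafana Explore issues; the same query appears in the query-frontend logs further down this thread):

sum by (level, detected_level) (count_over_time({service_name="ssai"} |= `` | drop __error__[1h]))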

What We’re Looking For:

  • Any tuning recommendations for faster queries from object storage (S3)
  • Advice on optimizing index/compactor/query layer for long-range queries
  • Suggested S3 configuration or alternatives to improve query fetch speed
  • Best practices for query_range tuning in high log volume environments
  • Monitoring tips to trace which component is the bottleneck (querier vs S3 vs compactor)

Any insights or similar experiences would be greatly appreciated. Please let me know if further logs or config snippets are required.

Thanks!
Dhaval

For query splitting to work in parallel you’ll need the query frontend. See the “Query frontend example” page in the Grafana Loki documentation.
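
Roughly speaking, in distributed mode the queriers also need a frontend_worker section pointing at the frontend (or the query scheduler) so they pick up the split sub-queries. A minimal sketch, with placeholder service names:

frontend:
  scheduler_address: <query-scheduler-service>:9095  # placeholder, use your scheduler service
frontend_worker:
  scheduler_address: <query-scheduler-service>:9095  # queriers pull split sub-queries via this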

Thank you for the prompt response! But I am already using the queryFrontend.

queryFrontend Configuration

queryFrontend:
  autoscaling:
    enabled: true
    minReplicas: 1
    maxReplicas: 10
    targetCPUUtilizationPercentage: 70
    targetMemoryUtilizationPercentage: 70
  nodeSelector:
    karpenter.sh/nodepool: monitoring-read
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
              - key: app.kubernetes.io/component
                operator: In
                values:
                  - query-frontend
          topologyKey: "kubernetes.io/hostname"
  resources:
    requests:
      cpu: "50m"
      memory: "256Mi"
    limits:
      cpu: "100m"
      memory: "512Mi"

Loki configMap.yaml

auth_enabled: false
bloom_build:
  builder:
    planner_address: loki-bloom-planner-headless.loki.svc.cluster.local:9095
  enabled: false
bloom_gateway:
  client:
    addresses: dnssrvnoa+_grpc._tcp.loki-bloom-gateway-headless.loki.svc.cluster.local
  enabled: false
common:
  compactor_address: 'http://loki-compactor:3100'
  path_prefix: /var/loki
  replication_factor: 3
  storage:
    s3:
      bucketnames: monitoring-loki
      insecure: false
      s3forcepathstyle: false
compactor:
  apply_retention_interval: 10m
  delete_request_store: s3
  retention_enabled: true
frontend:
  scheduler_address: loki-query-scheduler.loki.svc.cluster.local:9095
  tail_proxy_url: http://loki-querier.loki.svc.cluster.local:3100
frontend_worker:
  scheduler_address: loki-query-scheduler.loki.svc.cluster.local:9095
index_gateway:
  mode: simple
ingester:
  autoforget_unhealthy: true
  chunk_encoding: snappy
  chunk_idle_period: 30m
  chunk_retain_period: 1h
  chunk_target_size: 20971520
  max_chunk_age: 1h
  wal:
    checkpoint_duration: 2m
    dir: /var/loki/wal
    enabled: true
    flush_on_shutdown: true
limits_config:
  allow_structured_metadata: true
  increment_duplicate_timestamp: true
  ingestion_burst_size_mb: 40
  ingestion_rate_mb: 20
  max_cache_freshness_per_query: 10m
  max_chunks_per_query: 2000000
  max_entries_limit_per_query: 10000
  max_line_size: 4194304
  max_query_length: 30d
  max_query_parallelism: 32
  per_stream_rate_limit: 20M
  per_stream_rate_limit_burst: 40M
  query_timeout: 5m
  reject_old_samples: true
  reject_old_samples_max_age: 168h
  retention_period: 2160h
  retention_stream:
  - period: 438300h
    priority: 1
    selector: '{type="archive"}'
  split_queries_by_interval: 1h
  tsdb_max_query_parallelism: 40
  volume_enabled: true
memberlist:
  join_members:
  - loki-memberlist
pattern_ingester:
  enabled: false
querier:
  max_concurrent: 4
  query_store_only: false
query_range:
  align_queries_with_step: true
  cache_results: true
runtime_config:
  file: /etc/loki/runtime-config/runtime-config.yaml
schema_config:
  configs:
  - from: "2025-06-01"
    index:
      period: 24h
      prefix: index_
    object_store: s3
    schema: v13
    store: tsdb
server:
  grpc_listen_port: 9095
  grpc_server_max_recv_msg_size: 16777216
  grpc_server_max_send_msg_size: 16777216
  http_listen_port: 3100
  http_server_idle_timeout: 1200s
  http_server_read_timeout: 600s
  http_server_write_timeout: 600s
storage_config:
  aws:
    bucketnames: monitoring-loki
    region: us-east-1
    s3forcepathstyle: true
    sse:
      type: SSE-S3
  bloom_shipper:
    working_directory: /var/loki/data/bloomshipper
  boltdb_shipper:
    index_gateway_client:
      server_address: dns+loki-index-gateway-headless.loki.svc.cluster.local:9095
  hedging:
    at: 250ms
    max_per_second: 20
    up_to: 3
  tsdb_shipper:
    active_index_directory: /var/loki/index
    cache_location: /var/loki/index_cache
    index_gateway_client:
      server_address: dns+loki-index-gateway-headless.loki.svc.cluster.local:9095
tracing:
  enabled: false

If there is anything I have missed, please let me know!

Thanks in advance!

Ah ok, I didn’t quite see that in your original post.

I’d say check a couple of things:

  1. Check query frontend logs and make sure queries are actually being split.
  2. Check Loki metrics and see if you see any errors or delays on S3 operations (for example, with the queries sketched right after this list).
  3. How big are your logs? Let’s say ballpark size per day.
  4. How many queriers do you usually run? What’s the time frame you want to be able to query from?
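
For point 2, something along these lines is a reasonable starting point if you scrape Loki’s own metrics (I’m assuming the loki_s3_request_duration_seconds histogram is exposed by your version; adjust metric and label names if yours differ):

# p99 latency of S3 GetObject calls as seen by Loki (metric name assumed, check /metrics)
histogram_quantile(0.99, sum by (le, operation) (rate(loki_s3_request_duration_seconds_bucket{operation="S3.GetObject"}[5m])))

# rate of non-2xx S3 requests by operation and status code
sum by (operation, status_code) (rate(loki_s3_request_duration_seconds_count{status_code!~"2.."}[5m]))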

Hello,

Thank you for your response!

Please find the answers below:

  1. Check query frontend logs and make sure queries are actually being split.
    Ans: Confirmed from logs that queries are being split:
2025-08-06T10:39:35.429+05:30 ts=2025-08-06T05:09:35.32858555Z caller=spanlogger.go:111 middleware=QueryShard.astMapperware org_id=fake user=fake caller=log.go:168 level=warn msg="failed mapping AST" err="context canceled" query="{service_name=\"ssai\"} |= ``"
2025-08-06T10:39:36.628+05:30 level=info ts=2025-08-06T05:09:36.544193337Z caller=metrics.go:237 component=frontend org_id=fake latency=fast query="{service_name=\"ssai\"} |= ``" query_hash=2015187428 query_type=limited range_type=range length=24h0m0s start_delta=82h39m37.544164144s end_delta=58h39m37.544164292s step=2m0s duration=6.223712488s status=200 limit=10000 returned_lines=0 throughput=12MB total_bytes=74MB total_bytes_structured_metadata=8.8MB lines_per_second=47106 total_lines=293176 post_filter_lines=293176 total_entries=10000 store_chunks_download_time=5.119186253s queue_time=7.985999ms splits=1 shards=8 query_referenced_structured_metadata=false pipeline_wrapper_filtered_lines=0 chunk_refs_fetch_time=67.754584ms cache_chunk_req=800 cache_chunk_hit=0 cache_chunk_bytes_stored=598086552 cache_chunk_bytes_fetched=0 cache_chunk_download_time=368.375µs cache_index_req=0 cache_index_hit=0 cache_index_download_time=0s cache_stats_results_req=1 cache_stats_results_hit=1 cache_stats_results_download_time=8.41µs cache_volume_results_req=0 cache_volume_results_hit=0 cache_volume_results_download_time=0s cache_result_req=0 cache_result_hit=0 cache_result_download_time=0s cache_result_query_length_served=0s cardinality_estimate=0 ingester_chunk_refs=0 ingester_chunk_downloaded=0 ingester_chunk_matches=0 ingester_requests=0 ingester_chunk_head_bytes=0B ingester_chunk_compressed_bytes=0B ingester_chunk_decompressed_bytes=0B ingester_post_filter_lines=0 congestion_control_latency=0s index_total_chunks=1122 index_post_bloom_filter_chunks=1122 index_bloom_filter_ratio=0.00 index_used_bloom_filters=false index_shard_resolver_duration=0s disable_pipeline_wrappers=false has_labelfilter_before_parser=false
2025-08-06T10:39:39.531+05:30 level=info ts=2025-08-06T05:09:39.487871005Z caller=metrics.go:237 component=frontend org_id=fake latency=slow query="sum by (level, detected_level) (count_over_time({service_name=\"ssai\"} |= `` | drop __error__[1h]))" query_hash=141503800 query_type=metric range_type=range length=29m59s start_delta=59h9m39.487852995s end_delta=58h39m40.487853118s step=1h0m0s duration=10.377923499s status=200 limit=100 returned_lines=0 throughput=607MB total_bytes=6.3GB total_bytes_structured_metadata=848MB lines_per_second=2722216 total_lines=28250954 post_filter_lines=28250954 total_entries=4 store_chunks_download_time=16.478401679s queue_time=5.50771ms splits=0 shards=8 query_referenced_structured_metadata=true pipeline_wrapper_filtered_lines=0 chunk_refs_fetch_time=76.719841ms cache_chunk_req=1899 cache_chunk_hit=0 cache_chunk_bytes_stored=1683178169 cache_chunk_bytes_fetched=0 cache_chunk_download_time=1.039788ms cache_index_req=0 cache_index_hit=0 cache_index_download_time=0s cache_stats_results_req=1 cache_stats_results_hit=1 cache_stats_results_download_time=5.104µs cache_volume_results_req=0 cache_volume_results_hit=0 cache_volume_results_download_time=0s cache_result_req=1 cache_result_hit=0 cache_result_download_time=8.033µs cache_result_query_length_served=0s cardinality_estimate=0 ingester_chunk_refs=0 ingester_chunk_downloaded=0 ingester_chunk_matches=0 ingester_requests=0 ingester_chunk_head_bytes=0B ingester_chunk_compressed_bytes=0B ingester_chunk_decompressed_bytes=0B ingester_post_filter_lines=0 congestion_control_latency=0s index_total_chunks=1899 index_post_bloom_filter_chunks=1899 index_bloom_filter_ratio=0.00 index_used_bloom_filters=false index_shard_resolver_duration=0s source=logvolhist disable_pipeline_wrappers=false has_labelfilter_before_parser=false
2025-08-06T10:39:40.231+05:30 level=info ts=2025-08-06T05:09:40.180580397Z caller=roundtrip.go:359 org_id=fake msg="executing query" type=range query="sum by (level, detected_level) (count_over_time({service_name=\"ssai\"} |= `` | drop __error__[1h]))" start=2025-08-02T18:00:00Z end=2025-08-03T17:00:00Z start_delta=83h9m40.180575384s end_delta=60h9m40.180575663s length=23h0m0s step=3600000 query_hash=141503800
2025-08-06T10:40:45.431+05:30 level=info ts=2025-08-06T05:10:45.393488513Z caller=metrics.go:237 component=frontend org_id=fake latency=slow query="sum by (level, detected_level) (count_over_time({service_name=\"ssai\"} |= `` | drop __error__[1h]))" query_hash=141503800 query_type=metric range_type=range length=23h0m0s start_delta=83h10m45.393467073s end_delta=60h10m45.393467204s step=1h0m0s duration=1m5.212606515s status=200 limit=100 returned_lines=0 throughput=4.4GB total_bytes=284GB total_bytes_structured_metadata=37GB lines_per_second=19066872 total_lines=1243400452 post_filter_lines=1243400452 total_entries=4 store_chunks_download_time=12m0.288678055s queue_time=175.751401ms splits=23 shards=368 query_referenced_structured_metadata=true pipeline_wrapper_filtered_lines=0 chunk_refs_fetch_time=4.302694283s cache_chunk_req=85364 cache_chunk_hit=32 cache_chunk_bytes_stored=73082618051 cache_chunk_bytes_fetched=5983215 cache_chunk_download_time=49.557595ms cache_index_req=0 cache_index_hit=0 cache_index_download_time=0s cache_stats_results_req=24 cache_stats_results_hit=24 cache_stats_results_download_time=129.781µs cache_volume_results_req=0 cache_volume_results_hit=0 cache_volume_results_download_time=0s cache_result_req=23 cache_result_hit=0 cache_result_download_time=95.517µs cache_result_query_length_served=0s cardinality_estimate=0 ingester_chunk_refs=0 ingester_chunk_downloaded=0 ingester_chunk_matches=0 ingester_requests=0 ingester_chunk_head_bytes=0B ingester_chunk_compressed_bytes=0B ingester_chunk_decompressed_bytes=0B ingester_post_filter_lines=0 congestion_control_latency=0s index_total_chunks=85364 index_post_bloom_filter_chunks=85364 index_bloom_filter_ratio=0.00 index_used_bloom_filters=false index_shard_resolver_duration=0s source=logvolhist disable_pipeline_wrappers=false has_labelfilter_before_parser=false
2025-08-06T10:40:45.831+05:30 level=info ts=2025-08-06T05:10:45.760505145Z caller=roundtrip.go:359 org_id=fake msg="executing query" type=range query="sum by (level, detected_level) (count_over_time({service_name=\"ssai\"} |= `` | drop __error__[1h]))" start=2025-08-01T18:00:00Z end=2025-08-02T17:00:00Z start_delta=107h10m45.76050037s end_delta=84h10m45.760500837s length=23h0m0s step=3600000 query_hash=141503800
2025-08-06T10:41:46.231+05:30 level=info ts=2025-08-06T05:11:46.137238441Z caller=metrics.go:237 component=frontend org_id=fake latency=slow query="sum by (level, detected_level) (count_over_time({service_name=\"ssai\"} |= `` | drop __error__[1h]))" query_hash=141503800 query_type=metric range_type=range length=23h0m0s start_delta=107h11m46.137215557s end_delta=84h11m46.13721568s step=1h0m0s duration=1m0.3764261s status=200 limit=100 returned_lines=0 throughput=4.4GB total_bytes=263GB total_bytes_structured_metadata=35GB lines_per_second=19505412 total_lines=1177667120 post_filter_lines=1177667120 total_entries=4 store_chunks_download_time=10m58.123142771s queue_time=175.835447ms splits=23 shards=368 query_referenced_structured_metadata=true pipeline_wrapper_filtered_lines=0 chunk_refs_fetch_time=3.45767226s cache_chunk_req=87328 cache_chunk_hit=129 cache_chunk_bytes_stored=68205918315 cache_chunk_bytes_fetched=58677775 cache_chunk_download_time=58.424978ms cache_index_req=0 cache_index_hit=0 cache_index_download_time=0s cache_stats_results_req=24 cache_stats_results_hit=24 cache_stats_results_download_time=104.218µs cache_volume_results_req=0 cache_volume_results_hit=0 cache_volume_results_download_time=0s cache_result_req=23 cache_result_hit=0 cache_result_download_time=94.588µs cache_result_query_length_served=0s cardinality_estimate=0 ingester_chunk_refs=0 ingester_chunk_downloaded=0 ingester_chunk_matches=0 ingester_requests=0 ingester_chunk_head_bytes=0B ingester_chunk_compressed_bytes=0B ingester_chunk_decompressed_bytes=0B ingester_post_filter_lines=0 congestion_control_latency=0s index_total_chunks=87328 index_post_bloom_filter_chunks=87328 index_bloom_filter_ratio=0.00 index_used_bloom_filters=false index_shard_resolver_duration=0s source=logvolhist disable_pipeline_wrappers=false has_labelfilter_before_parser=false
2025-08-06T10:41:46.931+05:30 level=info ts=2025-08-06T05:11:46.926211075Z caller=roundtrip.go:359 org_id=fake msg="executing query" type=range query="sum by (level, detected_level) (count_over_time({service_name=\"ssai\"} |= `` | drop __error__[1h]))" start=2025-07-31T18:00:00Z end=2025-08-01T17:00:00Z start_delta=131h11m46.926207448s end_delta=108h11m46.926207965s length=23h0m0s step=3600000 query_hash=141503800
2025-08-06T10:42:49.831+05:30 level=info ts=2025-08-06T05:12:49.811139295Z caller=metrics.go:237 component=frontend org_id=fake latency=slow query="sum by (level, detected_level) (count_over_time({service_name=\"ssai\"} |= `` | drop __error__[1h]))" query_hash=141503800 query_type=metric range_type=range length=23h0m0s start_delta=131h12m49.811110068s end_delta=108h12m49.811110191s step=1h0m0s duration=1m2.884631985s status=200 limit=100 returned_lines=0 throughput=4.4GB total_bytes=278GB total_bytes_structured_metadata=38GB lines_per_second=20133710 total_lines=1266100972 post_filter_lines=1266100972 total_entries=4 store_chunks_download_time=11m26.897681013s queue_time=503.662505ms splits=23 shards=368 query_referenced_structured_metadata=true pipeline_wrapper_filtered_lines=0 chunk_refs_fetch_time=5.605731355s cache_chunk_req=96229 cache_chunk_hit=1 cache_chunk_bytes_stored=72613362090 cache_chunk_bytes_fetched=608191 cache_chunk_download_time=57.692767ms cache_index_req=0 cache_index_hit=0 cache_index_download_time=0s cache_stats_results_req=24 cache_stats_results_hit=24 cache_stats_results_download_time=134.333µs cache_volume_results_req=0 cache_volume_results_hit=0 cache_volume_results_download_time=0s cache_result_req=23 cache_result_hit=0 cache_result_download_time=98.597µs cache_result_query_length_served=0s cardinality_estimate=0 ingester_chunk_refs=0 ingester_chunk_downloaded=0 ingester_chunk_matches=0 ingester_requests=0 ingester_chunk_head_bytes=0B ingester_chunk_compressed_bytes=0B ingester_chunk_decompressed_bytes=0B ingester_post_filter_lines=0 congestion_control_latency=0s index_total_chunks=96229 index_post_bloom_filter_chunks=96229 index_bloom_filter_ratio=0.00 index_used_bloom_filters=false index_shard_resolver_duration=0s source=logvolhist disable_pipeline_wrappers=false has_labelfilter_before_parser=false
2025-08-06T11:19:49.531+05:30 level=info ts=2025-08-06T05:49:49.482089139Z caller=roundtrip.go:414 org_id=fake msg="executing query" type=labels label= length=6h0m0s query=
2025-08-06T11:19:50.231+05:30 level=info ts=2025-08-06T05:49:50.146402913Z caller=metrics.go:292 component=frontend org_id=fake latency=fast query_type=labels splits=2 start=2025-08-05T23:49:41.519Z end=2025-08-06T05:49:41.519Z start_delta=6h0m8.627397884s end_delta=8.6273984s length=6h0m0s duration=664.111405ms status=200 label= query= query_hash=2166136261 total_entries=7 cache_label_results_req=0 cache_label_results_hit=0 cache_label_results_stored=0 cache_label_results_download_time=0s cache_label_results_query_length_served=0s
  2. Check Loki metrics and see if you see any error or delay on S3 operations.
    Ans: In the querier logs, some queries are marked with latency=slow, for example:
level=info ts=2025-08-07T03:50:20.016511512Z caller=metrics.go:237 component=querier org_id=fake latency=slow query="sum by (level,detected_level)(count_over_time({service_name=\"app\"} | drop __error__[1m]))" query_hash=3585720562 query_type=metric range_type=range length=59m0s start_delta=4h50m20.01646929s end_delta=3h51m20.01646954s step=1m0s duration=12.41129697s status=200 limit=100 returned_lines=0 throughput=54MB total_bytes=677MB total_bytes_structured_metadata=77MB lines_per_second=207124 total_lines=2570680 post_filter_lines=2570680 total_entries=191 store_chunks_download_time=4.354657942s queue_time=91.774µs splits=0 shards=0 query_referenced_structured_metadata=true pipeline_wrapper_filtered_lines=0 chunk_refs_fetch_time=5.570947ms cache_chunk_req=203 cache_chunk_hit=0 cache_chunk_bytes_stored=193827985 cache_chunk_bytes_fetched=0 cache_chunk_download_time=118.976µs cache_index_req=0 cache_index_hit=0 cache_index_download_time=0s cache_stats_results_req=0 cache_stats_results_hit=0 cache_stats_results_download_time=0s cache_volume_results_req=0 cache_volume_results_hit=0 cache_volume_results_download_time=0s cache_result_req=0 cache_result_hit=0 cache_result_download_time=0s cache_result_query_length_served=0s cardinality_estimate=0 ingester_chunk_refs=0 ingester_chunk_downloaded=0 ingester_chunk_matches=0 ingester_requests=0 ingester_chunk_head_bytes=0B ingester_chunk_compressed_bytes=0B ingester_chunk_decompressed_bytes=0B ingester_post_filter_lines=0 congestion_control_latency=0s index_total_chunks=203 index_post_bloom_filter_chunks=203 index_bloom_filter_ratio=0.00 index_used_bloom_filters=false index_shard_resolver_duration=0s source=logvolhist disable_pipeline_wrappers=false has_labelfilter_before_parser=false
  3. How big are your logs? Let’s say ballpark size per day.
    Ans:

    • Approximate daily logs:
      • Data Volume: 40–45 GiB/day
      • Line Count: ~466.7 million lines/day
  4. How many queriers do you usually run? What’s the time frame you want to be able to query from?
    Ans:

    • Querier Instances: ~100 queriers during performance testing
    • Query Windows Tested:
      • 1 day: ~1m 10s
      • 3, 5, 7 days: progressively slower — testing to understand impact

Additionally, we increased the chunk_target_size from 1.5 MB to 20 MB to reduce the number of PUT requests to the S3 bucket and minimize the number of chunk objects fetched during queries, as we suspect S3 read latency is contributing to performance issues. However, after applying this change, the querier pods became unresponsive and started crashing (BackOffRestarting), making it impossible to retrieve logs from the containers.
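
For reference, this is the setting we changed (the same value appears in the configMap above; 20971520 bytes = 20 MiB):

ingester:
  chunk_target_size: 20971520  # raised from the 1.5 MiB default (1572864)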

We have already referred to the official Loki query performance blog (link), but haven’t seen significant improvements.

Thanks!
