Loki query performance very slow or timing out in simple‑scalable mode (~500 GB / 7 days)

We run Loki with the grafana/loki Helm chart in simple‑scalable mode. We’d like to support simple free‑text searches over about 7 days of logs (~500 GB). Even very basic LogQL filter queries take 4–5 minutes or fail with “context deadline exceeded”. We also tried the microservices mode, but it was slightly slower for us. Based on Grafana’s sizing docs, our resource sizing should be sufficient for this volume. We’re looking for help identifying misconfigurations or next steps to improve query performance.

Environment

  • Helm chart version: 6.24.0

  • Object storage: MinIO (S3 compatible) on-cluster

  • Deployment mode: simple‑scalable (also tried microservices)

  • Log volume: ~500 GB over 7 days

  • Sizing reference we used: https://grafana.com/docs/loki/latest/setup/size/

  • Our values.yaml Loki config:

    Values.yaml
    monitoring:
      serviceMonitor:
        enabled: true
      selfMonitoring:
        enabled: false
        grafanaAgent:
          installOperator: false
    
    rbac:
      namespaced: true
    
    lokiCanary:
      enabled: false
    
    test:
      enabled: false
    
    read:
      resources:
        requests:
          memory: 15G
          cpu: 4
        limits:
          memory: 15G
          cpu: 20
    
    backend:
      resources:
        requests:
          memory: 1.5G
          cpu: 400m
        limits:
          memory: 1.5G
          cpu: 2
      persistence:
        size: 1Gi
    
    write:
      resources:
        requests:
          memory: 5G
          cpu: 1
        limits:
          memory: 5G
          cpu: 5
      persistence:
        size: 1Gi
    
    
    minio:
      resources:
        requests:
          memory: 22G
          cpu: 4
        limits:
          memory: 22G
          cpu: 20
      enabled: true
      persistence:
        size: 2Ti
      metrics:
        serviceMonitor:
          enabled: true
    
    chunksCache:
      resources:
        requests:
          memory: 25Gi
          cpu: 500m
        limits:
          memory: 25Gi
          cpu: 2.5
    
    resultsCache:
      enabled: true
      resources:
        requests:
          memory: 2Gi
          cpu: 200m
        limits:
          memory: 2Gi
          cpu: 1000m
    
    query_scheduler:
      max_outstanding_requests_per_tenant: 1024
      grpc_client_config:
        max_recv_msg_size: 104857600
        max_send_msg_size: 104857600
    
    frontend_worker:
      grpc_client_config:
        max_recv_msg_size: 104857600
        max_send_msg_size: 104857600
      frontend_address: loki-read:9095
      parallelism: 10
      scheduler_address: loki-read:9095
      match_max_concurrent: true
    
    ingester_client:
      grpc_client_config:
        max_recv_msg_size: 104857600
        max_send_msg_size: 104857600
    
    loki:
      auth_enabled: false # only needed for multi-tenancy (multiple organizations), which we don't use
    
      analytics:
        reporting_enabled: false
        usage_stats_url: ""
    
      server:
        grpc_server_max_recv_msg_size: 104857600
        grpc_server_max_send_msg_size: 104857600
        http_server_read_timeout: 1800s
        http_server_write_timeout: 1800s
        http_server_idle_timeout: 1800s
    
      compactor:
        delete_request_cancel_period: 10m # don't wait 24h before processing the delete_request
        retention_enabled: true # actually do the delete
        retention_delete_delay: 1h # wait 1 hour before actually deleting data
        delete_request_store: s3
    
      limits_config:
        retention_period: 90d
        allow_structured_metadata: false
    
        # Query Limits
        tsdb_max_query_parallelism: 2048
        split_queries_by_interval: 15m
        query_timeout: 30m
    
        # Ingestion Limits
        max_streams_per_user: 5000
        max_global_streams_per_user: 5000
        ingestion_rate_mb: 500
        ingestion_burst_size_mb: 1000
        max_line_size: 1048576
        per_stream_rate_limit: 512MB
        per_stream_rate_limit_burst: 1024MB
    
      querier:
        max_concurrent: 16
    
      schemaConfig:
        configs:
            - from: "2025-01-01"
              index:
                period: 24h
                prefix: loki_tsdb_index_
              object_store: s3
              schema: v13
              store: tsdb
    
      frontend:
        tail_proxy_url: http://loki-read:3100
        compress_responses: true
        log_queries_longer_than: 5s
        max_outstanding_per_tenant: 2048
    
      query_range:
        align_queries_with_step: true
        cache_results: true
        max_retries: 10
    
    gateway:
      httpSnippet: |  
        proxy_read_timeout 600s;
        proxy_send_timeout 600s;
        proxy_connect_timeout 60s;
        client_body_timeout 600s;
        client_header_timeout 600s;
      nginxConfig:
        clientMaxBodySize: "100M"
      basicAuth:
        enabled: true
        existingSecret: loki-auth-secret
      resources:
        requests:
          memory: 512Mi
          cpu: 100m
        limits:
          memory: 512Mi
          cpu: 500m
      service:
        port: 443
      ingress:
        annotations:
          nginx.ingress.kubernetes.io/proxy-body-size: "50m"
          nginx.ingress.kubernetes.io/proxy-read-timeout: "600"
          nginx.ingress.kubernetes.io/proxy-send-timeout: "600"
          nginx.ingress.kubernetes.io/proxy-connect-timeout: "60"
        enabled: true
        hosts:
          - host: "loki-gateway.com"
            paths:
              - pathType: "Prefix"
                path: "/"
        tls:
          - hosts:
              - "loki-gateway.com"
    
    

What happens

  • Example query over 7 days: {environment="my-env"} |= "80328591901" | json

  • Typical runtime: 4–5 minutes

  • Often fails with: context deadline exceeded

  • No OOMs currently (we initially had some in loki-read, the chunks cache, and MinIO, which we fixed by increasing resources)

What we expected

  • We expected much faster results for simple free‑text searches across 7 days, or at least consistent completion without timeouts.

  • Our original idea was to use Loki for free‑text searches in much the same way we previously used Elasticsearch (where such queries over 30 days worked fine). We understand Loki is not Elasticsearch, but we’re hoping to reach acceptable performance for this scope or learn what changes are needed.

What we tried

  • Adjusted concurrency and various query settings (e.g., split_queries_by_interval from 5m up to 24h) without measurable improvement.

  • Observed that both MinIO and loki-read appear to work normally based on our dashboards.

  • Verified no current OOM events.

Notable observations

  1. In loki-read logs for interactive queries, we see splits=0 and shards=0. However, queries from our alert rules do show splits/shards. We couldn’t find a configuration combination that enables splitting for user-initiated queries.

  2. loki-read shows recurring errors during/after queries:

    • Logs:
      level=error ts=2025-09-08T13:42:16.201676661Z caller=scheduler_processor.go:175 component=querier org_id=fake msg="error notifying scheduler about finished query" err=EOF addr=10.42.57.46:9095
      level=error ts=2025-09-08T13:42:16.203694204Z caller=scheduler_processor.go:111 component=querier msg="error processing requests from scheduler" err="rpc error: code = Canceled desc = context canceled" addr=10.42.57.46:9095
      level=error ts=2025-09-08T13:42:16.203752031Z caller=client.go:469 index-store=tsdb-2025-01-01 msg="client do failed for instance 10.42.78.205:9095" err="rpc error: code = Canceled desc = context canceled"
  3. MinIO transmit bandwidth peaks at ~215 MB/s during queries.

  4. loki-read receive bandwidth peaks at ~230 MB/s during queries.
    Both then drop to near‑zero after the spike.

Labeling and ingestion

  • Logs are sent by Fluentd. We use labels “App” and “environment” and tried to follow labeling best practices.

Thank you very much for any pointers or configuration review. We suspect a misconfiguration around query splitting/sharding or scheduler/frontends, but we’re not sure where to look next.

Please share your Loki configuration. Also, make sure the query frontend is configured properly, otherwise query splitting doesn’t work as intended (queries are split but not distributed without the query frontend). See the Query frontend example page in the Grafana Loki documentation.
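
For reference, the settings that page covers look roughly like this. Treat it as a sketch only: the option names come from the Loki configuration reference, the scheduler address is a placeholder, and where exactly these keys go depends on your Helm values layout.

loki:
  frontend_worker:
    # Point the querier workers at the scheduler (or frontend) so that the
    # split sub-queries are actually fanned out; placeholder address below.
    scheduler_address: <scheduler-or-read-service>:9095
  query_range:
    # Split long range queries and shard the resulting sub-queries.
    align_queries_with_step: true
    cache_results: true
    parallelise_shardable_queries: true
  limits_config:
    # Interval for time-based splitting and the per-query fan-out cap.
    split_queries_by_interval: 30m
    max_query_parallelism: 32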

Hey, thanks for your answer.

My Loki configuration is the values.yaml I already included in the original post above.

I don’t have distinct query frontends in the simple‑scalable deployment mode; they are included in each loki-read pod. Splitting/sharding is working for our alert queries, as I can see in the logs, but not for user queries in Grafana.

Switching to the microservices deployment mode did not help with the slow performance.

While the query frontend is part of the read target in simple‑scalable mode, you still need to configure it. You do not need to switch to microservices mode.
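
As a rough illustration, something like the following is the kind of wiring I mean. It is a sketch, not a drop-in fix: the option names are taken from the Loki configuration reference, and the loki-read:9095 address is simply reused from your own frontend_worker settings.

loki:
  frontend:
    # Runs inside the read pods in simple-scalable mode, but still needs to
    # know where the scheduler is so interactive queries are enqueued there.
    scheduler_address: loki-read:9095
  frontend_worker:
    # Typically only one of frontend_address / scheduler_address is set;
    # the workers pull split sub-queries from the scheduler.
    scheduler_address: loki-read:9095
  querier:
    # Number of sub-queries each read pod processes in parallel.
    max_concurrent: 16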