We run Loki via the grafana/loki Helm chart in simple-scalable mode and would like to support simple free-text searches over about 7 days of logs (~500 GB). Even very basic LogQL filter queries take 4–5 minutes or fail with "context deadline exceeded". We also tried microservices mode, but it was slightly slower for us. Based on Grafana's sizing docs, our resources should be sufficient for this volume. We're looking for help identifying misconfigurations, or for next steps to improve query performance.
Environment
- Helm chart version: 6.24.0
- Object storage: MinIO (S3-compatible), on-cluster
- Deployment mode: simple-scalable (also tried microservices)
- Log volume: ~500 GB over 7 days
- Sizing reference we used: https://grafana.com/docs/loki/latest/setup/size/
- Our values.yaml Loki config:
```yaml
monitoring:
  serviceMonitor:
    enabled: true
  selfMonitoring:
    enabled: false
    grafanaAgent:
      installOperator: false
rbac:
  namespaced: true
lokiCanary:
  enabled: false
test:
  enabled: false
read:
  resources:
    requests:
      memory: 15G
      cpu: 4
    limits:
      memory: 15G
      cpu: 20
backend:
  resources:
    requests:
      memory: 1.5G
      cpu: 400m
    limits:
      memory: 1.5G
      cpu: 2
  persistence:
    size: 1Gi
write:
  resources:
    requests:
      memory: 5G
      cpu: 1
    limits:
      memory: 5G
      cpu: 5
  persistence:
    size: 1Gi
minio:
  resources:
    requests:
      memory: 22G
      cpu: 4
    limits:
      memory: 22G
      cpu: 20
  enabled: true
  persistence:
    size: 2Ti
  metrics:
    serviceMonitor:
      enabled: true
chunksCache:
  resources:
    requests:
      memory: 25Gi
      cpu: 500m
    limits:
      memory: 25Gi
      cpu: 2.5
resultsCache:
  enabled: true
  resources:
    requests:
      memory: 2Gi
      cpu: 200m
    limits:
      memory: 2Gi
      cpu: 1000m
query_scheduler:
  max_outstanding_requests_per_tenant: 1024
  grpc_client_config:
    max_recv_msg_size: 104857600
    max_send_msg_size: 104857600
frontend_worker:
  grpc_client_config:
    max_recv_msg_size: 104857600
    max_send_msg_size: 104857600
  frontend_address: loki-read:9095
  parallelism: 10
  scheduler_address: loki-read:9095
  match_max_concurrent: true
ingester_client:
  grpc_client_config:
    max_recv_msg_size: 104857600
    max_send_msg_size: 104857600
loki:
  auth_enabled: false  # only used for multiple organizations, we don't need it
  analytics:
    reporting_enabled: false
    usage_stats_url: ""
  server:
    grpc_server_max_recv_msg_size: 104857600
    grpc_server_max_send_msg_size: 104857600
    http_server_read_timeout: 1800s
    http_server_write_timeout: 1800s
    http_server_idle_timeout: 1800s
  compactor:
    delete_request_cancel_period: 10m  # don't wait 24h before processing the delete request
    retention_enabled: true            # actually do the delete
    retention_delete_delay: 1h         # wait 1h before actually deleting stuff
    delete_request_store: s3
  limits_config:
    retention_period: 90d
    allow_structured_metadata: false
    # Query Limits
    tsdb_max_query_parallelism: 2048
    split_queries_by_interval: 15m
    query_timeout: 30m
    # Ingestion Limits
    max_streams_per_user: 5000
    max_global_streams_per_user: 5000
    ingestion_rate_mb: 500
    ingestion_burst_size_mb: 1000
    max_line_size: 1048576
    per_stream_rate_limit: 512MB
    per_stream_rate_limit_burst: 1024MB
  querier:
    max_concurrent: 16
  schemaConfig:
    configs:
      - from: "2025-01-01"
        index:
          period: 24h
          prefix: loki_tsdb_index_
        object_store: s3
        schema: v13
        store: tsdb
  frontend:
    tail_proxy_url: http://loki-read:3100
    compress_responses: true
    log_queries_longer_than: 5s
    max_outstanding_per_tenant: 2048
  query_range:
    align_queries_with_step: true
    cache_results: true
    max_retries: 10
gateway:
  httpSnippet: |
    proxy_read_timeout 600s;
    proxy_send_timeout 600s;
    proxy_connect_timeout 60s;
    client_body_timeout 600s;
    client_header_timeout 600s;
  nginxConfig:
    clientMaxBodySize: "100M"
  basicAuth:
    enabled: true
    existingSecret: loki-auth-secret
  resources:
    requests:
      memory: 512Mi
      cpu: 100m
    limits:
      memory: 512Mi
      cpu: 500m
  service:
    port: 443
  ingress:
    annotations:
      nginx.ingress.kubernetes.io/proxy-body-size: "50m"
      nginx.ingress.kubernetes.io/proxy-read-timeout: "600"
      nginx.ingress.kubernetes.io/proxy-send-timeout: "600"
      nginx.ingress.kubernetes.io/proxy-connect-timeout: "60"
    enabled: true
    hosts:
      - host: "loki-gateway.com"
        paths:
          - pathType: "Prefix"
            path: "/"
    tls:
      - hosts:
          - "loki-gateway.com"
```
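For reviewers' convenience, here is the query-path subset of the values above pulled into one place. These are the same values as in the full config, nothing new, and the nesting mirrors what we have today:

```yaml
# condensed view of the query-path settings from the full values above
query_scheduler:
  max_outstanding_requests_per_tenant: 1024
frontend_worker:
  parallelism: 10
  match_max_concurrent: true
loki:
  limits_config:
    split_queries_by_interval: 15m
    tsdb_max_query_parallelism: 2048
    query_timeout: 30m
  querier:
    max_concurrent: 16
  frontend:
    max_outstanding_per_tenant: 2048
    log_queries_longer_than: 5s
  query_range:
    align_queries_with_step: true
    cache_results: true
    max_retries: 10
```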
What happens
- Example query over 7 days: `{environment="my-env"} |= "80328591901" | json`
- Typical runtime: 4–5 minutes
- Often fails with: "context deadline exceeded"
- No OOMs currently (we initially had some in loki-read, the chunks cache, and MinIO; fixed by increasing resources)
What we expected
- We expected much faster results for simple free-text searches across 7 days, or at least consistent completion without timeouts.
- Our original idea was to use Loki for free-text searches much as we previously used Elasticsearch (where such queries over 30 days worked fine). We understand Loki is not Elasticsearch, but we're hoping to reach acceptable performance for this scope, or to learn what changes are needed.
What we tried
- Adjusted concurrency and various query settings (e.g., split_queries_by_interval from 5m up to 24h, as sketched below) without measurable improvement.
- Observed that both MinIO and loki-read appear to be working normally based on our dashboards.
- Verified there are no current OOM events.
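For concreteness, the split-interval experiments looked like the override below; the 1h shown here is just one of the variants (we moved it between 5m and 24h), not a recommendation:

```yaml
loki:
  limits_config:
    # varied between 5m and 24h across test runs; 1h shown as an example
    split_queries_by_interval: 1h
```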
Notable observations
- In the loki-read logs for interactive queries we see splits=0 and shards=0, while queries from our alert rules do show splits/shards. We couldn't find a configuration combination that enables splitting for user-initiated queries (see the sketch after this list for the settings we have been looking at).
- loki-read shows recurring errors during/after queries:

  ```
  level=error ts=2025-09-08T13:42:16.201676661Z caller=scheduler_processor.go:175 component=querier org_id=fake msg="error notifying scheduler about finished query" err=EOF addr=10.42.57.46:9095
  level=error ts=2025-09-08T13:42:16.203694204Z caller=scheduler_processor.go:111 component=querier msg="error processing requests from scheduler" err="rpc error: code = Canceled desc = context canceled" addr=10.42.57.46:9095
  level=error ts=2025-09-08T13:42:16.203752031Z caller=client.go:469 index-store=tsdb-2025-01-01 msg="client do failed for instance 10.42.78.205:9095" err="rpc error: code = Canceled desc = context canceled"
  ```
- MinIO transmit bandwidth peaks at around ~215 MB/s during queries.
- loki-read receive bandwidth peaks at around ~230 MB/s during queries. Both then drop to near zero after the spike.
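To make it easier to point out what we are missing, here is our current, possibly wrong, understanding of the settings that influence splitting/sharding for ad-hoc queries. Values are illustrative; split_queries_by_interval, tsdb_max_query_parallelism, and querier.max_concurrent are set in our values above, while max_query_parallelism and parallelise_shardable_queries are not, so we assume their defaults apply:

```yaml
# sketch only: the knobs we think matter for splitting/sharding of ad-hoc queries
loki:
  limits_config:
    split_queries_by_interval: 15m      # interval used to split long range queries into subqueries (we set this)
    max_query_parallelism: 32           # cap on subqueries run in parallel per query (default, not set by us)
    tsdb_max_query_parallelism: 2048    # TSDB-specific parallelism cap (we set this)
  query_range:
    parallelise_shardable_queries: true # query sharding in the frontend (default, not set by us)
  querier:
    max_concurrent: 16                  # subqueries a single querier runs concurrently (we set this)
```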
Labeling and ingestion
- Logs are sent by Fluentd. We use the labels "App" and "environment" and tried to follow labeling best practices; an example label set is sketched below.
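Purely for illustration (the App value below is hypothetical, not one of our real applications), a single stream from our side carries a label set roughly like this:

```yaml
# hypothetical example of one stream's labels; real App values differ per application
App: my-app
environment: my-env
```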
Thank you very much for any pointers or a configuration review. We suspect a misconfiguration around query splitting/sharding or the scheduler/frontend, but we're not sure where to look next.