Loki query too slow

Hello!

I’m not very proficient in English, so please excuse the use of a translation tool.

I’m currently using Datadog, but due to the high cost of log ingestion, I’m planning to switch to Loki. Since I haven’t worked with Kubernetes before, I built the system on ECS with Docker and adopted Loki’s simple scalable deployment (SSD) mode.

In Datadog, about 1 TB of logs are ingested per day, and roughly 250 GB of that is moved daily to the Grafana Loki system. While both the write servers and backend servers are performing well, I’m quite troubled by the poor performance of the read servers.

Write Server Specifications

  • EC2: c7g.xlarge (20 instances)
  • 2 vCPU and 3.75 GB memory per task, 40 tasks in total

Log Traffic

  • 250 GB per day (with plans to increase to 1 TB in the future)

What I Want to Achieve

  • To search for fields like “user” within 10 seconds over a 250 GB (24-hour range) dataset
  • To search on fields other than “user” (which are hard to specify in advance) within a comparable time limit over a 250 GB (24-hour) dataset

In Datadog, whether I search for “user” over a 24- or 48-hour range, or even for information that isn’t usually searched, the results come back within 10 seconds.

On my servers, however, I have to narrow the time range to about 30 minutes to search for “user”, and if I try a longer range, the server sometimes goes down and returns a 502 error.

Are my write infrastructure specifications insufficient? Or would my requirements be better served by an ELK stack? I started with Grafana and Loki to reduce costs and to try Tempo, but I’m worried the expenses might end up much higher than expected. I’m using spot EC2 instances for about 80% of the write servers, and I’m curious what specifications others are running.

Just in case, I’m also sharing my Loki configuration. Thank you for reading this long message.

auth_enabled: true 

server:
  http_listen_port: 3100  
  log_level: warn  
  grpc_listen_port: 9095  
  grpc_server_max_recv_msg_size: 67108864 
  grpc_server_max_send_msg_size: 67108864

memberlist:
  join_members: ****
  dead_node_reclaim_time: 30s
  gossip_to_dead_nodes_time: 15s
  left_ingesters_timeout: 30s
  bind_addr: ['0.0.0.0']
  bind_port: 7946
  gossip_interval: 2s
  rejoin_interval: 1m 

common:
  path_prefix: /loki 
  replication_factor: 2  
  compactor_address: ****
  ring:
    kvstore:
      store: memberlist

ingester:
  lifecycler:
    ring:
      kvstore:
        store: memberlist 
  chunk_idle_period: 15m 
  chunk_retain_period: 1m 
  autoforget_unhealthy: true
  wal:
    flush_on_shutdown: true

schema_config:
  configs:
    - from: 2023-01-01 
      store: tsdb  
      object_store: aws  
      schema: v13  
      index:
        prefix: loki_index_  
        period: 24h  

storage_config:
  tsdb_shipper:
    active_index_directory: /loki/index
    cache_location: /loki/index_cache
    cache_ttl: 24h        
  aws:
    bucketnames: ****
    region: ****

limits_config:
  reject_old_samples: true 
  reject_old_samples_max_age: 168h 
  volume_enabled: true  
  allow_structured_metadata: true
  max_line_size_truncate: true 
  split_queries_by_interval: 30m 
  tsdb_max_query_parallelism: 512 

querier:
  max_concurrent: 4 

compactor:
  working_directory: /loki/compactor 

frontend:
  # (frontend configuration omitted)

Loki query performance largely comes down to distribution. You’ll want to make sure your query frontend is properly configured and that query splitting is enabled. See Query frontend example | Grafana Loki documentation
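
For reference, a minimal sketch of the read-path settings that drive splitting and parallelism; the values below are illustrative assumptions, not recommendations tuned to your hardware:

frontend:
  # example value: how many split subqueries a single tenant may have queued at once
  max_outstanding_per_tenant: 2048

query_range:
  # shard subqueries across queriers and cache partial results
  parallelise_shardable_queries: true
  cache_results: true
  results_cache:
    cache:
      embedded_cache:
        enabled: true
        max_size_mb: 512  # example size

limits_config:
  # example cap on how many subqueries of one query may run at the same time
  max_query_parallelism: 64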

Thank you for your reply!

The log below from the read server appears to come from the query-frontend, and seeing “splits=24 shards=27” suggests that splitting is working. When I actually ran the query, CPU usage on all my read servers momentarily hit 100%, which indicates that requests are being sent to each querier in parallel.

level=info
ts=2025-03-20T05:42:09.255467517Z
caller=metrics.go:237
component=frontend
org_id=****
traceID=5e7202074a9193e6
latency=slow
query="{service_name=\"****\"} | json | logfmt | drop __error__, __error_details__ "
query_hash=1983336782
query_type=limited
range_type=range
length=24h0m0s
start_delta=24h1m3.38145054s
end_delta=1m3.38145068s
step=5m45s
duration=23.949102843s
status=200
limit=100
returned_lines=0
throughput=398kB
total_bytes=9.5MB
total_bytes_structured_metadata=24kB
lines_per_second=19
total_lines=474
post_filter_lines=474
total_entries=100
store_chunks_download_time=509.833717ms
queue_time=6.049993ms
splits=24
shards=27
query_referenced_structured_metadata=false
pipeline_wrapper_filtered_lines=0
chunk_refs_fetch_time=14.296353ms
cache_chunk_req=108
cache_chunk_hit=0
cache_chunk_bytes_stored=83627715
cache_chunk_bytes_fetched=0
cache_chunk_download_time=67.208µs
cache_index_req=0
cache_index_hit=0
cache_index_download_time=0s
cache_stats_results_req=24
cache_stats_results_hit=24
cache_stats_results_download_time=183.234µs
cache_volume_results_req=0
cache_volume_results_hit=0
cache_volume_results_download_time=0s
cache_result_req=0
cache_result_hit=0
cache_result_download_time=0s
cache_result_query_length_served=0s
cardinality_estimate=0
ingester_chunk_refs=0
ingester_chunk_downloaded=0
ingester_chunk_matches=0
ingester_requests=21
ingester_chunk_head_bytes=0B
ingester_chunk_compressed_bytes=0B
ingester_chunk_decompressed_bytes=0B
ingester_post_filter_lines=0
congestion_control_latency=0s
index_total_chunks=0
index_post_bloom_filter_chunks=0
index_bloom_filter_ratio=0.00
index_used_bloom_filters=false
index_shard_resolver_duration=0s
disable_pipeline_wrappers=false
has_labelfilter_before_parser=false

The example in the link you provided specifies frontend_address. Does that mean the queuing functionality of the query-scheduler cannot be used? I have also confirmed that queries are split according to the configured split_queries_by_interval: 30m and executed by the queriers, as the querier log further below shows.
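
If I understand the documentation correctly, the querier workers can be pointed either at the frontend or at a dedicated query-scheduler, roughly like this (two separate alternatives; the hostnames are just placeholders, not my actual setup):

# Option A: pull mode against the frontend; each querier's worker dials the query-frontend
frontend_worker:
  frontend_address: query-frontend.example.internal:9095  # placeholder host

# Option B: a dedicated query-scheduler; the frontend enqueues to it and the workers pull from it
frontend:
  scheduler_address: query-scheduler.example.internal:9095  # placeholder host
frontend_worker:
  scheduler_address: query-scheduler.example.internal:9095  # placeholder host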

level=info
ts=2025-03-19T08:32:13.649205742Z
caller=metrics.go:237
component=querier
org_id=****
traceID=3d731da622916980
latency=slow
query="{service_name=\"****\"} | json | logfmt | user_targetCharacter_id=\"302265\" | drop __error__,__error_details__"
query_hash=597642425
query_type=filter
range_type=range
length=30m0s
start_delta=11h32m13.649191474s
end_delta=11h2m13.649191712s
step=1m0s
duration=25.878411532s
status=200
limit=1000
returned_lines=11
throughput=52MB
total_bytes=1.3GB
total_bytes_structured_metadata=3.7MB
lines_per_second=2852
total_lines=73810
post_filter_lines=21
total_entries=11
store_chunks_download_time=2.240888994s
queue_time=7.18048957s
splits=0
shards=0
query_referenced_structured_metadata=false
pipeline_wrapper_filtered_lines=0
chunk_refs_fetch_time=371.656µs
cache_chunk_req=200
cache_chunk_hit=0
cache_chunk_bytes_stored=268551428
cache_chunk_bytes_fetched=0
cache_chunk_download_time=107.487µs
cache_index_req=0
cache_index_hit=0
cache_index_download_time=0s
cache_stats_results_req=0
cache_stats_results_hit=0
cache_stats_results_download_time=0s
cache_volume_results_req=0
cache_volume_results_hit=0
cache_volume_results_download_time=0s
cache_result_req=0
cache_result_hit=0
cache_result_download_time=0s
cache_result_query_length_served=0s
cardinality_estimate=0
ingester_chunk_refs=0
ingester_chunk_downloaded=0
ingester_chunk_matches=0
ingester_requests=0
ingester_chunk_head_bytes=0B
ingester_chunk_compressed_bytes=0B
ingester_chunk_decompressed_bytes=0B
ingester_post_filter_lines=0
congestion_control_latency=0s
index_total_chunks=0
index_post_bloom_filter_chunks=0
index_bloom_filter_ratio=0.00
index_used_bloom_filters=false
index_shard_resolver_duration=0s
disable_pipeline_wrappers=false
has_labelfilter_before_parser=false

In the querier logs, splits=0 shards=0 appears in every case, and I wonder whether this is normal.

Also, for query_type=metric there are cases where store_chunks_download_time exceeds 10 seconds. A chunk cache might help with that, but given that the system processes 1 TB per day, I’m still weighing whether a chunk cache would be overkill. (If you think it would be worthwhile, please share your thoughts.)
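
If I were to add one, I assume the configuration would look roughly like this (the memcached endpoint is just a placeholder, not something I am running today):

chunk_store_config:
  chunk_cache_config:
    memcached_client:
      # placeholder endpoint, e.g. a small memcached/ElastiCache cluster
      addresses: dns+memcached.example.internal:11211
      timeout: 500ms
      consistent_hash: true
    memcached:
      batch_size: 256
      parallelism: 10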

Even taking each of those factors into account, the durations are too high, so is increasing the number of queriers the only way forward?

I apologize for asking so many questions at once, and I would appreciate your feedback.

  1. Just because your queries are being split doesn’t mean they are being distributed.
  2. Query frontend can operate in push or pull mode; you can see the difference in the documentation I linked above. I personally use the pull mode (which is done by setting frontend_address); see the sketch after this list.
  3. You might also consider changing split_queries_by_interval to 1h.
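
As a rough sketch of points 2 and 3, assuming your queriers can reach the query frontend (the address is only an example):

frontend_worker:
  # pull mode: each querier registers with the frontend and pulls split subqueries from its queue
  frontend_address: query-frontend.example.internal:9095  # example address

limits_config:
  split_queries_by_interval: 1h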

Again, I would confirm that your queries are actually being distributed to all queriers. If you are using pull mode, the query frontend exposes a metric, loki_query_frontend_queue_length, that you can use to tell whether queries are being queued after splitting and how quickly the queue drains. There are also other metrics you can use to determine the number of workers connected.

Couple of other things:

  1. Can you share your query frontend configuration?
  2. Can you use logcli to perform a query against your Loki cluster? logcli will give you a performance overview.
  3. How many queriers do you have, and how much CPU and memory are they allocated?