Inconsistent query responses when querying ingesters

gilesr · June 22, 2021, 10:48am

When querying our Loki cluster, we’re seeing inconsistent results for queries that hit the ingesters (queries for the most recently ingested logs that haven’t yet been written to the chunk store). We have the querier configured to query the ingesters for the last 2 hours (query_ingesters_within: 2h). Any query inside this time range seems to jump between different subsets of results: run a query and get one set of logs, run the same query again and get a different set, run it again and get the first set back.

I’m not sure if this is expected behaviour, or a problem with how we have configured Loki. How does the querier work under the hood when querying ingesters? Would it send the query to all ingesters or only a subset? We don’t see the same behaviour for queries covering older time ranges that don’t hit the ingesters.

Our cluster is set up as follows:

We used the production/docker example in the grafana/loki GitHub repository (main branch) as a starting point for our cluster setup. We are currently running (in AWS ECS):

3 Loki servers, all enabled with distributor, ingester, querier
Memberlist for service discovery
Boltdb-shipper for the index
S3 for chunk/index storage
A separate instance running a query frontend
Redis for index query/chunk/query results caches

Our Loki instances have the following configuration (the ${…} placeholders are populated at deployment time):

auth_enabled: false

http_prefix:

server:
  http_listen_address: 0.0.0.0
  grpc_listen_address: 0.0.0.0
  http_listen_port: ${loki_http_port}
  grpc_listen_port: ${loki_grpc_port}
  log_level: ${loki_log_level}

memberlist:
  join_members:
    - ${loki_sd_dns}
  abort_if_cluster_join_fails: false
  max_join_backoff: 1m
  max_join_retries: 10
  min_join_backoff: 1s
  dead_node_reclaim_time: 30s
  gossip_to_dead_nodes_time: 15s
  left_ingesters_timeout: 30s
  bind_addr: ['0.0.0.0']
  bind_port: ${loki_bind_port}

limits_config:
  ingestion_rate_strategy: local
  ingestion_rate_mb: 10
  ingestion_burst_size_mb: 20
  max_streams_per_user: 0

ingester:
  lifecycler:
    join_after: 60s
    final_sleep: 0s
    ring:
      replication_factor: 3
      heartbeat_timeout: 60s
      kvstore:
        store: memberlist
  chunk_retain_period: 30s
  chunk_idle_period: 15m
  chunk_block_size: 262144
  chunk_target_size: 1536000
  max_transfer_retries: 0
  wal:
    enabled: true
    dir: /loki/wal
    flush_on_shutdown: true
    replay_memory_ceiling: 1GB

distributor:
  ring:
    kvstore:
      store: memberlist

schema_config:
  configs:
  - from: 2021-05-01
    store: boltdb-shipper
    object_store: s3
    schema: v11
    index:
      prefix: loki_index_
      period: 24h

storage_config:
  aws:
    s3: ${loki_s3_bucket}
    sse_encryption: true
    insecure: false
    s3forcepathstyle: true
  boltdb_shipper:
    shared_store: s3
    active_index_directory: /loki/index
    cache_location: /loki/boltdb-cache
  index_cache_validity: 14m
  index_queries_cache_config:
    redis:
      endpoint: ${loki_redis_endpoint}
      timeout: 1s
      db: 1

chunk_store_config:
  max_look_back_period: 8736h
  chunk_cache_config:
    redis:
      endpoint: ${loki_redis_endpoint}
      timeout: 1s
      db: 2

table_manager:
  retention_deletes_enabled: true
  retention_period: 8736h

query_range:
  # make queries more cache-able by aligning them with their step intervals
  align_queries_with_step: true
  max_retries: 5
  # parallelize queries in 15min intervals
  split_queries_by_interval: 15m
  parallelise_shardable_queries: true
  cache_results: true
  results_cache:
    cache:
      redis:
        endpoint: ${loki_redis_endpoint}
        timeout: 1s
        db: 0

frontend:
  log_queries_longer_than: 5s
  compress_responses: true
  tail_proxy_url: ${loki_query_backend_url}

frontend_worker:
  frontend_address: ${loki_query_local_dns}:${loki_grpc_port}
  grpc_client_config:
    max_send_msg_size: 1.048576e+08
  
querier:
  query_ingesters_within: 2h

sandeeps · June 24, 2021, 8:47am

One problem I suspect is that your chunk_retain_period (30s) is less than index_cache_validity (14m), when it should be the other way around. With these settings, queriers can keep serving stale index entries from the cache for up to 14 minutes after the ingesters have already dropped the flushed chunks from memory, so the results come back incomplete. We internally set them to 6 minutes and 5 minutes respectively. You can reduce both to lower values as long as you keep the same proportion.
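
For reference, a minimal sketch of what that change could look like against the configuration posted above, assuming the 6m and 5m values mentioned here. Only the two affected keys are shown; everything else in those sections is assumed to stay unchanged.

```yaml
ingester:
  # Keep flushed chunks in ingester memory longer than the index cache stays valid
  chunk_retain_period: 6m

storage_config:
  # Should be shorter than ingester.chunk_retain_period
  index_cache_validity: 5m
```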

gilesr · June 30, 2021, 7:25am

Despite making the above config changes, I still see the same issue, with search results jumping around on refresh.

gilesr · June 30, 2021, 7:30am

Actually, forget that. I think that was due to two different sets of ingesters being present in the ring during a deployment. Once the older set dropped out, we no longer saw the issue.
