Inconsistent results from queriers

Hello there!

I am experiencing issues similar to those discussed here.

Sadly that solution does not help me.

We have a setup that basically runs 2 instances of Loki, each with a distributor, querier and ingester. It is modeled on the official docker compose and uses memberlist. There is one query frontend and no external caching.

The query result is inconsistent depending on which backend querier is selected.
On one querier, results from the last few minutes are present; on the other they are missing. This happens both when querying a querier directly and when going through the load balancer.
To be more precise, I get a different subset of the complete result depending on which querier is selected.

I have attached my configuration.

Could someone please point me in a direction I should investigate next?

target: all

server:
  http_listen_port: 3000
  grpc_listen_port: 3001
  log_level: debug
  http_server_read_timeout: 300s
  http_server_write_timeout: 300s
  http_server_idle_timeout: 120s

memberlist:
  join_members:
    - ${JOIN_MEMBER_URL}
  dead_node_reclaim_time: 30s
  gossip_to_dead_nodes_time: 15s
  left_ingesters_timeout: 30s
  bind_port: 7946

schema_config:
  configs:
    - from: 2020-05-15
      store: boltdb-shipper
      object_store: s3
      schema: v11
      index:
        prefix: loki_index_
        period: 24h # Optimal size for boltdb

storage_config:
  aws:
    s3: ${S3_URL}
    insecure: false
    sse_encryption: true

  boltdb_shipper:
    active_index_directory: /tmp/loki_index/index
    shared_store: s3
    cache_location: /tmp/loki_index/boltdb-cache
  index_cache_validity: 5m

chunk_store_config:
  max_look_back_period: 4320h # 180 days

compactor:
  working_directory: /tmp/loki_compactor
  shared_store: s3

ingester:
  lifecycler:
    ring:
      kvstore:
        store: memberlist
      replication_factor: 2
      heartbeat_timeout: 1m
    num_tokens: 128
    heartbeat_period: 5s
    min_ready_duration: 1m
    join_after: 60s
    observe_period: 5s
    final_sleep: 30s
  chunk_retain_period: 6m

limits_config:
  reject_old_samples: true
  reject_old_samples_max_age: 168h

query_range:
  align_queries_with_step: true
  max_retries: 5
  split_queries_by_interval: 15m
  parallelise_shardable_queries: true
  cache_results: true

  results_cache:
    cache:
      enable_fifocache: true
      fifocache:
        size: 1024
        validity: 24h

frontend:
  log_queries_longer_than: 5s
  downstream_url: ${DOWNSTREAM_LOKI_URL}
  tail_proxy_url: ${DOWNSTREAM_LOKI_URL}
  compress_responses: true

querier:
  query_ingesters_within: 2h
  query_timeout: 5m

Forgot to mention that we are running Loki 2.3.

Ok, so things get stranger. Looking at the query stats, I see one of these 2 sets of stats depending on which querier is hit:
[image: query stats from one querier]
or
[image: query stats from the other querier]

My interpretation is that both ingesters are queried in both cases, but one time I get results back and the other time I do not.

This makes me wonder whether it is a latency problem. But then again, I would expect the querier to wait for answers from the ingesters if they can be reached.

What might also be interesting is that as soon as I go to ‘Live’ all results from all nodes are shown.

If I leave my test bed alone, I have noticed that after about 45 minutes the results are in sync. That is pretty much the chunk_idle_period (30m) plus the 15m boltdb-shipper index sync. Might be a coincidence though.

Maybe my interpretation is wrong. Anyone got any ideas?

So I figured it out.

The problem was that we are running in AWS Fargate, and in Fargate eth0 has a private IP address that basically resolves to localhost.

In the ring, both nodes registered with that address, so whenever a query came in, each node asked itself twice for results.
Each node therefore reported asking 2 ingesters but only returned the results of one.

Setting interface_names in the ingester config and instance_interface_names in the distributor config solved the issue.
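
For anyone hitting the same thing, here is a rough sketch of what those two settings could look like in a config like the one above. The interface name eth1 is only an assumption for illustration; use whichever interface actually carries the routable task IP in your environment.

```yaml
# Sketch only: "eth1" is an assumed interface name, not taken from the
# original setup. Replace it with the interface that has the routable IP.
ingester:
  lifecycler:
    # Interfaces the ingester looks at to pick the address it registers in the ring.
    interface_names:
      - eth1

distributor:
  ring:
    # Same idea for the distributor's own ring entry.
    instance_interface_names:
      - eth1
```

After a restart, the ring status page should show a distinct address for each instance rather than the same one twice.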

I now get consistent results across all searches.

Well, looking at the ring sooner would have helped…