Testing a network failure between two AZs in a three-AZ Loki architecture

Hello folks,

We’re currently testing the resiliency of our Loki infrastructure hosted on AWS. Here is the configuration we use.

We made sure our ingesters and distributors are spread across three different AZs (and we did the same for the queriers).

The test simulates a network failure between AZ A and AZ B.

What happened is that each of our distributors only considers the ingester located in its own AZ healthy.

Distributor AZ A:

  • Ingester AZ A: OK
  • Ingester AZ B: KO
  • Ingester AZ C: KO

Distributor AZ B:

  • Ingester AZ A: KO
  • Ingester AZ B: OK
  • Ingester AZ C: KO

Distributor AZ C:

  • Ingester AZ A: KO
  • Ingester AZ B: KO
  • Ingester AZ C: OK

We assumed that in this scenario:

  • A can only see C (so we thought A and C should be healthy)
  • B can only see C (so we thought B and C should be healthy)
  • C can see both A and B (so we thought everything should be healthy)

It seems that each component must be able to reach an endpoint directly in order to consider it healthy. Is this test case simply not covered by Loki yet, or have we missed something?

Thanks in advance!

Here is the configuration file we use:

auth_enabled: true
server:
  http_listen_port: 3100
  grpc_listen_port: 9095
  grpc_server_max_recv_msg_size: 100000000
  grpc_server_max_send_msg_size: 100000000
  grpc_server_max_concurrent_streams: 1000
  http_server_write_timeout: 1m
  log_level: info
memberlist:
  abort_if_cluster_join_fails: false
  bind_port: 7946
  join_members: 
  - monitoring-loki-gossip-ring.monitoring.svc.cluster.local:7946
  max_join_backoff: 1m
  max_join_retries: 10
  min_join_backoff: 1s
distributor:
  ring: 
    kvstore:
      store: memberlist
ingester:
  chunk_idle_period: 15m
  chunk_encoding: snappy
  chunk_retain_period: 5m
  max_chunk_age: 1h
  max_transfer_retries: 0
  lifecycler:
    ring: 
      kvstore:
        store: memberlist
      replication_factor: 3
ingester_client:
  grpc_client_config:
    max_recv_msg_size: 67108864
  remote_timeout: 1s
querier:
  query_ingesters_within: 2h
query_range:
  split_queries_by_interval: 30m
  align_queries_with_step: true
  cache_results: true
  max_retries: 5
  results_cache:
    cache: 
      redis:
        endpoint: monitoring-loki-redis.monitoring.svc.cluster.local:6379
        expiration: 86400s
frontend:
  compress_responses: true
  max_outstanding_per_tenant: 200
  log_queries_longer_than: 5s
frontend_worker:
  frontend_address: monitoring-loki-query-frontend.monitoring.svc.cluster.local:9095
  parallelism: 2
  grpc_client_config:
    max_send_msg_size: 100000000
limits_config:
  ingestion_rate_strategy: global
  ingestion_rate_mb: 10
  ingestion_burst_size_mb: 20
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 168h
  max_streams_per_user: 0
  max_global_streams_per_user: 10000
  max_cache_freshness_per_query: 10m
schema_config:
  configs: 
  - from: "2021-01-06"
    index:
      period: 24h
      prefix: loki_index_
    object_store: s3
    schema: v11
    store: boltdb-shipper
storage_config:
  index_queries_cache_config: 
    redis:
      endpoint: monitoring-loki-redis.monitoring.svc.cluster.local:6379
      expiration: 86400s
  aws: 
    bucketnames: ****
    http_config:
      idle_conn_timeout: 90s
      insecure_skip_verify: false
      response_header_timeout: 0s
    insecure: false
    s3: s3://eu-west-3
    s3forcepathstyle: true
    sse_encryption: true
  boltdb_shipper: 
    active_index_directory: /data/index
    cache_location: /data/index_cache
    shared_store: s3
compactor:
  working_directory: /data/compactor
  shared_store: s3
chunk_store_config:
  chunk_cache_config: 
    redis:
      endpoint: monitoring-loki-redis.monitoring.svc.cluster.local:6379
      expiration: 86400s
table_manager:
  retention_deletes_enabled: false
  retention_period: 0s

Hey @romainbonvalot, there were some additions made to Cortex (which Loki uses as upstream for some core components) that should enable this to work better. However, the documentation and details on this have not trickled down to Loki yet.

Cortex Zone Awareness

Not everything in that doc is relevant to Loki; specifically, the parts about the store gateway and -distributor.shard-by-all-labels do not apply.

However you should be able to set zone_awareness_enabled to true:

ingester:
  chunk_idle_period: 15m
  chunk_encoding: snappy
  chunk_retain_period: 5m
  max_chunk_age: 1h
  max_transfer_retries: 0
  lifecycler:
    ring: 
      kvstore:
        store: memberlist
      replication_factor: 3
      zone_awareness_enabled: true

And then for each ingester add a flag:

-ingester.availability-zone=zoneA

(this can be done in config file too but then you would need a separate config file for each ingester)
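If you go the config-file route, each zone’s ingester config would carry its own zone. A sketch, assuming the availability_zone lifecycler field (the config-file counterpart of the flag above; zone names are illustrative):

ingester:
  lifecycler:
    availability_zone: zoneA   # zoneB / zoneC in the other two config files
    ring:
      kvstore:
        store: memberlist
      replication_factor: 3
      zone_awareness_enabled: true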

And I think then this should work.

One further consideration: Loki requires writes to succeed on (replication_factor / 2) + 1 ingesters, so with replication_factor=3 a write must reach 2 of 3 ingesters, meaning you can only tolerate one ingester being unavailable for writes to succeed. So with 3 AZs in your example, you can tolerate at most one AZ being unavailable at a time.

Please do respond with questions and updates if you are successful!

Thanks for your reply. It will take me some time to test, because I need to split my ingester StatefulSet into three separate StatefulSets for the test.

I’m aware that only one AZ’s downtime can be tolerated; that’s acceptable in our case. I have actually already tested a whole-AZ downtime, and it worked because we steered data placement with the gp2 storage class, which forces a pod to run in a single dedicated AZ. Combined with taints and node selectors, we were able to make placement deterministic.
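For reference, the per-AZ pinning described above could be sketched roughly like this (a hypothetical per-zone StatefulSet fragment; the names, taint key, and zone value are illustrative, not our actual manifests):

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: loki-ingester-a            # one StatefulSet per AZ
spec:
  serviceName: loki-ingester-a
  replicas: 1
  selector:
    matchLabels:
      app: loki-ingester-a
  template:
    metadata:
      labels:
        app: loki-ingester-a
    spec:
      nodeSelector:
        topology.kubernetes.io/zone: eu-west-3a   # pin the pod to one AZ
      tolerations:
      - key: dedicated             # hypothetical taint on the Loki nodes
        value: loki
        effect: NoSchedule
      containers:
      - name: ingester
        image: grafana/loki:2.1.0
        args: ["-config.file=/etc/loki/loki.yaml", "-target=ingester"]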

But that’s beside the point. My concern is that the solution you provided seems to cover the case of a whole-AZ failure, whereas what we simulated is a failure of the network between two AZs.

I will let you know how it goes, and thank you again for the quick reply :).

@ewelch So I followed your recommendation. I still have trouble, but in a different way.

  • The Cortex ring on each distributor is healthy (they all see all of the ingesters)

  • The distributor in AZ C reports that something is wrong with ingesters A and B (I don’t know whether to consider this normal, since it can reach them directly, but in any case the Cortex ring is OK):

      ts=2021-01-29T16:18:02.100444174Z caller=memberlist_logger.go:74 level=warn msg="Was able to connect to monitoring-loki-ingester-b-0-9ab2878d but other probes failed, network may be misconfigured"
      ts=2021-01-29T16:18:03.100592421Z caller=memberlist_logger.go:74 level=warn msg="Was able to connect to monitoring-loki-distributor-77b86cb6d6-hx6lr-d655e390 but other probes failed, network may be misconfigured"
      ts=2021-01-29T16:18:04.100746492Z caller=memberlist_logger.go:74 level=warn msg="Was able to connect to monitoring-loki-ingester-a-0-46523f55 but other probes failed, network may be misconfigured"
      ts=2021-01-29T16:18:06.10015135Z caller=memberlist_logger.go:74 level=warn msg="Was able to connect to monitoring-loki-querier-0-dead227f but other probes failed, network may be misconfigured"
    

And a lot of logs have been missing since 17:15 (when I ran my test). You can see that in the attached Grafana screenshot.

When you ran the test, what did you test specifically? Did you break network connectivity to one of the zones? (and which one?)

There is an interesting question I now have about how memberlist is configured and what happens when one of the zones becomes unavailable, which I will also ask my peers about.
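For whoever digs into this: memberlist’s failure detection is governed by timing knobs in the same memberlist block as in the config above. A sketch, assuming these keys from the memberlist config section (values are illustrative, not tuning recommendations):

memberlist:
  gossip_interval: 200ms             # how often gossip messages are sent
  gossip_nodes: 3                    # number of peers gossiped to per interval
  stream_timeout: 10s                # timeout for full-state sync connections
  retransmit_factor: 4               # how widely messages are retransmitted
  gossip_to_dead_nodes_time: 30s     # keep gossiping to dead nodes for this long
  dead_node_reclaim_time: 0s         # how soon a dead node's name can be reclaimed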

Here is the basic layout.

Loki write nodes:

  • Ingester
  • Distributor

Loki read nodes

  • Querier
  • Query frontend

When the failure is simulated, all nodes remain Ready from Kubernetes’ point of view, because technically they can all still reach the control plane. A tie-breaker that could kill either A or B in this situation would resolve it.


Hello @ewelch, by any chance have you been able to discuss this issue with your peers?