Testing a network failure between two AZs in a three-AZ Loki architecture

Hello folks,

We’re currently testing the resiliency of our Loki infrastructure hosted on AWS. Here is the configuration we use.

We made sure our ingesters and distributors are spread across three different AZs (and we did the same for the queriers).

The test simulates a network failure between AZ A and AZ B.

What happened is that each of our distributors only considers the ingester located in its own AZ healthy.

Distributor AZ A:

  • Ingester AZ A: OK
  • Ingester AZ B: KO
  • Ingester AZ C: KO

Distributor AZ B:

  • Ingester AZ A: KO
  • Ingester AZ B: OK
  • Ingester AZ C: KO

Distributor AZ C:

  • Ingester AZ A: KO
  • Ingester AZ B: KO
  • Ingester AZ C: OK

We assumed that in this scenario:

  • A can only see C (so we thought A and C should be healthy)
  • B can only see C (so we thought B and C should be healthy)
  • C can see both A and B (so we thought everything should be healthy)

It seems that each component must be able to reach an endpoint directly in order to consider it healthy. Is this test case simply not covered by Loki yet, or have we missed something?

Thanks in advance!

Here is the configuration file we use:

auth_enabled: true
server:
  http_listen_port: 3100
  grpc_listen_port: 9095
  grpc_server_max_recv_msg_size: 100000000
  grpc_server_max_send_msg_size: 100000000
  grpc_server_max_concurrent_streams: 1000
  http_server_write_timeout: 1m
  log_level: info
memberlist:
  abort_if_cluster_join_fails: false
  bind_port: 7946
  join_members: 
  - monitoring-loki-gossip-ring.monitoring.svc.cluster.local:7946
  max_join_backoff: 1m
  max_join_retries: 10
  min_join_backoff: 1s
distributor:
  ring: 
    kvstore:
      store: memberlist
ingester:
  chunk_idle_period: 15m
  chunk_encoding: snappy
  chunk_retain_period: 5m
  max_chunk_age: 1h
  max_transfer_retries: 0
  lifecycler:
    ring: 
      kvstore:
        store: memberlist
      replication_factor: 3
ingester_client:
  grpc_client_config:
    max_recv_msg_size: 67108864
  remote_timeout: 1s
querier:
  query_ingesters_within: 2h
query_range:
  split_queries_by_interval: 30m
  align_queries_with_step: true
  cache_results: true
  max_retries: 5
  results_cache:
    cache: 
      redis:
        endpoint: monitoring-loki-redis.monitoring.svc.cluster.local:6379
        expiration: 86400s
frontend:
  compress_responses: true
  max_outstanding_per_tenant: 200
  log_queries_longer_than: 5s
frontend_worker:
  frontend_address: monitoring-loki-query-frontend.monitoring.svc.cluster.local:9095
  parallelism: 2
  grpc_client_config:
    max_send_msg_size: 100000000
limits_config:
  ingestion_rate_strategy: global
  ingestion_rate_mb: 10
  ingestion_burst_size_mb: 20
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 168h
  max_streams_per_user: 0
  max_global_streams_per_user: 10000
  max_cache_freshness_per_query: 10m
schema_config:
  configs: 
  - from: "2021-01-06"
    index:
      period: 24h
      prefix: loki_index_
    object_store: s3
    schema: v11
    store: boltdb-shipper
storage_config:
  index_queries_cache_config: 
    redis:
      endpoint: monitoring-loki-redis.monitoring.svc.cluster.local:6379
      expiration: 86400s
  aws: 
    bucketnames: ****
    http_config:
      idle_conn_timeout: 90s
      insecure_skip_verify: false
      response_header_timeout: 0s
    insecure: false
    s3: s3://eu-west-3
    s3forcepathstyle: true
    sse_encryption: true
  boltdb_shipper: 
    active_index_directory: /data/index
    cache_location: /data/index_cache
    shared_store: s3
compactor:
  working_directory: /data/compactor
  shared_store: s3
chunk_store_config:
  chunk_cache_config: 
    redis:
      endpoint: monitoring-loki-redis.monitoring.svc.cluster.local:6379
      expiration: 86400s
table_manager:
  retention_deletes_enabled: false
  retention_period: 0s

Hey @romainbonvalot, there were some additions made to Cortex (which Loki uses as upstream for some core components) that should enable this to work better. However, the documentation and details on this have not trickled down to Loki yet.

Cortex Zone Awareness

Not everything in that doc is relevant to Loki; specifically, the parts about the store gateway and -distributor.shard-by-all-labels do not apply.

However you should be able to set zone_awareness_enabled to true:

ingester:
  chunk_idle_period: 15m
  chunk_encoding: snappy
  chunk_retain_period: 5m
  max_chunk_age: 1h
  max_transfer_retries: 0
  lifecycler:
    ring: 
      kvstore:
        store: memberlist
      replication_factor: 3
      zone_awareness_enabled: true

And then for each ingester add a flag:

-ingester.availability-zone=zoneA

(this can be done in config file too but then you would need a separate config file for each ingester)
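If you go the config-file route, each zone’s ingester config would carry its own zone. A sketch, assuming the availability_zone lifecycler field (the config-file counterpart of the flag above; zone names are illustrative):

ingester:
  lifecycler:
    availability_zone: zoneA   # zoneB / zoneC in the other two config files
    ring:
      kvstore:
        store: memberlist
      replication_factor: 3
      zone_awareness_enabled: true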

And I think then this should work.

One further consideration: Loki requires writes to succeed on (replication_factor / 2) + 1 ingesters, so with replication_factor=3 a write must reach 2 of 3 ingesters, meaning you can only tolerate one ingester being unavailable for writes to succeed. So with 3 AZs in your example, you can tolerate at most one AZ being unavailable at a time.

Please do respond with questions and updates if you are successful!

Thanks for your reply. It will take me some time to test, because I need to split my ingester StatefulSet into three separate StatefulSets for the test.

I’m aware that only one AZ’s downtime can be tolerated; that’s acceptable in our case. I have actually already tested a whole-AZ downtime, and it worked because we steered data placement with the gp2 storage class, which forces a pod to run in a single dedicated AZ. Combined with taints and node selectors, we were able to make placement deterministic.
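For reference, the per-AZ pinning described above could be sketched roughly like this (a hypothetical per-zone StatefulSet fragment; the names, taint key, and zone value are illustrative, not our actual manifests):

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: loki-ingester-a            # one StatefulSet per AZ
spec:
  serviceName: loki-ingester-a
  replicas: 1
  selector:
    matchLabels:
      app: loki-ingester-a
  template:
    metadata:
      labels:
        app: loki-ingester-a
    spec:
      nodeSelector:
        topology.kubernetes.io/zone: eu-west-3a   # pin the pod to one AZ
      tolerations:
      - key: dedicated             # hypothetical taint on the Loki nodes
        value: loki
        effect: NoSchedule
      containers:
      - name: ingester
        image: grafana/loki:2.1.0
        args: ["-config.file=/etc/loki/loki.yaml", "-target=ingester"]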

But that’s beside the point. My concern is that the solution you provided seems to cover the case of a whole-AZ failure, whereas what we simulated is a failure of the network between two AZs.

I will let you know how it goes, and thank you again for the quick reply :).

@ewelch So I followed your recommendation. I still have trouble, but in a different way.

  • The Cortex ring on each distributor is healthy (they all see all of the ingesters)

  • The distributor in AZ C reports that something is wrong with ingesters A and B (I don’t know whether to consider this normal, since it can reach them directly, but in any case the Cortex ring is OK):

      ts=2021-01-29T16:18:02.100444174Z caller=memberlist_logger.go:74 level=warn msg="Was able to connect to monitoring-loki-ingester-b-0-9ab2878d but other probes failed, network may be misconfigured"
      ts=2021-01-29T16:18:03.100592421Z caller=memberlist_logger.go:74 level=warn msg="Was able to connect to monitoring-loki-distributor-77b86cb6d6-hx6lr-d655e390 but other probes failed, network may be misconfigured"
      ts=2021-01-29T16:18:04.100746492Z caller=memberlist_logger.go:74 level=warn msg="Was able to connect to monitoring-loki-ingester-a-0-46523f55 but other probes failed, network may be misconfigured"
      ts=2021-01-29T16:18:06.10015135Z caller=memberlist_logger.go:74 level=warn msg="Was able to connect to monitoring-loki-querier-0-dead227f but other probes failed, network may be misconfigured"
    

And a lot of logs have been missing since 17:15 (when I ran my test). You can see that in the attached Grafana screenshot.

When you ran the test, what did you test specifically? Did you break network connectivity to one of the zones? (and which one?)

There is an interesting question I now have about how memberlist is configured and what happens when one of the zones becomes unavailable, which I will also ask my peers about.
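For whoever digs into this: memberlist’s failure detection is governed by timing knobs in the same memberlist block as in the config above. A sketch, assuming these keys from the memberlist config section (values are illustrative, not tuning recommendations):

memberlist:
  gossip_interval: 200ms             # how often gossip messages are sent
  gossip_nodes: 3                    # number of peers gossiped to per interval
  stream_timeout: 10s                # timeout for full-state sync connections
  retransmit_factor: 4               # how widely messages are retransmitted
  gossip_to_dead_nodes_time: 30s     # keep gossiping to dead nodes for this long
  dead_node_reclaim_time: 0s         # how soon a dead node's name can be reclaimed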

Here is the basic layout.

Loki write nodes:

  • Ingester
  • Distributor

Loki read nodes

  • Querier
  • Query frontend

When the failure is simulated, all nodes remain Ready from Kubernetes’ point of view, because technically they can all still reach the control plane. A tie-breaker that could kill either A or B in this situation would resolve it.


Hello @ewelch, by any chance have you been able to discuss this issue with your peers?