We’re currently testing the resiliency of our Loki infrastructure hosted on AWS. Here is the configuration we use.
We made sure that the ingesters and distributors are each spread across 3 different AZs (and we did the same for the queriers).
The test is a simulation of a network failure between AZ A and AZ B.
What ended up happening is that each of our distributors only considers healthy the ingester located in its own AZ:
Distributor AZ A:
Ingester AZ A: OK
Ingester AZ B: KO
Ingester AZ C: KO
Distributor AZ B:
Ingester AZ A: KO
Ingester AZ B: OK
Ingester AZ C: KO
Distributor AZ C:
Ingester AZ A: KO
Ingester AZ B: KO
Ingester AZ C: OK
We assumed that in that scenario:
A can only see C (so we thought that A and C should be healthy)
B can only see C (so we thought that B and C should be healthy)
C can see both A and B (so we thought that everything should be healthy)
It seems that every component must be able to reach an endpoint itself in order to consider it healthy. Is this test case perhaps not covered by Loki yet? Have we missed something?
Hey @romainbonvalot, there were some additions made to Cortex (which Loki uses as the upstream for some of its core components) that should enable this to work better. However, the documentation and details on this have not trickled down to Loki yet.
(This can be done in the config file too, but then you would need a separate config file for each ingester.)
And I think then this should work.
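If it helps, here is a rough sketch of the zone-awareness settings I have in mind; the key names come from the Cortex ring/lifecycler config that Loki vendors, so please double-check them against your Loki version:

ingester:
  lifecycler:
    # Must differ per ingester group; the zone name here is just an example.
    availability_zone: eu-west-1a
    ring:
      replication_factor: 3
      # Spread the replicas of each stream across zones instead of picking
      # any 3 ingesters.
      zone_awareness_enabled: true

The availability_zone value is the part that has to differ per ingester, which is why you would otherwise need a separate config file (or per-process override) for each one.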
One further consideration: Loki only acknowledges a write once it has succeeded on (replication_factor / 2) + 1 ingesters (integer division), so with replication_factor=3 you can only tolerate one ingester being unavailable for writes to succeed. So with 3 AZs in your example, you could tolerate at most one AZ being unavailable at a time.
Please do respond with questions and updates if you are successful!
Thanks for your reply. It will take me some time to test, because I need to split my ingester StatefulSet into 3 separate StatefulSets (one per AZ) for the test.
I’m aware that only one AZ can be down at a time; that’s acceptable in our case. However, I have already tested the case of a whole AZ going down, and it worked because we constrained data placement with the gp2 storage class, which forces a pod to run only in a dedicated AZ. Combined with taints and node selectors, we were able to make that placement deterministic.
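For context, each zone-pinned ingester StatefulSet looks roughly like this (a simplified sketch; the names, zone, and taint key are just examples from our setup, nothing required by Loki):

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: loki-ingester-a            # illustrative; one StatefulSet per AZ (-a, -b, -c)
spec:
  serviceName: loki-ingester-a
  replicas: 1
  selector:
    matchLabels: {app: loki-ingester, zone: eu-west-1a}
  template:
    metadata:
      labels: {app: loki-ingester, zone: eu-west-1a}
    spec:
      nodeSelector:
        topology.kubernetes.io/zone: eu-west-1a
      tolerations:                 # matches the taint we put on that zone's nodes
        - key: zone                # illustrative taint key
          operator: Equal
          value: eu-west-1a
          effect: NoSchedule
      containers:
        - name: loki
          image: grafana/loki:2.1.0
          args: ["-config.file=/etc/loki/config.yaml", "-target=ingester"]
          volumeMounts:
            - name: data
              mountPath: /loki
  volumeClaimTemplates:
    - metadata: {name: data}
      spec:
        storageClassName: gp2      # EBS volumes are zonal, so the pod stays in this AZ
        accessModes: ["ReadWriteOnce"]
        resources: {requests: {storage: 10Gi}}

The gp2 (EBS) volume is what really pins the pod: once the claim is bound, the pod can only be scheduled in the AZ where the volume lives, and the node selector plus toleration make the initial placement deterministic.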
That aside, I’m just a bit worried because the solution you provided seems to cover the case of an AZ failure, whereas what we simulated is a failure of the network between two AZs.
I will keep you posted, and thank you again for the quick reply :).
@ewelch So I followed your recommendation. I still have trouble, but in a different way.
The Cortex ring seen from each distributor is healthy (they all see all of the ingesters).
The distributor in AZ C reports that something is wrong with ingesters A and B (I don’t know whether we can consider that normal, since it can reach them directly, but in any case the Cortex ring is OK):
ts=2021-01-29T16:18:02.100444174Z caller=memberlist_logger.go:74 level=warn msg="Was able to connect to monitoring-loki-ingester-b-0-9ab2878d but other probes failed, network may be misconfigured"
ts=2021-01-29T16:18:03.100592421Z caller=memberlist_logger.go:74 level=warn msg="Was able to connect to monitoring-loki-distributor-77b86cb6d6-hx6lr-d655e390 but other probes failed, network may be misconfigured"
ts=2021-01-29T16:18:04.100746492Z caller=memberlist_logger.go:74 level=warn msg="Was able to connect to monitoring-loki-ingester-a-0-46523f55 but other probes failed, network may be misconfigured"
ts=2021-01-29T16:18:06.10015135Z caller=memberlist_logger.go:74 level=warn msg="Was able to connect to monitoring-loki-querier-0-dead227f but other probes failed, network may be misconfigured"
And a lot of logs have been missing since 17:15 (when I ran my test). You can see that in the Grafana screenshot attachment.
When you ran the test, what did you test specifically? Did you break network connectivity to one of the zones? (and which one?)
There is an interesting question I now have about how memberlist is configured and what happens when one of the zones becomes unavailable; I will also ask my peers about it.
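For reference, by "how memberlist is configured" I mean the memberlist block of the Loki config, something along these lines (the join address and port here are illustrative):

memberlist:
  join_members:
    # illustrative address; whatever headless service or DNS entry your pods
    # gossip through
    - loki-memberlist.monitoring.svc.cluster.local:7946
  bind_port: 7946

What I want to check is how that gossip behaves when members can only partially reach each other, which seems related to the warnings you pasted.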
When the failure is simulated, all nodes remain Ready in the Kubernetes world, because technically they are all still able to reach the control plane. A tie-breaker that could kill either A or B in this situation would save it.