Hi,
I am trying to debug an issue with Loki in monolithic mode, in a High Availability (HA) configuration, running in a Kubernetes cluster with 3 nodes.
I am running HA tests. When all 3 nodes are up, svc/loki-gateway proxies to svc/loki, which routes to one of the three Loki pod replicas from the StatefulSet. When I run a simple Loki query in Grafana, I reliably get a response back in ~100ms every time.
However, if I turn off one of the nodes (unexpected shutdown), the Loki pod running on that node is eventually left in a Terminating state, and kubectl describe shows Conditions.Ready as false. The problem is that when I now run the same simple Loki query in Grafana, it times out in about 1 in 3 attempts. svc/loki-gateway still proxies to svc/loki, but svc/loki still lists the terminating pod's IP as one of its endpoints. I believe this is because svc/loki has spec.publishNotReadyAddresses: true.
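For what it's worth, the 1-in-3 failure rate is what I would expect if one of three endpoints is unreachable and each request is sent to an endpoint picked more or less at random. Here is a minimal Python sketch of that reasoning. It assumes uniform random endpoint selection, which I have not verified for my cluster, and the pod IPs are made up for illustration:

```python
import random


def timeout_fraction(endpoints, dead, attempts=30_000, seed=0):
    """Fraction of requests that land on a dead endpoint.

    Assumes each request picks an endpoint uniformly at random,
    which is an assumption about the Service's load balancing.
    """
    rng = random.Random(seed)
    hits = sum(1 for _ in range(attempts) if rng.choice(endpoints) in dead)
    return hits / attempts


# Hypothetical pod IPs, not taken from my cluster.
endpoints = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
dead = {"10.0.0.3"}  # the pod stuck in Terminating on the powered-off node

fraction = timeout_fraction(endpoints, dead)
print(f"share of requests hitting the dead pod: {fraction:.2f}")
```

The printed share comes out at roughly one third, which matches what I see in Grafana and is why I suspect the stale endpoint rather than Loki itself.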
Is there a workaround or config for this that I'm missing? To me, the pod's IP should be removed from the endpoints that svc/loki load-balances across, or at least svc/loki-gateway should fail fast and retry automatically when svc/loki routes it to the terminating pod. Any advice on this would be welcome.
Thank you