I am trying to bring up a highly available Loki install inside an EKS cluster that spans 2 AZs. Each AZ runs one read, one write, and one backend pod. When I kill one side, I lose one read, one write, and one backend pod. However, I also seem to lose the ability to read the logs served by the lost writer. I think the querier component fails to serve the logs that were written/cached by the now unavailable write pod. How do I fix that? I want all my logs to be available for query when I lose an AZ. Can you tell me how to fish here?
Are you using S3 as permanent storage? If so, you should only lose what’s in the WAL on your failed writer, that is, data it had not yet flushed to S3.
But the only way to be truly redundant is to use a replication factor of at least 3 for your writers; see “The essential config settings you should use so you won’t drop logs in Loki” (Grafana Labs).
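A minimal sketch of that setting, assuming a Loki version that supports the `common` config block and at least three write pods running. The fragment is illustrative and not taken from the poster's setup:

```yaml
# Assumed fragment of a Loki config file; merge it into your existing config.
common:
  # Each log stream is sent to this many write (ingester) pods.
  replication_factor: 3
```

With a replication factor of 3, a write is acknowledged once a quorum of replicas (2 of 3) has accepted it, so recent logs still exist on a surviving writer when one writer is lost.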
If I disable the WAL, will I write/flush more frequently to S3? I think the querier tries to fetch from the cache of the failed writer node and reports fewer logs than there actually are. I want the Loki HA experience to be more like Elasticsearch: one side fails, but you still have access to all the logs.
You can control how often the writer flushes to S3, but that ultimately has nothing to do with redundancy. As mentioned above, if you want true redundancy you need a replication factor of at least 3. It’s the same in Elasticsearch: you would configure the index with a replication factor of 2 or above.
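For reference, flush frequency is governed by ingester settings such as the ones sketched below. The values are illustrative assumptions, not recommendations. Shorter intervals shrink the window of unflushed data but produce more, smaller chunks in S3:

```yaml
# Assumed fragment of a Loki config file; values are examples only.
ingester:
  # Flush a chunk that has received no new entries for this long.
  chunk_idle_period: 5m
  # Flush a chunk once it reaches this age, even if it is still receiving entries.
  max_chunk_age: 30m
  wal:
    # Keeping the WAL enabled (an assumption here) lets a restarted writer
    # replay data it had not yet flushed.
    enabled: true
```

Tuning these only changes how much data sits unflushed on a writer at any moment; it does not replace replication.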