Highly available Loki on EKS (which is spread across 2 AZs)

sahmadd2x · November 8, 2023, 2:34pm

Hi,
I am trying to bring up a highly available Loki install inside an EKS cluster that spans 2 AZs. Each AZ runs one read, one write, and one backend pod. When I kill one side, I lose one read, one write, and one backend pod. However, I also seem to lose the ability to read the logs served by the lost writer. I think the querier component fails to serve the logs that were written/cached by the now-unavailable write pod. How do I fix that? I want all my logs to be available for query when I lose an AZ. Can you teach me how to fish here?

tonyswumac · November 8, 2023, 4:52pm

Are you using S3 as permanent storage? If so, you should only lose what’s in the WAL on your failed writer.

But the only way to be truly redundant is to use a replication factor of at least 3 for your writers; see “The essential config settings you should use so you won’t drop logs in Loki | Grafana Labs…”
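
To make the replication factor point concrete, here is a minimal Python sketch of the quorum arithmetic as I understand it from the Loki docs: a push is accepted once floor(replication_factor / 2) + 1 writers acknowledge it. The function names are hypothetical and only for illustration; this is not Loki code.

```python
def write_quorum(replication_factor: int) -> int:
    """Minimum number of writers that must acknowledge a push."""
    return replication_factor // 2 + 1


def tolerated_writer_failures(replication_factor: int) -> int:
    """How many replicas can be unavailable while pushes still reach quorum."""
    return replication_factor - write_quorum(replication_factor)


if __name__ == "__main__":
    for rf in (1, 2, 3):
        print(
            f"replication_factor={rf} "
            f"quorum={write_quorum(rf)} "
            f"tolerated_failures={tolerated_writer_failures(rf)}"
        )
```

With a replication factor of 1 or 2, no writer can be missing; with 3, one can. With only 2 AZs it is also worth checking how the replicas are spread, since an AZ that holds two of the three replicas would take the quorum down with it.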

sahmadd2x · November 8, 2023, 5:53pm

If I disable the WAL, will I write/flush more frequently to S3? I think the querier tries to fetch from the cache of the failed writer node and reports fewer logs than there actually are. I want the Loki HA experience to be more like Elasticsearch: one side fails, but you still have access to all the logs.

tonyswumac · November 8, 2023, 6:19pm

You can control how often the writer flushes to S3, but that ultimately has nothing to do with redundancy. As mentioned above, if you want true redundancy you need a replication factor of at least 3. It’s the same in Elasticsearch: you would configure the index with a replication factor of 2 or above.
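
As a rough illustration of why flush frequency is a separate question from redundancy, here is a simplified Python model of when a writer would flush a chunk to S3. The field names mirror the Loki ingester settings `chunk_idle_period`, `max_chunk_age`, and `chunk_target_size`, but the values, the `FlushPolicy` class, and the `should_flush` function are assumptions for illustration, not Loki's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class FlushPolicy:
    # Illustrative values only; check your own ingester configuration.
    chunk_idle_period_s: float = 30 * 60       # flush if no new lines for this long
    max_chunk_age_s: float = 2 * 60 * 60       # flush once the chunk is this old
    chunk_target_size_bytes: int = 1_572_864   # flush once the chunk is this big


def should_flush(policy: FlushPolicy, age_s: float, idle_s: float, size_bytes: int) -> bool:
    """Return True if any of the three flush conditions is met."""
    return (
        idle_s >= policy.chunk_idle_period_s
        or age_s >= policy.max_chunk_age_s
        or size_bytes >= policy.chunk_target_size_bytes
    )


if __name__ == "__main__":
    policy = FlushPolicy()
    # A young, small, still-active chunk has not been flushed yet,
    # so it exists only on the writer (memory plus WAL).
    print(should_flush(policy, age_s=60, idle_s=10, size_bytes=1024))
```

However short you make these intervals, there is always some window in which recent data exists only on the writer, which is why replication is what covers the loss of a writer.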
