Architectural compatibility with async-replicated S3 storage

Dear all,

I have an architectural question to submit to your software architects and developers about the following scenario:

I have two independent Kubernetes clusters, each deployed at a physically distant geographic location. For the scope of this post, I'll call them PS (Primary Site) and RS (Remote Site).

On the PS site, I have the Promtail + Loki stack deployed inside K8s in monolithic mode (but I can switch to the Simple Scalable or Microservices mode without problems). It collects cluster-wide pod logs and stores them on an S3-compatible storage located in the same site.

The S3 storage software handles by itself the replication of the bucket contents to the RS site, where another Loki instance reads the data written by "its brother on the PS site" and is then queried by Grafana for data visualization.

On the RS site, Loki doesn't receive any logs, so all of its "write" features can be disabled, since no log ingestion is required.
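To make the idea concrete, here is a minimal sketch of what I have in mind for the RS instance, assuming the Simple Scalable `read` target is enough to run a query-only Loki against the replicated bucket (the endpoint and bucket names are placeholders, not a tested configuration):

```yaml
# RS-site Loki, query path only. Hypothetical sketch, started with:
#   loki -config.file=loki-rs.yaml -target=read
auth_enabled: false

common:
  path_prefix: /loki
  storage:
    s3:
      endpoint: s3.remote-site.example   # placeholder for the RS S3 endpoint
      bucketnames: loki-data             # the bucket replicated from PS
      access_key_id: ${S3_ACCESS_KEY}
      secret_access_key: ${S3_SECRET_KEY}
      s3forcepathstyle: true
```

The intent is that this instance never runs an ingester or distributor, only the query components reading from the replicated bucket.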

If the replication of the S3 storage between the two sites were synchronous, I would expect the architecture to work out-of-the-box, based on this architectural schema https://grafana.com/docs/loki/latest/get-started/microservices-mode.png from the https://grafana.com/docs/loki/latest/get-started/deployment-modes/#microservices-mode documentation page: there are no direct interactions between the "writer" and the "reader" components; the only common point is the "Cloud Storage".

But unfortunately, given the nature of the sites, synchronous replication can't be achieved because of the poor quality of the network in some periods. The S3 storage software handles this correctly, but it certainly adds the following problem to the equation:

What happens to the RS Loki instance if it finds some indexes on its site's storage without the referenced chunks, or the opposite (the chunks but not the index), because some objects are still in the replication queue of the S3 software?

I think the latter case will not be an issue: if Loki doesn't have the indexes, it doesn't know the chunk objects exist, so the chunks will simply sit on the storage and nothing happens.

But the former case may be a problem, because Loki knows the index and therefore all the chunks referenced by it, yet some or all of those chunks don't exist on the storage yet because of the asynchronous replication.

Another problem is how to tell the RS Loki to periodically "rescan" the S3 storage: to pick up new index objects that may have been replicated, and to drop the indexes (and so the chunks) that may have been deleted on the PS site due to retention expiration. But given that the monolithic architecture already works with shared S3 storage, I think all of these concepts have been implemented in some way.
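Regarding the periodic rescan: if I read the sample configurations correctly, the index shipper already re-syncs the index from object storage on a timer, so something like the following might cover it (the field names are from my reading of the shipper section of the config reference; please correct me if they don't apply here):

```yaml
storage_config:
  tsdb_shipper:
    active_index_directory: /loki/index
    cache_location: /loki/index_cache
    cache_ttl: 24h        # how long downloaded index files are kept locally
    resync_interval: 5m   # how often the local index is re-synced from S3
```

If that's right, the RS readers would notice newly replicated index files within one `resync_interval`, and cached copies of deleted indexes would age out via `cache_ttl`.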

Do you have any advice, or can you help me?

Kind Regards

Simon

Neither case is problematic in terms of cluster stability. Your users will see errors, however, if indexes are present but chunks are not.

Hi Tonyswumac,

Thank you for the reply.

Could you also help me, or link to the documentation, on how to configure Loki regarding how often it writes new index and chunk objects, so I can tune the system?

For example, if the replication of the S3 backends starts every 30 minutes, Loki should also flush/close its indexes and chunks every 30 minutes and start with new ones.

I see some references inside the sample configurations to chunk size and caching options, but I'm unable to find detailed documentation on the available parameters with explanations of their behavior.

Thanks
Regards

See Grafana Loki configuration parameters | Grafana Loki documentation. You are probably looking for chunk_block_size, chunk_idle_period, and max_chunk_age.
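For reference, those parameters live in the `ingester` block. A sketch matching the 30-minute replication window mentioned above (the values are illustrative, not recommendations; check the config reference for defaults):

```yaml
ingester:
  chunk_idle_period: 30m    # flush a chunk that has received no new logs for 30m
  max_chunk_age: 30m        # force-flush chunks older than 30m regardless of traffic
  chunk_block_size: 262144  # size of the compressed blocks inside a chunk (bytes)
```

Note that lowering these increases the number of small chunk objects written to S3, which is a trade-off against query and storage efficiency.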