[Loki Distributed] Control egress traffic for remote storage

We have installed a Loki Distributed stack (Helm chart v0.56.6) that works with GCS as remote storage.

We noticed that the egress traffic for the corresponding bucket increased significantly and is causing continuously high costs.
Now we are looking for settings to control or reduce this.

After numerous tests, we found that we apparently only need to increase the compaction interval of the compactor: going from the original 2 minutes to 2 hours reduced the egress traffic substantially.

    compactor:
      shared_store: gcs
      compaction_interval: 2h
      retention_enabled: true
      retention_delete_delay: 2h


We are trying to understand why this is. Why does increasing the compactor's interval result in a lasting reduction in egress traffic?

Besides that: could setting the compaction interval this high cause problems?

It’s explained in pretty good detail here:

Essentially, each writer produces an index file of its own. The compactor takes all of them, merges and dedupes them, and then produces a new index file. So at minimum you’d have 2 + (NUM_WRITER * 2) API calls to GCS per run, and if you run it often enough that’ll probably stack up quickly (admittedly I haven’t looked into GCS’s cost model…)
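To make the "stacks up quickly" part concrete, here is a rough sketch of that estimate in Python. It only encodes the 2 + (NUM_WRITER * 2) formula from above and assumes every compaction run actually does that work; the function name and the writer count of 3 are made up for illustration and are not taken from Loki.

```python
SECONDS_PER_DAY = 86_400


def compactor_api_calls_per_day(num_writers: int, interval_seconds: int) -> int:
    """Rough lower bound on GCS API calls per day caused by index compaction.

    Assumption: every run costs 2 + (num_writers * 2) calls, as estimated
    in the post above. Real runs may cost more or less.
    """
    calls_per_run = 2 + num_writers * 2
    runs_per_day = SECONDS_PER_DAY // interval_seconds
    return calls_per_run * runs_per_day


if __name__ == "__main__":
    # Hypothetical setup with 3 writers, comparing a 2m and a 2h interval.
    for label, seconds in (("2m", 120), ("2h", 7_200)):
        print(label, compactor_api_calls_per_day(num_writers=3, interval_seconds=seconds))
    # 2m -> 5760 calls/day, 2h -> 96 calls/day
```

The absolute numbers are not meaningful on their own; the point is that the call count scales linearly with the number of runs per day.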

Thanks a lot @tonyswumac!

What I meant was mainly the egress traffic from GCS back to our cluster. This always accounts for the largest share of our remote storage costs.

I understand the logic of the compactor and what it does inside the Loki cluster. But I don’t get why increasing ‘compaction_interval’ reduces the amount of data transferred from the remote storage back to the cluster.

Do you have an explanation for this?

Perhaps I didn’t understand your original question well enough. Can you give a bit more context on what your original graph represented?

As for egress traffic, wouldn’t that be traffic outbound from your cluster? If so, that might make sense, since a shorter interval means more downloads of individual index files and more re-uploads of compacted index files. But given that your Loki cluster is likely sending far more data as chunks, I’d imagine an increase like that shouldn’t be very visible from a cost perspective.

As an example, we have roughly 1.5TB in our chunk storage, and our index is about 1GB. Let’s assume the ratio is the same for all chunks and index files uploaded. If I were to increase our compactor frequency by 60 times, that would be 60GB of additional egress, which is just not very noticeable compared to the chunk storage.
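Spelled out as a quick calculation, the estimate above looks like this. It is only a sketch of the reasoning: it assumes the extra egress scales linearly with how much more often the compactor runs, and the function names are made up for illustration.

```python
def extra_index_egress_gb(index_size_gb: float, frequency_multiplier: float) -> float:
    """Additional index traffic if the compactor runs `frequency_multiplier` times as often.

    Assumption: traffic grows linearly with the number of compaction runs.
    """
    return index_size_gb * frequency_multiplier


def share_of_chunk_storage(extra_gb: float, chunk_storage_gb: float) -> float:
    """Extra traffic expressed as a fraction of the chunk storage size."""
    return extra_gb / chunk_storage_gb


if __name__ == "__main__":
    extra = extra_index_egress_gb(index_size_gb=1.0, frequency_multiplier=60)
    share = share_of_chunk_storage(extra, chunk_storage_gb=1500.0)
    print(f"extra egress: {extra:.0f} GB, about {share:.0%} of 1.5 TB")
    # extra egress: 60 GB, about 4% of 1.5 TB
```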

Please excuse the inaccuracy in my first post. The graph is from GCS monitoring and shows outgoing traffic from the Loki bucket. The left part of the graph corresponds to a compaction interval of 2 minutes, the right part to 2 hours.

We have about 1.5 TB of chunks and a 400 MB index.

Your explanation makes sense, of course: with a shorter interval, the index files have to be downloaded and uploaded again much more often.
The additional storage for more index files is not the problem; the additional traffic is.
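As a rough upper bound for our numbers, here is a small Python sketch. It deliberately assumes the worst case, namely that every compaction run downloads the entire 400 MB index from the bucket. That is an assumption for illustration only, not verified compactor behavior, and the real volume is likely lower. The function name is made up.

```python
MINUTES_PER_DAY = 24 * 60


def daily_index_download_gb(index_size_mb: float, interval_minutes: float) -> float:
    """Worst-case index download volume per day, in GB.

    Assumption: each compaction run re-downloads the full index once.
    """
    runs_per_day = MINUTES_PER_DAY / interval_minutes
    return runs_per_day * index_size_mb / 1000


if __name__ == "__main__":
    for label, minutes in (("2m", 2), ("2h", 120)):
        volume = daily_index_download_gb(index_size_mb=400, interval_minutes=minutes)
        print(f"{label}: up to {volume:.1f} GB/day")
    # 2m: up to 288.0 GB/day
    # 2h: up to 4.8 GB/day
```

Even as a pessimistic bound, the factor of 60 between the two intervals would match the direction of the drop we see in the bucket's outgoing traffic.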

All in all, I suspect that the default value of 2 minutes is a poor choice and that the interval should be set much higher to keep network traffic costs low.

At this point I am just unsure to what extent an interval of 2h can affect the performance of the Querier, for example. Do you have any experience here, or any fundamental concerns?

We have the interval set to 1h, and it hasn’t been a problem for us. Another potential factor to consider is querier.query_ingesters_within, which we set to 2h. I hadn’t considered the interaction between these two settings before, but now I wonder if there is merit in setting query_ingesters_within longer than compaction_interval.

But I’d say index compaction probably doesn’t need to be done all that often, especially if you have caching enabled, so I wouldn’t worry too much about setting it to 1h or 2h.

Thanks @tonyswumac !

Will try the 2h version.

I don’t think the combination of these two settings could cause problems. Isn’t it actually better for query_ingesters_within to be greater than compaction_interval, because the index is then always compacted before queries for logs older than query_ingesters_within have to go to the store? But I hope this is not too simple a thought.

I would agree, I just don’t know for sure. I still need to spend some time reading through some of the code at some point.