Loki reliability caveats

Hi there,
Is there any documentation that explains the architectural caveats that can lead to losing log lines during ingestion? We currently run a fluentd/graylog architecture that suffers from this problem (a roughly 30 GB/day scenario), and I would like to understand what you suggest to address this kind of problem.

Thanks


Hi @ltozi - we currently don’t have this in our docs, but I’ve made the recommendation to the Loki team to provide some guidance for this in a blog based on some updates coming in the next release (end of January), so likely sometime in February would be the earliest we could do this by. Thanks so much for bringing this up.

Dangerous times ahead where no one can be 100% sure what the world will look like, so it's good to know the caveats and the escape route(s), or even better, how to avoid them…

Seriously though @samcoren: something like this should have been done long ago.

Great, thanks for your feedback!

It’s a mistake to interpret Sam’s answer as “nobody knows”, when all she said was nobody has documented it.

Generally speaking, Loki is very good at not losing log lines, and in normal operation we don't see missing logs. The Loki Canary was built to black-box monitor Loki for exactly this purpose. Here's a snapshot of the last 12 hours from our internal ops clusters:

A properly sized Loki cluster has very little risk of lost logs. In this image the "websocket missing" entries are not lost logs, but rather logs that didn't appear over the websocket connection used for live tailing. The canary queries for those log lines to see if they are in Loki; if they were missing from Loki, it would increment the "Missing" counter.

However, no system is infallible and the most likely scenario for losing logs would be an ingester crash without replication or multiple ingester crashes with replication.

Loki builds chunks in memory, and while those logs are in memory their protection against loss depends on the replication_factor configured for the cluster. The distributed hash ring assigns log streams across all the ingesters in the cluster, and the distributors send logs to, and replicate them across, additional ingesters for durability.
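To make the ring idea concrete, here is a toy sketch of consistent hashing: a stream's label set maps to a token on a ring, and the write goes to the next `replication_factor` distinct ingesters clockwise. This is a simplification for illustration, not Loki's actual dskit ring implementation (which uses multiple tokens per ingester, zone awareness, etc.).

```python
import hashlib
from bisect import bisect_right

def token(key: str) -> int:
    """Map an arbitrary string to a position on a 32-bit ring."""
    return int(hashlib.sha256(key.encode()).hexdigest(), 16) % (2**32)

def replicas(stream_labels: str, ingesters: list, replication_factor: int) -> list:
    """Pick the replication_factor distinct ingesters that own this stream."""
    ring = sorted((token(name), name) for name in ingesters)
    tokens = [t for t, _ in ring]
    start = bisect_right(tokens, token(stream_labels))
    chosen = []
    for i in range(len(ring)):
        name = ring[(start + i) % len(ring)][1]
        if name not in chosen:
            chosen.append(name)
        if len(chosen) == replication_factor:
            break
    return chosen

ingesters = ["ingester-0", "ingester-1", "ingester-2", "ingester-3"]
print(replicas('{app="api", env="prod"}', ingesters, 3))
```

The key property is that the same label set always hashes to the same owners, so all lines of one stream land on the same set of ingesters.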

When a write is sent to Loki, the log lines are guaranteed to be written to at least floor(replication_factor/2)+1 ingesters (a quorum). For example, we typically run with a replication factor of 3, meaning a write will not return a success unless it was successfully sent to at minimum 2 ingesters.
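The quorum math above is simple enough to sketch (this mirrors the majority-quorum rule described in the post, not Loki's actual source):

```python
def required_acks(replication_factor: int) -> int:
    """Minimum ingesters that must ack a write before it succeeds:
    a simple majority, floor(replication_factor / 2) + 1."""
    return replication_factor // 2 + 1

print(required_acks(1))  # 1 -> no redundancy; one crash can lose in-memory chunks
print(required_acks(3))  # 2 -> the typical setup described above
print(required_acks(5))  # 3 -> tolerates two simultaneous ingester failures
```

With replication factor 3, losing a single ingester cannot lose acknowledged data, because at least one surviving replica holds every acknowledged line.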

The most common cause of crashes is hitting a hard memory limit in a memory-constrained environment like Docker/Kubernetes. Panics are rare, but no software is perfect.

We currently have a Write Ahead Log for ingesters under test. This will further harden Loki against lost logs even in this scenario: every log received is immediately persisted to disk so it can be re-read on restart, eliminating the risk of multiple ingester crashes losing data.

Other considerations are more on the client side. For example, with promtail it's important that the log files you are reading from are sufficiently large so they aren't rotated too quickly. While promtail does a good job of reading a lot of data very quickly, it can at most keep a handle to the currently open file. If you have a very high-volume log file, in the MB/sec range, you can run into trouble if the file's size limit is small for that volume: with, say, a 10 MB maximum you would be rolling the file every few seconds, while promtail is trying to push batches and read the file as it's being rotated. (Normal file rotation and reading is handled well; promtail will finish reading the current file before moving to the next.)
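The sizing concern above is plain arithmetic, which can be worth writing down when planning rotation limits (illustrative numbers only, not a promtail API):

```python
def seconds_per_rotation(max_file_mb: float, write_rate_mb_per_s: float) -> float:
    """How often a size-capped log file rolls at a given write rate."""
    return max_file_mb / write_rate_mb_per_s

def required_file_mb(write_rate_mb_per_s: float, buffer_seconds: float) -> float:
    """Minimum file size needed to buffer a given window of logs."""
    return write_rate_mb_per_s * buffer_seconds

# The problematic case described above: 2 MB/s into a 10 MB file
# rolls the file every 5 seconds.
print(seconds_per_rotation(10, 2))   # 5.0

# To buffer 8 minutes of logs at 2 MB/s you'd need roughly a 1 GB file.
print(required_file_mb(2, 8 * 60))   # 960.0
```

The second function connects to the retry deadline discussed next: the file should be able to hold at least as many logs as the client's retry window can span.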

Another client-side consideration is the timeout on failed sends. The default promtail config will retry a push for about 8 minutes before throwing logs away, so if you're unable to connect and send to Loki within 8 minutes, you will lose logs. Bear in mind too that log file rotation doesn't pause while promtail is waiting to send, so you need a log file that can hold 8+ minutes of logs, or that becomes the short straw for lost logs. Both of these can be configured to extend the deadline if desired.
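To see where "about 8 minutes" comes from, here is the sum of an exponential backoff schedule. The parameter values below reflect what I understand the promtail client defaults to be (backoff_config with min_period 500ms, max_period 5m, max_retries 10); verify them against your promtail version's docs before relying on them.

```python
def total_retry_window(min_period_s: float, max_period_s: float, max_retries: int) -> float:
    """Total time spent retrying with exponential backoff:
    the wait doubles each attempt, capped at max_period_s."""
    total, period = 0.0, min_period_s
    for _ in range(max_retries):
        total += period
        period = min(period * 2, max_period_s)
    return total

secs = total_retry_window(0.5, 300.0, 10)
print(f"{secs:.1f}s = {secs / 60:.2f} minutes")  # 511.5s = 8.53 minutes
```

Raising max_retries (or max_period) in the client's backoff config extends this window, at the cost of buffering more unsent data in promtail.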

From here on I think we just end up in more edge cases, like what happens if Loki can't connect to the object store to send chunks (it will keep them in memory and retry indefinitely, until it runs out of memory).

Hopefully lost logs are something you don't encounter; they're not something we encounter. As long as your Loki cluster is running and your 99th-percentile push latencies are, say, sub-500 ms, I wouldn't expect you to have any trouble.


Wow, thanks so much for all the clarification. I didn’t know about Loki Canary

@ewelch is the dashboard you posted a screenshot of available as open source?

What does size have to do with security? I don't get it…

This post was flagged by the community and is temporarily hidden.

There is no need for this kind of language or this type of response.

If you don’t understand or have a different opinion it can be expressed like a professional.

can accept things like logs being lost in case of missed delivery

You either didn’t read or misunderstood what I said.

Logs are not lost on a missed delivery; they are retried until the maximum number of retries is exceeded.

This, as well as how long to wait between retries, is configurable.