It’s a mistake to interpret Sam’s answer as “nobody knows”; all she said was that nobody has documented it.
Generally speaking, Loki is very good at not losing log lines, and in normal operation we don’t see missing logs. The Loki Canary was built to black-box monitor Loki for exactly this purpose; here’s a snapshot of the last 12 hours from our internal ops clusters:
A properly sized Loki cluster has very little risk of losing logs. In this image, the “websocket missing” entries are not lost logs, but logs that didn’t appear over the websocket connection used for live tailing. The canary queries for those log lines to see if they are in Loki, and if they were missing from Loki it would increment the “Missing” counter.
However, no system is infallible and the most likely scenario for losing logs would be an ingester crash without replication or multiple ingester crashes with replication.
Loki builds chunks in memory, and while those logs are in memory their protection against loss comes from the replication_factor configured for the cluster. The distributed hash ring assigns log streams across all the ingesters in the cluster, and the distributors replicate each write to additional ingesters for durability.
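For reference, replication is controlled in the ingester’s ring configuration. A minimal sketch of that block (key names as we understand the Loki config; the kvstore and values here are illustrative, not a recommendation):

```yaml
ingester:
  lifecycler:
    ring:
      kvstore:
        store: memberlist   # illustrative; consul/etcd are also common
      replication_factor: 3 # each stream is written to 3 ingesters
```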
When a write is sent to Loki, the log lines are guaranteed to be written to at least floor(replication_factor/2)+1 ingesters. For example, we typically run with a replication factor of 3, meaning a write will not return success unless it was sent to at minimum 2 ingesters.
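The quorum arithmetic above can be sketched as follows (`write_quorum` is an illustrative helper, not a Loki function):

```python
def write_quorum(replication_factor: int) -> int:
    """Minimum number of ingesters that must acknowledge a write
    before the distributor reports success: a simple majority."""
    return replication_factor // 2 + 1

# With the replication factor of 3 mentioned above, 2 acks are required.
print(write_quorum(3))  # -> 2
print(write_quorum(5))  # -> 3
```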
The most common cause of crashes is hitting a hard memory limit in a memory-constrained environment like Docker/Kubernetes. Panics are rare, but no software is perfect.
We currently have a Write Ahead Log for ingesters under test. This will further harden Loki against lost logs even in this scenario: every log received is immediately persisted to disk so it can be re-read on restart, eliminating the risk that multiple ingester crashes lose data.
Other considerations are more on the client side, for example with promtail. It’s important that the log files you are reading from are sufficiently large that they aren’t rotated too quickly. While promtail does a good job of reading a lot very quickly, it can at most keep a handle to the currently open file. If you have a very high volume log file, in the MB/sec range, and the file is size-limited to something small for that volume, say 10MB max, you would be rolling the file every few seconds while promtail is trying to push batches and read the file as it’s being rotated. (Normal file rotation and reading is handled well; promtail will finish reading the current file before moving to the next.)
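To make the rotation math concrete, a quick back-of-the-envelope sketch (the rates and cap are made-up illustrative numbers, not measured defaults):

```python
def seconds_per_rotation(max_file_mb: float, write_rate_mb_s: float) -> float:
    """How long a size-capped log file lasts at a given write rate."""
    return max_file_mb / write_rate_mb_s

# A 10MB cap at 2 MB/sec rolls the file every 5 seconds,
# leaving promtail very little time to finish reading it.
print(seconds_per_rotation(10, 2))   # -> 5.0
print(seconds_per_rotation(500, 2))  # -> 250.0
```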
Another client-side consideration is the timeout on failed sends. The default promtail config will retry a push for about 8 minutes before throwing logs away, so if you’re unable to connect and send to Loki within 8 minutes, you will lose logs. You also need to consider that the log file may be rotated while promtail waits to send, so you need a log file that can hold 8+ minutes of logs, or that becomes the short straw for lost logs. Both of these can be configured to extend the deadline if desired.
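That retry window is driven by the client’s backoff settings. A hedged sketch of extending it in the promtail config (key names as we understand the client `backoff_config` block; the URL and values are illustrative, and defaults may differ by version):

```yaml
clients:
  - url: http://loki:3100/loki/api/v1/push  # illustrative endpoint
    backoff_config:
      min_period: 500ms  # initial delay between retries
      max_period: 5m     # cap on the exponential backoff
      max_retries: 10    # raise this to stretch the ~8 minute window
```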
From here on I think we just end up in more edge cases, like what happens if Loki can’t connect to the object store to send chunks (it will keep them in memory and retry indefinitely until it runs out of memory).
Hopefully lost logs are something you don’t encounter; they aren’t something we encounter. As long as your Loki cluster is running and your 99th percentile push latencies are, say, sub-500ms, I wouldn’t expect you to have any trouble.