Loki ingesters keep failing

Hello there,

I’m having some strange issues with Loki, mainly with the ingesters.
They eventually start reporting a 503 error, and looking at the /services endpoint shows this:

store => Running
ingester => Starting
runtime-config => Running
server => Running
memberlist-kv => Running
ring => Running

The deployment is done using the loki-distributed helm chart.
I tried enabling the debug log-level, but that doesn’t seem to provide any useful information.

The pods keep restarting because of the 503 on the /ready endpoint, and after one pod fails, the others eventually follow one by one.

The only temporary fix I have found so far is deleting that pod’s PVC/storage and restarting it. That is of course not a viable long-term solution.

What are good starting points for investigating these kinds of issues?
Is there some way to validate the TSDB files, or to see if there are large blobs or large log lines that might cause issues?

Thanks in advance!

Please share your configuration and perhaps some logs from the ingester pod/container.

Hello @tonyswumac, sure, thanks for trying to help.

Here is a link to a zip of the logs and the config.
The logs are too large to share in any other way, and I didn’t want to strip the logs and maybe remove useful data.

https://sleutelhanger.vyus.nl/#/send/Fzgr867AR-y6_dbC6ZSY1w/ChJYub_d8ucWIONbwjpoBQ

Thanks in advance!

Also, another strange thing I noticed.

Today 3 pods were restarting over and over again.
I fixed one pod by removing its persistent storage and restarting the container, and after a short while one of the other containers also stopped returning a 503 on the /ready healthcheck endpoint.

I see a lot of these in your logs:

ts=2024-03-27T07:57:29.729370506Z caller=memberlist_logger.go:74 level=debug msg="Failed to join 192.168.42.250:7946: dial tcp 192.168.42.250:7946: connect: connection refused"
ts=2024-03-27T07:57:29.730132538Z caller=memberlist_logger.go:74 level=debug msg="Initiating push/pull sync with:  192.168.109.166:7946"
ts=2024-03-27T07:57:29.732535731Z caller=memberlist_logger.go:74 level=debug msg="Initiating push/pull sync with:  192.168.61.234:7946"
ts=2024-03-27T07:57:29.734908283Z caller=memberlist_logger.go:74 level=debug msg="Failed to join 192.168.179.200:7946: dial tcp 192.168.179.200:7946: connect: connection refused"
ts=2024-03-27T07:57:29.735363418Z caller=memberlist_logger.go:74 level=debug msg="Initiating push/pull sync with:  192.168.61.231:7946"
ts=2024-03-27T07:57:29.737768691Z caller=memberlist_logger.go:74 level=debug msg="Initiating push/pull sync with:  192.168.180.105:7946"
ts=2024-03-27T07:57:29.740454349Z caller=memberlist_logger.go:74 level=debug msg="Initiating push/pull sync with:  192.168.179.204:7946"
ts=2024-03-27T07:57:29.742707493Z caller=memberlist_logger.go:74 level=debug msg="Initiating push/pull sync with:  192.168.42.252:7946"
ts=2024-03-27T07:57:29.745090355Z caller=memberlist_logger.go:74 level=debug msg="Failed to join 192.168.180.104:7946: dial tcp 192.168.180.104:7946: connect: connection refused"
ts=2024-03-27T07:57:29.745320168Z caller=memberlist_logger.go:74 level=debug msg="Initiating push/pull sync with:  192.168.42.248:7946"
ts=2024-03-27T07:57:29.747641066Z caller=memberlist_logger.go:74 level=debug msg="Failed to join 192.168.101.245:7946: dial tcp 192.168.101.245:7946: connect: connection refused"
ts=2024-03-27T07:57:29.748737926Z caller=memberlist_logger.go:74 level=debug msg="Failed to join 192.168.180.106:7946: dial tcp 192.168.180.106:7946: connect: connection refused"
ts=2024-03-27T07:57:29.749587114Z caller=memberlist_logger.go:74 level=debug msg="Initiating push/pull sync with:  192.168.101.249:7946"
ts=2024-03-27T07:57:29.752648713Z caller=memberlist_logger.go:74 level=debug msg="Initiating push/pull sync with:  192.168.180.107:7946"
ts=2024-03-27T07:57:29.755028634Z caller=memberlist_logger.go:74 level=debug msg="Initiating push/pull sync with:  192.168.42.208:7946"
ts=2024-03-27T07:57:29.75677623Z caller=memberlist_logger.go:74 level=debug msg="Failed to join 192.168.42.246:7946: dial tcp 192.168.42.246:7946: connect: connection refused"
ts=2024-03-27T07:57:29.757220956Z caller=memberlist_logger.go:74 level=debug msg="Failed to join 192.168.61.233:7946: dial tcp 192.168.61.233:7946: connect: connection refused"

I would say double check and make sure all your Loki components can connect to each other on the HTTP port, the gRPC port (9095 by default), and the gossip port (7946, which is the port in the memberlist errors above). Not just the ingester, all components.
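For what it’s worth, a quick way to verify plain TCP reachability between pods is a small probe script run from a debug pod inside the cluster. This is only a sketch: the pod IPs are taken from the memberlist errors above, and the port numbers are Loki’s defaults (3100 HTTP, 9095 gRPC, 7946 memberlist gossip), so adjust them to your setup:

```python
import socket

def reachable(host: str, port: int, timeout: float = 0.5) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Pod IPs copied from the memberlist log lines above; ports are Loki's
# defaults and may differ if your chart overrides them.
for ip in ("192.168.42.250", "192.168.180.104"):
    for port in (3100, 9095, 7946):
        state = "open" if reachable(ip, port) else "refused/unreachable"
        print(f"{ip}:{port} {state}")
```

If the gossip port shows as refused for some pods, that matches the "Failed to join … connection refused" lines exactly.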

Also, to make troubleshooting easier you could reduce the replication_factor to 1 and run just one ingester to reduce the noise.
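If you want to try that, a sketch of the relevant values for the loki-distributed chart could look like this; the exact key names depend on your chart version, so treat it as illustrative only:

```yaml
# Illustrative troubleshooting values only -- check key names against
# your chart version, and revert once you're done.
ingester:
  replicas: 1               # run a single ingester while troubleshooting
loki:
  config: |
    common:
      replication_factor: 1 # write each stream to only one ingester
```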

Well, that is kind of the issue I’m having: for some reason some ingesters stop working.
Reducing replication to 1 and running only one ingester might be an option for troubleshooting, but it isn’t the recommended setup.
And deleting the storage and starting over is of course not a solution either hehe.

I want to figure out how it gets into this state.
Why does one of the ingesters suddenly stop and report a 503?
There is nothing useful in the logs as far as I could tell, and it looks like you reached the same conclusion, since failing to connect to the other components is exactly what the issue is here.

I just meant changing the replication_factor as a troubleshooting step. Sounds like your cluster was working fine at some point? What changed?

Also look at the infrastructure side and see if your ingesters are being killed because of resource contention.
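On the resource angle: kubectl describe pod on an ingester will show a last state of OOMKilled if the container is being memory-killed. If that is the case, raising the ingester requests/limits in the chart values is a starting point; the numbers below are placeholders, size them from your actual usage:

```yaml
# Placeholder numbers only -- size these from observed ingester usage.
ingester:
  resources:
    requests:
      cpu: "1"
      memory: 4Gi
    limits:
      memory: 6Gi
```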

Well, I changed it to 1, but that didn’t seem to help much.
We changed some other settings, like the allowed amount of ingress bytes etc…
That seemed to mostly solve it.

Looking more closely at what we were receiving, we contacted another team that turned out to be sending a huge amount of logs compared to before we had issues, and they had some misconfigured items on their side.

Increasing memory did help a bit by the way, but in the end it still crashed/stopped.
So it probably was an issue with processing all the logs coming in.

Check the per-stream rate limit and the ingestion rate.
Loki should drop samples (the metric name is “loki_discarded_samples_total”) if a tenant is going overboard with logs;
it should not crash the entire plant.
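These knobs live under limits_config. The option names below are real Loki settings, but the values are only illustrative (Loki’s defaults are lower):

```yaml
limits_config:
  ingestion_rate_mb: 10             # per-tenant average ingest rate, MB/s
  ingestion_burst_size_mb: 20       # per-tenant burst size
  per_stream_rate_limit: 5MB        # average rate allowed per stream
  per_stream_rate_limit_burst: 20MB # burst allowed per stream
```

When these limits kick in, rejected lines are counted in loki_discarded_samples_total with a reason label, which makes it easy to spot which tenant or stream is going overboard.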