I’m having some strange issues with Loki, mainly with the ingesters.
These eventually start reporting a 503 error, and looking at the /services endpoint it shows this:
store => Running
ingester => Starting
runtime-config => Running
server => Running
memberlist-kv => Running
ring => Running
The deployment is done using the loki-distributed helm chart.
I tried enabling the debug log-level, but that doesn’t seem to provide any useful information.
The pods keep restarting because of the 503 on the /ready endpoint, and after one pod fails, after a while the next one fails, and then the next.
The only way I have found so far to temporarily fix this is to delete the PVC/storage of that pod and restart it. This is of course not a viable long-term solution.
What are good starting points for investigating this kind of issue?
Is there some way to validate the TSDB files, or to see if there are large blobs or large log lines which might be causing issues?
Today 3 pods were restarting over and over again.
I fixed one pod by removing its persistent storage and restarting the container, and after a short while one of the other containers also stopped returning a 503 on the /ready healthcheck endpoint.
I would say double-check and make sure all your Loki components can connect to each other on the HTTP port, the gRPC port, and the gossip/memberlist port (7946). Not just the ingester, all components.
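For reference, these are the defaults those ports map to, as a quick sketch assuming you haven’t overridden them in your values (the join_members name is just an example, adjust it to your release):

```yaml
server:
  http_listen_port: 3100      # HTTP API; /ready and /services are served here
  grpc_listen_port: 9095      # gRPC; traffic between components
memberlist:
  bind_port: 7946             # gossip ring shared by all components
  join_members:
    - loki-memberlist         # example headless service name, adjust to your deployment
```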
Also, to make troubleshooting easier you could reduce replication_factor to 1 and run just one ingester, to cut down the noise.
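Something like this in the loki-distributed values should do it, as a rough sketch. I’m assuming the chart exposes loki.structuredConfig and ingester.replicas, so double-check against the chart’s values.yaml:

```yaml
ingester:
  replicas: 1                 # single ingester while troubleshooting
loki:
  structuredConfig:
    common:
      replication_factor: 1   # write each stream to only one ingester
```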
Well, that is kinda the issue I’m having: for some reason some ingesters stop working.
Reducing replication to 1 and running only one ingester might be an option for troubleshooting, but that isn’t the recommended setup, so going back to it permanently probably isn’t an option.
And deleting the storage and starting over is also not really a solution hehe.
I want to figure out how it gets into this state.
Why does one of the ingesters all of a sudden stop and report a 503?
There is nothing useful in the logs as far as I can tell. And it kinda looks to me like you came to the same conclusion, since not being able to connect to the other components is exactly what the issue is here.
Well, I changed it to 1, but that didn’t seem to be really useful.
We changed some other settings, like the amount of ingested bytes etc., and that seemed to mostly solve it.
Looking a bit closer at what we were receiving, we contacted another team that seemed to have been sending a huge amount of logs compared to before we had the issues, and they had some misconfigured items on their side.
Increasing memory did help a bit by the way, but in the end it still crashed/stopped.
So it probably was an issue with processing all the logs coming in.
Check the per-stream rate limit and the ingestion rate limits.
Loki should drop samples (the metric name is “loki_discarded_samples_total”) if a tenant is going overboard with logs; it should not crash the entire deployment.
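For anyone who finds this later, these are the knobs in limits_config to look at. The values below are just illustrative, not a recommendation, and the query in the comment is only an example:

```yaml
limits_config:
  ingestion_rate_mb: 8              # per-tenant ingestion rate in MB/s (example value)
  ingestion_burst_size_mb: 16       # per-tenant burst size in MB (example value)
  per_stream_rate_limit: 5MB        # max rate for a single stream (example value)
  per_stream_rate_limit_burst: 20MB
  # To see what is being dropped and why:
  #   sum by (reason) (rate(loki_discarded_samples_total[5m]))
```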