I’ve been running some load test scenarios against Loki for capacity planning / sizing purposes. We are using the loki-distributed Helm chart version 0.78.5 with Loki 2.9.6, which I believe is the app version that chart ships.
For some reason the ingester pods are getting restarted once the load reaches certain levels, for example 3 ingester pods with 4 CPU and 8 GB of RAM each at an ingest rate of approximately 200k/sec.
From the kubernetes event stream I see:
Readiness probe failed: HTTP probe failed with statuscode: 503
We pass some threshold in terms of ingest rate and the ingester pods start failing their readiness probe. Even after the load test has ended the ingester pods remain in this state for some time; it’s been 20 minutes and I’m still waiting to see things return to a happy state. We have the autoscale configuration set for both distributor and ingester pods at the chart-default 80% CPU threshold, and both distributor and ingester pods do scale up during this test.
Any recommendations on how to go about debugging this? Any thoughts on configuration settings which I should be looking at?
When under load I noticed this, which seems like a long time for a single pod to be checkpointing. Is this pointing at a bottleneck / config issue?
Let me try to address your questions one at a time:
First, you never mentioned why your ingesters failed in the first place. CPU pressure or memory pressure?
You also did not share your test data. What do the test logs look like? How many labels? What is the level of cardinality of the labels?
Unless you are running the ingesters on Kubernetes as a StatefulSet, you may want to configure the ingester to auto-forget unhealthy members. The reason is that the ingester ring (readers as well, if I remember correctly) expects a certain number of members to be healthy, otherwise it won’t function at all. When your containers are under pressure and failing constantly you end up with a bunch of unhealthy members listed in the ring, which can prevent future containers from starting up. You can also speed this up by going to the ring page in the Loki UI (or using the API) and forgetting the unhealthy targets.
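For reference, this is a minimal sketch of the setting I mean, assuming you set it through the chart’s Loki config section (check the docs for your Loki version, I’m writing this from memory):

```yaml
# Loki config (e.g. the chart's loki.config / structuredConfig block)
ingester:
  # Forget ring members whose heartbeat has timed out instead of waiting
  # for them to come back, so a crashed pod doesn't leave the ring
  # permanently unhealthy.
  autoforget_unhealthy: true
```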
We don’t have a big cluster; we run 4 ingesters, each with 2 CPU x 8 GB memory. We take on average 7K to 10K logs per second (roughly 1 MB per second), around 4,500 streams (log streams with a unique set of labels), with CPU and memory pressure at about 40%. It’s quite a bit over-provisioned, but we do have burst traffic that goes up to 300K logs per second (a couple of times a day), so we kind of have to be prepared for that. Allocating resources for ingesters can be tricky, because it really depends on the nature of your logs (big, small, number of labels, etc.) and on how your users send logs to Loki (bursty or steady traffic).
If you are running on Kubernetes, I believe it’s generally recommended to have a larger number of smaller ingesters. You may also consider posting this question on the Grafana community Slack; there are people there running much bigger clusters than mine who may be able to give you better information.
So the reason the ingester pod is failing is not clear to me. It does not appear to be memory related; the pod is not getting OOM-killed. It’s just failing to process events.
As for the test data: it is synthetic data generated by grafana/xk6-loki: k6 extension for Loki (github.com), with log lines between 4 and 5 KB in size. It’s purely synthetic, so I have no idea what it even looks like!
The ingester pods simply have their readiness probe check failing; it’s not clear what failed or why. Here are the config overrides we are specifying for the chart:
We have the ingester pods configured to autoscale at 80% CPU. Each ingester pod gets 4 CPUs and 8 GB of RAM. I’ve thought about scaling the pods down to 2 CPUs to see if that performs better, i.e. a larger number of smaller pods like you suggest.
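In loki-distributed values terms that sizing looks roughly like the following. The replica bounds here are illustrative rather than our exact numbers, and the key names may differ slightly between chart versions:

```yaml
# values.yaml overrides for the loki-distributed chart (sketch)
ingester:
  resources:
    requests:
      cpu: "4"
      memory: 8Gi
    limits:
      cpu: "4"
      memory: 8Gi
  autoscaling:
    enabled: true
    minReplicas: 3          # illustrative; the test started with 3 ingesters
    maxReplicas: 6          # illustrative
    targetCPUUtilizationPercentage: 80
```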
Any thoughts? Especially around the config overrides. We’ve played with a few settings based on the synthetic load testing, but we know they will only get us into the ballpark of where we will need to be for real load (and yes, I know many settings will just depend on the actual traffic we receive!).
Start from a small test. Look at the test logs and see what they look like. The last time I used k6 to test our Loki cluster was a while ago, and I unfortunately did not pay attention to the cardinality of the logs, but if I had to guess I would say that’s likely not the problem.
You should look to implement some baseline metrics for your Loki cluster so you can understand why it’s failing. You don’t have to be elaborate, but have at least some basic ones, otherwise you are just making adjustments blind. Some metrics that would be useful, in my opinion, are CPU, memory, GC count, GC time, active streams, and streams created per second. Start with a small test, gradually increase the load, and see what the metrics say.
I’m hesitant to change the gRPC message size as we are not currently experiencing rejections related to it, but good to know for the future. I expect that is more dependent on the nature of the log content.
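For reference, I believe these are the settings in question, under the server block of the Loki config. The values here are illustrative, not the defaults, so double-check against the docs for your version:

```yaml
# Loki config: server gRPC message size limits (illustrative values)
server:
  grpc_server_max_recv_msg_size: 8388608   # 8 MiB; raise if large pushes are rejected
  grpc_server_max_send_msg_size: 8388608   # 8 MiB
```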
For the load test I configured, IIRC, 300 unique app names, 5 namespaces, and I forget what I put for the rest, but we should have a pretty good distribution in terms of the log data being generated.
I enabled persistence for the WAL / data directory. Now all 3 ingester pods are in a permanently failed state. I’m guessing that if I nuke the PV the instances would come back. (I basically had the same thing before; I issued a pod delete, which would wipe the storage out.)
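For reference, the WAL-related settings I’m looking at are along these lines. This is a sketch of the relevant Loki config block with illustrative values, not our exact config:

```yaml
# Loki config: ingester WAL settings (sketch; values are illustrative)
ingester:
  wal:
    enabled: true
    dir: /var/loki/wal            # must point at the persisted volume
    flush_on_shutdown: true       # flush chunks on clean shutdown so replay stays cheap
    replay_memory_ceiling: 4GB    # cap memory used while replaying the WAL after a restart
    checkpoint_duration: 5m       # how often the WAL is checkpointed
```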
It’s not clear to me why the ingesters are failing. I only see the logged 503 now.
You can scroll up a bit, my earlier reply had more information.
We are running 4 ingesters, 2 CPU x 8 GB memory each. With an average of 10K logs per second we are at 40% memory pressure.
I haven’t really done a deep dive, but I kind of feel like the memory pressure is related more to the number of streams than to the number of log lines you receive (logs with the same set of labels are grouped into a stream).
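If stream count does turn out to be what drives memory, the per-tenant stream limits are the knobs I would look at, roughly like this (a sketch with illustrative values; check the defaults in the docs for your version):

```yaml
# Loki config: per-tenant stream limits (illustrative values)
limits_config:
  max_streams_per_user: 0              # 0 disables the per-ingester limit
  max_global_streams_per_user: 10000   # cluster-wide active stream cap per tenant
```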
My pod size is similar, 1 CPU and 10 GB RAM, and we were able to push it to approximately 35-40k/sec per pod. It’s synthetic load rather than real traffic, so I’m curious to see how that scales compared to real traffic.
Thanks for your responses @tonyswumac, much appreciated!