Loki ingester pods failing readiness probe under load

I’ve been running some load test scenarios against Loki for capacity planning / sizing purposes. We are using the loki-distributed Helm chart, version 0.78.5, with Loki 2.9.6, which I believe is the app version that chart ships.
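
In case it helps anyone reproduce this, I believe the Loki version a given chart release ships can be confirmed like so (this assumes the Grafana Helm repo is added under the name grafana):

    helm show chart grafana/loki-distributed --version 0.78.5 | grep -E '^(version|appVersion):'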

For some reason, once the load reaches certain levels the ingester pods start getting restarted. For example, 3 ingester pods with 4 CPU and 8 GB of RAM each, hitting an ingest rate of approx 200k/sec.

From the kubernetes event stream I see:

Readiness probe failed: HTTP probe failed with statuscode: 503

I found and read over “The essential config settings you should use so you won’t drop logs in Loki | Grafana Labs”. All of the settings recommended there seem to be covered by the Helm chart; I did not have to set them myself.

We pass some threshold in terms of ingest rate and the ingester pods start failing their readiness probe. Even after the load test has ended, the ingester pods stay in this state for some time; it’s been 20 minutes and I’m still waiting to see things return to a happy state. We have autoscaling configured for both the distributor and ingester pods at the chart-default 80% CPU threshold, and both do scale up during this test.
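
For reference, the relevant part of our values.yaml looks roughly like this; the key names are from memory for the loki-distributed chart and the maxReplicas value is a placeholder, so double-check against the chart’s values.yaml:

    ingester:
      autoscaling:
        enabled: true
        minReplicas: 3
        maxReplicas: 6                      # placeholder ceiling
        targetCPUUtilizationPercentage: 80  # chart default
    distributor:
      autoscaling:
        enabled: true
        targetCPUUtilizationPercentage: 80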

Any recommendations on how to go about debugging this? Any thoughts on configuration settings which I should be looking at?

While under load I also noticed the following, which seems like a long time for a single pod to spend checkpointing. Is this pointing at a bottleneck or a config issue?

level=info ts=2024-06-21T14:09:22.526270446Z caller=checkpoint.go:569 msg="checkpoint done" time=4m30.072119284s
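
For reference, I believe the checkpoint interval comes from the ingester WAL settings. Going from the Loki docs, the relevant block looks something like the sketch below; the dir is the chart default path if I remember right, and the values shown are the documented defaults rather than our exact overrides:

    ingester:
      wal:
        enabled: true
        dir: /var/loki/wal           # chart default path (assumption)
        checkpoint_duration: 5m      # how often a new checkpoint is cut
        flush_on_shutdown: false
        replay_memory_ceiling: 4GB   # backpressure limit during WAL replay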

So it’s been probably 40 minutes since the test ended. One of the ingester pods is just not happy:

curl -v http://localhost:15020/app-health/ingester/livez
*   Trying 127.0.0.1:15020...
* Connected to localhost (127.0.0.1) port 15020 (#0)
> GET /app-health/ingester/livez HTTP/1.1
> Host: localhost:15020
> User-Agent: curl/7.81.0
> Accept: */*
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 503 Service Unavailable
< Date: Fri, 21 Jun 2024 15:15:28 GMT
< Content-Length: 0
<
* Connection #0 to host localhost left intact

The pod did not log anything when I hit the liveness or readiness endpoints. The probes are configured as:

( from kubectl describe on the pod )

    Liveness:   http-get http://:15020/app-health/ingester/livez delay=300s timeout=1s period=10s #success=1 #failure=3
    Readiness:  http-get http://:15020/app-health/ingester/readyz delay=30s timeout=1s period=10s #success=1 #failure=3

This is with little to no load; the log ingest rate is in the low tens of lines per second.
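
For what it’s worth, Loki’s own readiness endpoint can also be hit directly from inside the container, bypassing whatever is rewriting the probes on port 15020 (that path looks like an Istio-style health rewrite). This assumes the default http_listen_port of 3100:

    # run inside the ingester container; 3100 is Loki's default HTTP port
    curl -s -o /dev/null -w '%{http_code}\n' http://localhost:3100/ready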

Out of curiosity I hopped into the pod. top shows something that may be helpful:

Mem: 11878140K used, 121960K free, 4112K shrd, 1632K buff, 9391788K cached
CPU:  57% usr   4% sys   0% nic  27% idle  10% io   0% irq   0% sirq
Load average: 1.02 0.78 0.79 3/406 19
  PID  PPID USER     STAT   VSZ %VSZ CPU %CPU COMMAND
    1     0 loki     S    2927m  24%   1  57% /usr/bin/loki -config.file=/etc/loki/config/config.yaml -target=ingester
   14     0 loki     S     1792   0%   0   0% sh
   19    14 loki     R     1712   0%   1   0% top

The resources block for the ingester pod:

    resources:
      limits:
        cpu: 4
        memory: 8Gi
      requests:
        cpu: 1
        memory: 8Gi

Does this configuration seem sensible for an ingester pod? I have not found much in the way of sizing guides.

Let me try to address your questions one at a time:

  1. First, you never mentioned why your ingesters failed in the first place. CPU pressure or memory pressure?

  2. You also did not share your test data. What do the test logs look like? How many labels? What is the level of cardinality of the labels?

  3. Unless you are running your ingesters on Kubernetes as a StatefulSet, you may want to configure the ingesters to auto-forget unhealthy instances. The reason is that the ingesters (and the readers as well, if I remember correctly) expect a certain number of ring members to be healthy, otherwise they won’t function at all. When your containers are under pressure and failing constantly, you end up with a bunch of unhealthy instances listed in the memberlist, which can prevent future containers from starting up. You can also speed this up by removing the unhealthy target through the Loki ring UI (or an API call); see the rough example after this list.
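
Removing an unhealthy instance by hand looks roughly like this, going from memory; the port assumes Loki’s default of 3100 and the instance ID below is just a placeholder (take the real one from the ring page):

    # view the ingester ring (works against a distributor or ingester)
    curl http://localhost:3100/ring
    # forget a specific unhealthy instance
    curl -X POST -d 'forget=loki-loki-distributed-ingester-2' http://localhost:3100/ring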

We don’t have a big cluster; we run 4 ingesters, each with 2 CPU x 8 GB memory. We take on average 7K to 10K logs per second (roughly 1 MB per second), around 4,500 streams (the number of log streams with a unique set of labels), with CPU and memory pressure at 40%. It’s quite a bit overprovisioned, but we do have burst traffic that goes up to 300K logs per second (it happens a couple of times a day), so we kind of have to be prepared for that. Allocating resources for ingesters can be tricky, because it really depends on the nature of your logs (big, small, number of labels, etc.) and on how your users send logs to Loki (burst traffic or stable traffic).

If you are running on Kubernetes, I believe it’s generally recommended to have a larger number of smaller ingesters. You may also consider posting this question on the Grafana community Slack; there are people there running much bigger clusters than ours who may be able to give you better information.

So the reason the ingester pod is failing is not clear to me. It does not appear to be memory related; the pod is not getting OOM-killed. It’s just failing to process events.

As for the test data: it is synthetic data generated by grafana/xk6-loki: k6 extension for Loki (github.com), with the data being between 4 and 5 KB in size. It’s purely synthetic; I have no idea what it even looks like!

I’m using the helm-charts/charts/loki-distributed at main · grafana/helm-charts (github.com) chart, so everything is broken out into microservices.

The ingester pods simply have their liveness probe check failing; it’s not clear what has failed or why. Here are the config overrides we are specifying for the chart:

    structuredConfig:
      frontend:
        log_queries_longer_than: 5s
        query_stats_enabled: true
      memberlist:
        # https://github.com/grafana/mimir/issues/2865
        cluster_label: "loki"
      ingester:
        autoforget_unhealthy: true
      compactor:
        working_directory: /tmp/retention
        shared_store: s3
        delete_request_store: s3
        retention_enabled: true
      limits_config:
        max_query_lookback: 90d
        retention_period: 90d
        ingestion_rate_mb: 400
        ingestion_burst_size_mb: 400
        per_stream_rate_limit: 100mb
        per_stream_rate_limit_burst: 400mb
        # Tweaks to querier limits. We keep getting OOM
        max_chunks_per_query: 500000  # default is 2000000
        max_query_series: 50  # default is 500
        max_streams_matchers_per_query: 100  # default is 1000
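
In case it’s useful, the effective running config can also be dumped from a pod to confirm these overrides actually landed; this assumes Loki’s default HTTP port of 3100:

    # dump the running config from an ingester pod and spot-check an override
    curl -s http://localhost:3100/config | grep -A1 per_stream_rate_limit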

We have the ingester pods configured to autoscale at 80% CPU. Each ingester pod gets 4 CPUs and 8 GB of RAM. I’ve thought about scaling the pods down to 2 CPU to see if that performs better; like you say, a larger number of smaller pods.

Any thoughts? Especially around the config overrides. We’ve played with a few settings based on the synthetic load testing, but we know they are only going to get us into the ballpark of where we will need to be for real load (and yes, I know many settings will just depend on the actual traffic we receive!).

A couple of suggestions I’d make:

  1. Tune the gRPC message size a bit. This is what we use:

     server:
       # 100MB
       grpc_server_max_recv_msg_size: 1.048576e+08
       grpc_server_max_send_msg_size: 1.048576e+08

  2. Start from a small test. Look at the test logs and see what they look like. The last time I used k6 to test our Loki cluster was a while ago, and I unfortunately did not pay attention to the cardinality of the logs, but if I had to guess I would say that’s likely not a problem.

  3. You should look to implement some baseline metrics for your Loki cluster so you can understand why it’s failing. They don’t have to be elaborate, but at least some basic ones, otherwise you are just making adjustments blind. Some metrics that would be useful, in my opinion, would be CPU, memory, GC count, GC time, active streams, and streams created per second. Start with a small test, gradually increase load, and see what the metrics say; a few example queries are sketched below.
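
As a rough starting point, these are the kinds of Prometheus queries I would look at; the Loki metric names are from memory and the pod label regex is just a placeholder, so verify both against what your cluster actually exposes:

    # active streams per ingester pod
    sum by (pod) (loki_ingester_memory_streams)

    # rate of new streams being created
    sum(rate(loki_ingester_streams_created_total[5m]))

    # CPU and memory of the ingester pods (cAdvisor metrics)
    sum by (pod) (rate(container_cpu_usage_seconds_total{pod=~".*ingester.*"}[5m]))
    max by (pod) (container_memory_working_set_bytes{pod=~".*ingester.*"})

    # Go garbage collection pause time per second (filter by job/pod as appropriate)
    rate(go_gc_duration_seconds_sum[5m])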

I’m hesitant to change the gRPC message size, as we are not currently seeing rejections related to it, but it’s good to know for the future. I expect that one depends more on the nature of the log content.

For the load test I configured, IIRC, 300 unique app names and 5 namespaces; I forget what I put for the rest, but we should have a pretty good distribution in terms of the log data being generated.

So I set the log level to debug. Now I see:

level=debug ts=2024-06-26T18:55:00.938700417Z caller=logging.go:118 traceID=78d017f4308edbf8 orgID=fake msg="GET /ready (503) 57.541µs"
(...)
level=debug ts=2024-06-26T18:55:40.938885656Z caller=logging.go:118 traceID=540247a7a07cfd7a orgID=fake msg="GET /ready (503) 56.081µs"
level=debug ts=2024-06-26T18:55:43.788662613Z caller=logging.go:118 traceID=35b0e0ddc1f61a12 orgID=fake msg="GET /ready (503) 52.643µs"

I enabled persistence for the WAL / data directory. Now all 3 ingester pods are in a permanently failed state. I’m guessing that if I nuke the PVs the instances would come back (I basically had the same situation before: I issued a pod delete, which would wipe the storage out).
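
For the record, the cleanup I have in mind is something along these lines; the namespace and PVC/pod names are placeholders based on typical StatefulSet naming, so list the PVCs first to get the real names:

    # find the ingester PVCs
    kubectl get pvc -n <namespace> | grep ingester
    # delete the claim, then the pod; the PVC stays Terminating until the pod is gone,
    # after which the statefulset recreates both with fresh storage
    kubectl delete pvc data-loki-loki-distributed-ingester-0 -n <namespace>
    kubectl delete pod loki-loki-distributed-ingester-0 -n <namespace>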

It’s not clear to me why the ingesters are failing. I only see the logged 503s now.

Out of curiosity what are your ingester pod sizes and what kind of throughput are you getting out of your configuration?

I found that the ingester pod got killed due to OOM. I was running with 8 GB ingesters; they are now at 16 GB, with a single CPU core.
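
For anyone who hits the same thing, one way to confirm it is to look at the pod’s last terminated container state; the names below are placeholders:

    # prints OOMKilled if the last restart was an out-of-memory kill
    kubectl get pod <ingester-pod> -n <namespace> \
      -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'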

You can scroll up a bit, my earlier reply had more information.

We are running 4 ingesters, 2 CPU x 8 GB memory. With an average of 10K logs per second we are at 40% memory pressure.

I haven’t really done a deep dive, but I kinda feel like memory pressure is related more to the number of streams than to the number of log lines you receive (logs that have the same set of labels are grouped into a stream).

My pod size is similar: 1 CPU, 10 GB RAM, and we were able to push it to approx 35-40k/sec per pod. It’s synthetic load vs real traffic, though, so I’m curious to see how that scales against real traffic.

Thanks for your responses @tonyswumac much appreciated!