Scalability issue: POST /loki/api/v1/push (500) Response: context canceled

I have this environment:

  • Kubernetes client cluster with 9 workers dedicated to the test client code only.

  • Loki cluster on an Openshift cluster with 12 workers (each worker has 16 CPUx32GB RAM) dedicated to Loki only (6 distributors pods, 12 ingesters pods).
    Distributors have this config:

            - resources:
                  cpu: '4'
                  ephemeral-storage: 1Gi
                  memory: 2Gi
                  cpu: '4'
                  ephemeral-storage: 512Mi
                  memory: 2Gi

    Ingester have this config:

            - resources:
                  cpu: '2'
                  ephemeral-storage: 1Gi
                  memory: 20Gi
                  cpu: '1'
                  ephemeral-storage: 512Mi
                  memory: 1Gi

The test scenario is to simulate a continuous push with 500 Virtual Users from the kubernetes client cluster. Each VU executes a continuous push flow, where the single push is light in terms of logs pushed (around 5KBlog size) with 6 total labels high cardinality (around 10000), sleep time at each iteration 0.1 sec. Test runs for 20 minutes.

So the Kubernetes client cluster runs with 500 virtual users pushing logs to Loki cluster.
Used the k6 operator for Loki to run the test client (tried to run with 50-250 pods k6 clients).

With such number of VU the test is failing, ingesters pods are not writing any error log, while distributor pods are logging this:

│ {"caller":"logging.go:86","level":"warn","msg":"POST /loki/api/v1/push (500) 265.425µs Response: \"context canceled\\n\" ws: false; Accept: application/json; Accept-Encoding: gzip; Content-Encoding: snapp │
│ {"caller":"logging.go:86","level":"warn","msg":"POST /loki/api/v1/push (500) 247.437µs Response: \"context canceled\\n\" ws: false; Accept: application/json; Accept-Encoding: gzip; Content-Encoding: snapp │
│ {"caller":"logging.go:86","level":"warn","msg":"POST /loki/api/v1/push (500) 285.21µs Response: \"context canceled\\n\" ws: false; Accept: application/json; Accept-Encoding: gzip; Content-Encoding: snappy │

Tried to add the rate_store section to the distributor config:

      "store": "memberlist"
    "max_request_parallelism": "1000"
    "stream_rate_update_interval": "10s"
    "ingester_request_timeout: "5s"

But test is still failing with/without the rate_store config.

Tried also with lower label cardinality, but no way to run with such number of virtual users (500).
Instead when running with a few VUs (20) and big log size pushed (10MB) it is possible to get a total push size rate (100MB/s) with no errors (even for 12H of test running).

Could someone help to understand and fix the error, helping me with the loki config tuning to be able to run with 500 virtual users and light pushes (~5KB single push size)?
Thank you

Mind sharing your configuration? Wondering if some limit configurations or grpc send/recv size need to be tuned.

Do you see any logs from ingester? Any resource bottle neck? 2cpu/20G memory is a bit too memory skewed I’d imagine, wonder if ingester is having CPU pressure.

Hello Tony, thanks for questions.
There was not CPU or memory pressure on loki pods. The loki configuration of my test env is the following below. Is there any tuning you can suggest?

      "reporting_enabled": false
    "auth_enabled": true
        "enable_fifocache": false
          "max_size_bytes": "500MB"
      "max_look_back_period": "0s"
      "compactor_grpc_address": "ADDRESS1:PORT1"
      "compaction_interval": 10m
          "store": "memberlist"
      "shared_store": "s3"
      "working_directory": "/data/loki/compactor"
      "retention_enabled": true
      "retention_delete_delay": "2h"
      "retention_delete_worker_count": 150
      "max_compaction_parallelism": 10
          "store": "memberlist"
        "max_request_parallelism": 10000
        "stream_rate_update_interval": "10s"
        "ingester_request_timeout": "5s"
      "compress_responses": true
      "scheduler_address": "ADDRESS2:PORT1"
      "tail_proxy_url": "ADDRESS3:PORT2"
        "max_send_msg_size": 104857600
      "match_max_concurrent": true
      "scheduler_address": "ADDRESS2:PORT1"
      "chunk_block_size": 262144
      "chunk_encoding": "snappy"
      "chunk_idle_period": "1m"
      "chunk_retain_period": "30s"
      "chunk_target_size": 2097152
      "max_chunk_age": "5m"
        "heartbeat_period": "5s"
          - "eth0"
        "join_after": "0s"
        "num_tokens": 512
          "heartbeat_timeout": "1m"
            "store": "memberlist"
      "max_transfer_retries": 0
        "dir": "/data/loki/wal"
        "enabled": false
        "replay_memory_ceiling": "100MB"
        "checkpoint_duration": "5m"
        "max_recv_msg_size": 67108864
      "remote_timeout": "5s"
      "cardinality_limit": 100000
      "creation_grace_period": "10m"
      "enforce_metric_name": false
      "ingestion_burst_size_mb": 1000
      "ingestion_rate_mb": 1000
      "ingestion_rate_strategy": "global"
      "max_cache_freshness_per_query": "10m"
      "max_chunks_per_query": 2000000
      "max_entries_limit_per_query": 5000
      "max_global_streams_per_user": 0
      "max_label_name_length": 1024
      "max_label_names_per_series": 30
      "max_label_value_length": 2048
      "max_line_size": 0
      "max_query_length": "721h"
      "max_query_parallelism": 16
      "max_query_series": 500
      "max_streams_per_user": 0
      "per_stream_rate_limit": "200MB"
      "per_stream_rate_limit_burst": "1000MB"
      "reject_old_samples": true
      "reject_old_samples_max_age": "24h"
      "split_queries_by_interval": "30m"
      "retention_period": "72h"
      "abort_if_cluster_join_fails": false
      "bind_port": PORT3
        - "ADDRESS4:PORT3"
      "max_join_backoff": "1m"
      "max_join_retries": 10
      "min_join_backoff": "1s"
        "max_look_back_period": "30s"
        "timeout": "3m"
      "extra_query_delay": "0s"
      "max_concurrent": 2
      "query_ingesters_within": "3h"
      "query_timeout": "1h"
      "tail_max_duration": "1h"
      "align_queries_with_step": true
      "cache_results": true
      "max_retries": 5
      "parallelise_shardable_queries": false
          "enable_fifocache": true
            "max_size_bytes": "500MB"
      "max_outstanding_requests_per_tenant": 256
        - "from": "2020-10-01"
            "period": "24h"
            "prefix": "loki_index_"
          "object_store": "s3"
          "schema": "v11"
          "store": "boltdb-shipper"
      "graceful_shutdown_timeout": "5s"
      "grpc_server_max_concurrent_streams": 1000
      "grpc_server_max_recv_msg_size": 104857600
      "grpc_server_max_send_msg_size": 104857600
      "grpc_server_min_time_between_pings": "10s"
      "grpc_server_ping_without_stream_allowed": true
      "http_listen_port": PORT2
      "http_server_idle_timeout": "120s"
      "http_server_write_timeout": "600s"
      "http_server_read_timeout": "600s"
      "log_format": "json"
      "log_level": "debug"
        "active_index_directory": "/data/loki/index"
        "cache_location": "/data/loki/index_cache"
        "cache_ttl": "24h"
          "server_address": "ADDRESS5:PORT1"
        "resync_interval": "5m"
        "shared_store": "s3"
  overrides.yaml: '{}'

Couple of thoughts:

  1. "cardinality_limit": 100000: perhaps try an even higher number, like 1000000 or 500000.

  2. "ingestion_rate_strategy": "global": this means that the rate limit configurations are shared globally, so the more ingesters you deploy the less burst and overall traffic they are allowed to take. Personally I have this set to local:

ingestion_burst_size_mb: 200
ingestion_rate_mb: 100
ingestion_rate_strategy: local
  1. These two should have their values swapped I think?
"per_stream_rate_limit": "200MB"
"per_stream_rate_limit_burst": "1000MB"

Since we don’t really know why it’s not working, I am kinda just throwing stuff out blindly. I’d say since it’s your test environment maybe go crazy and crank up a bunch of settings and see if you can brute force it. Also some sort of debug log on ingester or distributor (if you are running distributor) might pinpoint the reason for you.

Thanks for the suggestions, we’ll try at next attempts.
Do you know where the error “context canceled” come from and what does it means?

│ {"caller":"logging.go:86","level":"warn","msg":"POST /loki/api/v1/push (500) 265.425µs Response: \"context canceled\\n\" ws: false; Accept: application/json; Accept-Encoding: gzip; Content-Encoding: snapp │

In my experience it usually means requests are flat out declined by Loki because of some sort of limits. I usually see that in our clusters when I send a query that would return too much data.

1 Like

Yes I’m getting this type of error “context canceled” on querier components also when running queries in some conditions (query time range not short, with some concurrency, etc).
It would be helpful to have in the context canceled error message also the reason, understanding which config limit has been overcome.

As of Go 1.20 this is finally possible but it will take some time for us or others to improve Loki with these error messages.

As a general comment I will say the k6 load test we built is not a very good representation of typical logging data, it effectively tries to send maximum volume for every VU for every stream.

It’s quite hard to build a synthetic benchmark that recreates the kinds of workloads Loki typically sees, which is a combination of both high and low volume streams spread over many labels.

To try to get your setup to work, I would suggest a very small number of labels and make sure they are the same between workers (don’t randomly generate labels, use fixed labels that are the same for all workers)

Your CPU limit of 2 is also probably a little low and I suspect the errors you are seeing is because your ingesters are unresponsive because of being CPU limited, just a guess here though. Also with the high cardinality you have and the way the k6 load test works it’s going to try to fill every stream to a maximum value which is gonna have a really high memory demand for the ingesters. Again this kind of workload is not something we really see in practice, unfortunately the k6 load test just isn’t a very good real world approximation.

Hello ewelch, thanks for your answer and suggestions. Yes it would be great to have this errors messages improved.
Btw, I moved to another cluster configuration with more resources (more nodes, CPU and memory ). After pushing a low and fixed number of labels as you said (around 100 total cardinality) I’m now experimenting also on this dataset. Can be this cardinality considered low, what do you think?

const labelPool = loki.Labels({
  "instance": ["localhost"],
  "os":["linux", "windows"],
  "word": ["abc1","abc2"],
  "container": ["cnt1","cnt2","cnt3","cnt4"],

That seems pretty reasonable :+1: but it still looks like it will generate I think 8 factorial streams which is about 40,000 streams, and each of those streams it will try to send as much data as it can to (which could create a lot more memory pressure on ingesters than 40k streams from an actual workload may produce)