Scalability issue: POST /loki/api/v1/push (500) Response: context canceled

I have this environment:

  • Kubernetes client cluster with 9 workers dedicated to the test client code only.

  • Loki cluster on an OpenShift cluster with 12 workers (each worker has 16 CPUs x 32 GB RAM) dedicated to Loki only (6 distributor pods, 12 ingester pods).
    Distributors have this config:

            - resources:
                limits:
                  cpu: '4'
                  ephemeral-storage: 1Gi
                  memory: 2Gi
                requests:
                  cpu: '4'
                  ephemeral-storage: 512Mi
                  memory: 2Gi
    

    Ingesters have this config:

            - resources:
                limits:
                  cpu: '2'
                  ephemeral-storage: 1Gi
                  memory: 20Gi
                requests:
                  cpu: '1'
                  ephemeral-storage: 512Mi
                  memory: 1Gi
    

The test scenario simulates a continuous push with 500 Virtual Users from the Kubernetes client cluster. Each VU executes a continuous push flow, where each single push is light in terms of logs pushed (around 5 KB log size), with 6 total labels at high cardinality (around 10,000) and a 0.1 s sleep at each iteration. The test runs for 20 minutes.

So the Kubernetes client cluster runs 500 virtual users pushing logs to the Loki cluster.
I used the k6 operator for Loki to run the test client (I tried running with 50-250 k6 client pods).
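
For reference, each VU's flow is roughly the following simplified xk6-loki sketch (not my exact script: the URL, cardinality map and batch sizes are placeholders, and the Config/Client/pushParameterized calls are as I understand them from the grafana/xk6-loki README):

import { sleep } from 'k6';
import loki from 'k6/x/loki';

// 500 VUs pushing continuously for 20 minutes
export const options = { vus: 500, duration: '20m' };

// Placeholders: push URL (with tenant), timeout in ms, protobuf ratio, label cardinality
const conf = new loki.Config("http://tenant@loki.example.local:3100", 10000, 0.9, { app: 100, namespace: 100 });
const client = new loki.Client(conf);

export default () => {
  // One light push per iteration: a single stream batch of roughly 5 KB
  client.pushParameterized(1, 4 * 1024, 6 * 1024);
  sleep(0.1);
};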

With this number of VUs the test fails: the ingester pods are not writing any error logs, while the distributor pods are logging this:

│ {"caller":"logging.go:86","level":"warn","msg":"POST /loki/api/v1/push (500) 265.425µs Response: \"context canceled\\n\" ws: false; Accept: application/json; Accept-Encoding: gzip; Content-Encoding: snapp │
│ {"caller":"logging.go:86","level":"warn","msg":"POST /loki/api/v1/push (500) 247.437µs Response: \"context canceled\\n\" ws: false; Accept: application/json; Accept-Encoding: gzip; Content-Encoding: snapp │
│ {"caller":"logging.go:86","level":"warn","msg":"POST /loki/api/v1/push (500) 285.21µs Response: \"context canceled\\n\" ws: false; Accept: application/json; Accept-Encoding: gzip; Content-Encoding: snappy │

I tried adding the rate_store section to the distributor config:

"distributor":
  "ring":
    "kvstore":
      "store": "memberlist"
  "rate_store":
    "max_request_parallelism": "1000"
    "stream_rate_update_interval": "10s"
    "ingester_request_timeout: "5s"

But the test still fails, with or without the rate_store config.

I also tried with lower label cardinality, but there is no way to run with this number of virtual users (500).
In contrast, when running with a few VUs (20) and a big push size (10 MB), it is possible to reach a total push rate of about 100 MB/s with no errors (even over 12 hours of test runtime).

Could someone help me understand and fix the error, and suggest Loki config tuning so that I can run with 500 virtual users and light pushes (~5 KB per push)?
Thank you

Mind sharing your configuration? Wondering if some limit configurations or gRPC send/recv sizes need to be tuned.

Do you see any logs from the ingesters? Any resource bottleneck? 2 CPU / 20 GB memory is a bit too memory-skewed I’d imagine; I wonder if the ingesters are under CPU pressure.

Hello Tony, thanks for the questions.
There was no CPU or memory pressure on the Loki pods. The Loki configuration of my test environment is below. Is there any tuning you can suggest?

  config.yaml: 
    "analytics":
      "reporting_enabled": false
    "auth_enabled": true
    "chunk_store_config":
      "chunk_cache_config":
        "enable_fifocache": false
        "fifocache":
          "max_size_bytes": "500MB"
      "max_look_back_period": "0s"
    "common":
      "compactor_grpc_address": "ADDRESS1:PORT1"
    "compactor":
      "compaction_interval": 10m
      "compactor_ring":
        "kvstore":
          "store": "memberlist"
      "shared_store": "s3"
      "working_directory": "/data/loki/compactor"
      "retention_enabled": true
      "retention_delete_delay": "2h"
      "retention_delete_worker_count": 150
      "max_compaction_parallelism": 10
    "distributor":
      "ring":
        "kvstore":
          "store": "memberlist"
      "rate_store":
        "max_request_parallelism": 10000
        "stream_rate_update_interval": "10s"
        "ingester_request_timeout": "5s"
    "frontend":
      "compress_responses": true
      "scheduler_address": "ADDRESS2:PORT1"
      "tail_proxy_url": "ADDRESS3:PORT2"
    "frontend_worker":
      "grpc_client_config":
        "max_send_msg_size": 104857600
      "match_max_concurrent": true
      "scheduler_address": "ADDRESS2:PORT1"
    "ingester":
      "chunk_block_size": 262144
      "chunk_encoding": "snappy"
      "chunk_idle_period": "1m"
      "chunk_retain_period": "30s"
      "chunk_target_size": 2097152
      "max_chunk_age": "5m"
      "lifecycler":
        "heartbeat_period": "5s"
        "interface_names":
          - "eth0"
        "join_after": "0s"
        "num_tokens": 512
        "ring":
          "heartbeat_timeout": "1m"
          "kvstore":
            "store": "memberlist"
      "max_transfer_retries": 0
      "wal":
        "dir": "/data/loki/wal"
        "enabled": false
        "replay_memory_ceiling": "100MB"
        "checkpoint_duration": "5m"
    "ingester_client":
      "grpc_client_config":
        "max_recv_msg_size": 67108864
      "remote_timeout": "5s"
    "limits_config":
      "cardinality_limit": 100000
      "creation_grace_period": "10m"
      "enforce_metric_name": false
      "ingestion_burst_size_mb": 1000
      "ingestion_rate_mb": 1000
      "ingestion_rate_strategy": "global"
      "max_cache_freshness_per_query": "10m"
      "max_chunks_per_query": 2000000
      "max_entries_limit_per_query": 5000
      "max_global_streams_per_user": 0
      "max_label_name_length": 1024
      "max_label_names_per_series": 30
      "max_label_value_length": 2048
      "max_line_size": 0
      "max_query_length": "721h"
      "max_query_parallelism": 16
      "max_query_series": 500
      "max_streams_per_user": 0
      "per_stream_rate_limit": "200MB"
      "per_stream_rate_limit_burst": "1000MB"
      "reject_old_samples": true
      "reject_old_samples_max_age": "24h"
      "split_queries_by_interval": "30m"
      "retention_period": "72h"
    "memberlist":
      "abort_if_cluster_join_fails": false
      "bind_port": PORT3
      "join_members":
        - "ADDRESS4:PORT3"
      "max_join_backoff": "1m"
      "max_join_retries": 10
      "min_join_backoff": "1s"
    "querier":
      "engine":
        "max_look_back_period": "30s"
        "timeout": "3m"
      "extra_query_delay": "0s"
      "max_concurrent": 2
      "query_ingesters_within": "3h"
      "query_timeout": "1h"
      "tail_max_duration": "1h"
    "query_range":
      "align_queries_with_step": true
      "cache_results": true
      "max_retries": 5
      "parallelise_shardable_queries": false
      "results_cache":
        "cache":
          "enable_fifocache": true
          "fifocache":
            "max_size_bytes": "500MB"
    "query_scheduler":
      "max_outstanding_requests_per_tenant": 256
    "schema_config":
      "configs":
        - "from": "2020-10-01"
          "index":
            "period": "24h"
            "prefix": "loki_index_"
          "object_store": "s3"
          "schema": "v11"
          "store": "boltdb-shipper"
    "server":
      "graceful_shutdown_timeout": "5s"
      "grpc_server_max_concurrent_streams": 1000
      "grpc_server_max_recv_msg_size": 104857600
      "grpc_server_max_send_msg_size": 104857600
      "grpc_server_min_time_between_pings": "10s"
      "grpc_server_ping_without_stream_allowed": true
      "http_listen_port": PORT2
      "http_server_idle_timeout": "120s"
      "http_server_write_timeout": "600s"
      "http_server_read_timeout": "600s"
      "log_format": "json"
      "log_level": "debug"
    "storage_config":
      "boltdb_shipper":
        "active_index_directory": "/data/loki/index"
        "cache_location": "/data/loki/index_cache"
        "cache_ttl": "24h"
        "index_gateway_client":
          "server_address": "ADDRESS5:PORT1"
        "resync_interval": "5m"
        "shared_store": "s3"
  overrides.yaml: '{}'

Couple of thoughts:

  1. "cardinality_limit": 100000: perhaps try an even higher number, like 1000000 or 500000.

  2. "ingestion_rate_strategy": "global": this means that the rate limit configurations are shared globally, so the more ingesters you deploy the less burst and overall traffic they are allowed to take. Personally I have this set to local:

ingestion_burst_size_mb: 200
ingestion_rate_mb: 100
ingestion_rate_strategy: local
  3. These two should have their values swapped, I think?
"per_stream_rate_limit": "200MB"
"per_stream_rate_limit_burst": "1000MB"

Since we don’t really know why it’s not working, I’m kind of just throwing stuff out blindly. Since it’s your test environment, I’d say go crazy, crank up a bunch of settings, and see if you can brute-force it. Also, some sort of debug log on the ingester or distributor (if you are running the distributor) might pinpoint the reason for you.

Thanks for the suggestions, we’ll try them in our next attempts.
Do you know where the “context canceled” error comes from and what it means?

│ {"caller":"logging.go:86","level":"warn","msg":"POST /loki/api/v1/push (500) 265.425µs Response: \"context canceled\\n\" ws: false; Accept: application/json; Accept-Encoding: gzip; Content-Encoding: snapp │

In my experience it usually means requests are flat out declined by Loki because of some sort of limits. I usually see that in our clusters when I send a query that would return too much data.

Yes, I’m also getting this type of “context canceled” error on the querier components when running queries under certain conditions (a query time range that is not short, some concurrency, etc.).
It would be helpful if the “context canceled” error message also included the reason, so I could understand which config limit has been exceeded.

As of Go 1.20 this is finally possible but it will take some time for us or others to improve Loki with these error messages.

As a general comment, I will say the k6 load test we built is not a very good representation of typical logging data; it effectively tries to send the maximum volume for every VU and every stream.

It’s quite hard to build a synthetic benchmark that recreates the kinds of workloads Loki typically sees, which is a combination of both high and low volume streams spread over many labels.

To try to get your setup to work, I would suggest a very small number of labels, and make sure they are the same between workers (don’t randomly generate labels; use fixed labels that are the same for all workers).

Your CPU limit of 2 is also probably a little low, and I suspect the errors you are seeing occur because your ingesters are unresponsive from being CPU-limited (just a guess here, though). Also, with the high cardinality you have and the way the k6 load test works, it is going to try to fill every stream to a maximum value, which is going to put a really high memory demand on the ingesters. Again, this kind of workload is not something we really see in practice; unfortunately the k6 load test just isn’t a very good real-world approximation.

Hello ewelch, thanks for your answer and suggestions. Yes, it would be great to have these error messages improved.
By the way, I moved to another cluster configuration with more resources (more nodes, CPU and memory). After switching to a low, fixed set of labels as you suggested (around 100 total cardinality), I’m now also experimenting on this dataset. Can this cardinality be considered low? What do you think?

const labelPool = loki.Labels({
  "instance": ["localhost"],
  "format":["apache_common","apache_combined","apache_error","rfc3164","rfc5424","json","logfmt"],
  "os":["linux", "windows"],
  "word": ["abc1","abc2"],
  "container": ["cnt1","cnt2","cnt3","cnt4"],
});   

That seems pretty reasonable :+1: but it still looks like it will generate, I think, 8 factorial streams, which is about 40,000 streams, and it will try to send as much data as it can to each of those streams (which could create a lot more memory pressure on the ingesters than 40k streams from an actual workload would produce).
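
As a rough back-of-the-envelope (an estimate, not a measurement): if it really does end up around 40,000 active streams, then with a chunk_target_size of 2 MiB like in the config you shared, the worst-case in-memory chunk data alone is on the order of 40,000 × 2 MiB ≈ 80 GiB spread across your ingesters, before replication and any WAL or query overhead, assuming every stream fills a full chunk at once. That is why this kind of fill-every-stream synthetic workload gets memory hungry so quickly.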