OpenTelemetry Collector: Status Code 429 - the request has been rejected because the tenant exceeded the request rate limit

I am using OpenTelemetry Collector 0.104.0 (core distribution on Windows) to accept OTLP pushes, and have configured it to send once every ten seconds:

receivers:
  otlp:
    protocols:
      grpc:

processors:
  batch:
    timeout: 10s

exporters:
  otlphttp:
    endpoint: https://otlp-gateway-prod-eu-west-2.grafana.net/otlp
    headers:
      Authorization: Basic ...token...

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp]

Everything worked fine for some hours – until the sending application tried to push lots of measurements at once. Instantly, the collector logged the following in the Windows Event Log:

1.7204469686200826e+09	info	exporterhelper/retry_sender.go:118	Exporting failed. Will retry the request after interval.	{"kind": "exporter", "data_type": "metrics", "name": "otlphttp", "error": "Throttle (3s), error: rpc error: code = ResourceExhausted desc = error exporting items, request to https://otlp-gateway-prod-eu-west-2.grafana.net/otlp/v1/metrics responded with HTTP Status Code 429, Message=the request has been rejected because the tenant exceeded the request rate limit, set to 75 requests/s across all distributors with a maximum allowed burst of 750 (err-mimir-tenant-max-request-rate). To adjust the related per-tenant limits, configure -distributor.request-rate-limit and -distributor.request-burst-size, or contact your service administrator., Details=[]", "interval": "8.450550142s"}

Since then, nothing is being pushed or logged anymore! :sweat:

I am a bit confused. I thought the timeout setting would prevent exactly this situation? :thinking:

Any help appreciated! :slight_smile:

You are sending every 10 seconds but the batches are bigger => you are more likely to hit the 75 requests/s limit. I would go lower, e.g. 1s or 100ms, to decrease the probability of hitting that limit.
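
For illustration, a batch processor tuned that way could look like the following; the values are only examples to adjust to your own volume, not verified recommendations:

processors:
  batch:
    timeout: 1s                # flush at least once per second
    send_batch_size: 2000      # also flush once this many items have accumulated
    send_batch_max_size: 4000  # split batches larger than this before sending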

Of course you can still hit that limit, but 429 is a retryable error, so configure retries for your case so that the rate-limited records can be pushed later.
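
As a sketch, assuming the same otlphttp exporter as above, retry and queue behavior can be tuned via the standard exporterhelper settings (the values here are illustrative only):

exporters:
  otlphttp:
    endpoint: https://otlp-gateway-prod-eu-west-2.grafana.net/otlp
    headers:
      Authorization: Basic ...token...
    retry_on_failure:
      enabled: true
      initial_interval: 5s    # wait before the first retry
      max_interval: 30s       # cap the backoff between retries
      max_elapsed_time: 300s  # drop the data only after retrying this long
    sending_queue:
      enabled: true
      queue_size: 5000        # buffer batches in memory while being throttled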

Or try contacting Grafana Cloud support and ask them to increase that limit for your account. Maybe it's a configurable limit.

Jan, thank you for chiming in! :smiley:

Actually, I do not understand your proposal. What Grafana Cloud complains about is not the number of measurements per second but the number of requests per second, which IMHO should be exactly 0.1 (one request every 10 seconds). So going lower (I had no limit at all before => the OTel Collector's 200ms default, but the problem was still there) would increase the number of requests per second, merely lowering the number of measurements per request. Or maybe Grafana Cloud's error message is simply wrong, and by "requests" it actually means measurements?

As both the OTel Collector and Grafana Cloud are widely used these days, I also wonder why there is no clear, proven instruction on how to configure the OTel Collector so it never hits Grafana Cloud's default limits.

I agree with your findings. But I would still try my idea. That OTLP endpoint is just a gateway, which may have its own business logic, e.g. it may break a batch into individual points for scalability, so one incoming request can become multiple requests behind the gateway. Contact your (paid) Grafana support if you want to be sure.

A lot of Grafana Cloud's internals are not published, so only they know how it works under the hood. For example, they have tiers, each tier may have different limits (e.g. request limits), instances may move between tiers based on load, … so it's a pretty complex setup. That's probably also a reason why they haven't published a recommended setup.

It seems this is a hard limitation of the free subscription, which cannot be changed.

Unfortunately, no change to the collector's config has helped fully: it reduces the number of such error messages, but does not fully prevent them.

It seems we have to live with the fact that it is simply impossible to configure the collector in a way that always prevents this problem. :frowning:

Try playing with the prometheusremotewrite exporter and the Prometheus/Mimir endpoint of Grafana Cloud. Maybe batching will help there.
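
A minimal sketch of that alternative; the endpoint URL is a placeholder that would need to be replaced with the remote write URL from your Grafana Cloud stack details:

exporters:
  prometheusremotewrite:
    # placeholder URL - copy the real one from your Grafana Cloud portal
    endpoint: https://prometheus-prod-XX.grafana.net/api/prom/push
    headers:
      Authorization: Basic ...token...

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheusremotewrite]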

What is the max/avg datapoint rate you need to ingest?

Actually, I am looking for some kind of "official solution", not "playing around until it randomly works". :wink:

In fact, the sole application pushing OTLP metrics to the OTel Collector runs only once per minute, and delivers a batch of just 62 metrics. Hence I assume the OTel Collector and/or the OTel SDK for Java collect lots of information by default, which I have not checked yet.
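
One way to check what is actually flowing through the collector is to temporarily add the debug exporter next to otlphttp and read its output; this is a diagnostic sketch, not part of the original configs in this thread:

exporters:
  debug:
    verbosity: detailed  # print every received datapoint to the collector's log

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp, debug]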

Then support is the only right place to ask.