OpenTelemetry Collector: Status Code 429 - the request has been rejected because the tenant exceeded the request rate limit

I am using OpenTelemetry Collector 0.104.0 (core distribution on Windows) to accept OTLP pushes, and have configured it to send once every ten seconds:

receivers:
  otlp:
    protocols:
      grpc:

processors:
  batch:
    timeout: 10s

exporters:
  otlphttp:
    endpoint: https://otlp-gateway-prod-eu-west-2.grafana.net/otlp
    headers:
      Authorization: Basic ...token...

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp]

Everything worked fine for some hours – until the sending application tried to push lots of measurements at once. Instantly, the collector logged the following in the Windows Event Log:

1.7204469686200826e+09	info	exporterhelper/retry_sender.go:118	Exporting failed. Will retry the request after interval.	{"kind": "exporter", "data_type": "metrics", "name": "otlphttp", "error": "Throttle (3s), error: rpc error: code = ResourceExhausted desc = error exporting items, request to https://otlp-gateway-prod-eu-west-2.grafana.net/otlp/v1/metrics responded with HTTP Status Code 429, Message=the request has been rejected because the tenant exceeded the request rate limit, set to 75 requests/s across all distributors with a maximum allowed burst of 750 (err-mimir-tenant-max-request-rate). To adjust the related per-tenant limits, configure -distributor.request-rate-limit and -distributor.request-burst-size, or contact your service administrator., Details=[]", "interval": "8.450550142s"}

Since then, nothing is being pushed or logged anymore! :sweat:

I am a bit confused. I thought the timeout setting would prevent exactly this situation? :thinking:

Any help appreciated! :slight_smile:

You are sending every 10 seconds but the batches are bigger => you are more likely to hit the 75 requests/s limit. I would go lower, e.g. 1s or 100ms, to decrease the probability of hitting that limit.
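
For illustration, a batch processor tuned that way could look like the following; the values are only examples to adjust to your own volume, not verified recommendations:

processors:
  batch:
    timeout: 1s                # flush at least once per second
    send_batch_size: 2000      # also flush once this many items have accumulated
    send_batch_max_size: 4000  # split batches larger than this before sending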

Of course you can still hit that limit, but 429 is a retryable error, so configure retries for your case so that the rate-limited records can be pushed later.
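
As a sketch, assuming the same otlphttp exporter as above, retry and queue behavior can be tuned via the standard exporterhelper settings (the values here are illustrative only):

exporters:
  otlphttp:
    endpoint: https://otlp-gateway-prod-eu-west-2.grafana.net/otlp
    headers:
      Authorization: Basic ...token...
    retry_on_failure:
      enabled: true
      initial_interval: 5s    # wait before the first retry
      max_interval: 30s       # cap the backoff between retries
      max_elapsed_time: 300s  # drop the data only after retrying this long
    sending_queue:
      enabled: true
      queue_size: 5000        # buffer batches in memory while being throttled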

Or try contacting Grafana Cloud support and ask them to increase that limit for your account. Maybe it's a configurable limit.

Jan, thank you for chiming in! :smiley:

Actually, I do not understand your proposal. What Grafana Cloud complains about is not the number of measurements per second but the number of requests per second, which IMHO should be exactly 0.1 (one request every 10 seconds). So going lower (I had no limit at all before => the OTel Collector's 200ms default, but the problem was still there) would increase the number of requests per second, merely lowering the number of measurements per request. Or maybe Grafana Cloud's error message is simply wrong, and by "requests" it actually means measurements?

As both the OTel Collector and Grafana Cloud are widely used these days, I also wonder why there is no clear, proven instruction on how to configure the OTel Collector so it never hits Grafana Cloud's default limits.

I agree with your findings. But I would still try my idea. That OTLP endpoint is just a gateway, which may have its own business logic, e.g. it may break a batch into individual points for scalability, so one incoming request can become multiple requests behind the gateway. Contact your (paid) Grafana support if you want to be sure.

A lot of Grafana Cloud's internals are not published, so only they know how it works under the hood. For example, they have tiers, each tier may have different limits (e.g. request limits), instances may move between tiers based on load, … so it's a pretty complex setup. That's probably also a reason why they haven't published a recommended setup.

It seems this is a hard limitation of the free subscription, which cannot be changed.

Unfortunately, no change to the collector's config has helped fully: it reduces the number of such error messages, but does not fully prevent them.

It seems we have to live with the fact that it is simply impossible to configure the collector in a way that always prevents this problem. :frowning:

Try playing with the prometheusremotewrite exporter and the Prometheus/Mimir endpoint of Grafana Cloud. Maybe batching will help there.
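
A minimal sketch of that alternative; the endpoint URL is a placeholder that would need to be replaced with the remote write URL from your Grafana Cloud stack details:

exporters:
  prometheusremotewrite:
    # placeholder URL - copy the real one from your Grafana Cloud portal
    endpoint: https://prometheus-prod-XX.grafana.net/api/prom/push
    headers:
      Authorization: Basic ...token...

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheusremotewrite]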

What is the max/avg datapoint rate you need to ingest?

Actually, I am looking for some kind of "official solution", not "playing around until it randomly works". :wink:

In fact, the sole application pushing OTLP metrics to the OTel Collector runs only once per minute, and delivers a batch of just 62 metrics. Hence I assume the OTel Collector and/or the OTel SDK for Java collect lots of information by default, which I have not checked yet.
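
One way to check what is actually flowing through the collector is to temporarily add the debug exporter next to otlphttp and read its output; this is a diagnostic sketch, not part of the original configs in this thread:

exporters:
  debug:
    verbosity: detailed  # print every received datapoint to the collector's log

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp, debug]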

Then support is the only right place to ask.