Best practice for OTel metrics from AWS Lambda with Mimir to avoid duplicate timestamp errors without exploding cardinality

Hi,

I’m looking for guidance on best practices for collecting custom OpenTelemetry metrics from AWS Lambda and sending them to Grafana Mimir.

Setup:

  • Grafana Mimir deployed in distributed mode
  • Installed via Helm chart 5.8.0
  • Running on AWS EKS
  • Using AWS Fargate for compute
  • Using S3 for block storage

The current data flow looks like this:

  • Lambda → ADOT layer / Alloy Lambda layer → Central Alloy → Mimir

In other words, metrics are emitted from Lambda, passed through an OpenTelemetry-based Lambda layer, forwarded to a centralized Alloy deployment, and then remote-written to Mimir.
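
Concretely, the central Alloy pipeline looks roughly like this (a minimal sketch; the listen addresses and the Mimir push URL are illustrative, not our exact values):

  // Receive OTLP metrics from the Lambda layers.
  otelcol.receiver.otlp "lambda" {
    grpc {
      endpoint = "0.0.0.0:4317"
    }
    http {
      endpoint = "0.0.0.0:4318"
    }
    output {
      metrics = [otelcol.exporter.prometheus.default.input]
    }
  }

  // Convert OTLP metrics into Prometheus samples.
  otelcol.exporter.prometheus "default" {
    forward_to = [prometheus.remote_write.mimir.receiver]
  }

  // Remote-write the samples to Mimir.
  prometheus.remote_write "mimir" {
    endpoint {
      url = "http://mimir-nginx.mimir.svc/api/v1/push"
    }
  }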

Problem:

We have a Lambda-based system where multiple instances of the same Lambda function may be provisioned at the same time, depending on load.

Under higher load, multiple Lambda instances often emit the same metric with the same timestamp but with different cumulative values, because each runtime instance maintains its own cumulative state.

This causes Mimir to reject samples with the following error:

  • err-mimir-sample-duplicate-timestamp

Why this happens:

From what I understand, the issue is that multiple short-lived Lambda instances are writing samples for what Mimir sees as the same time series, at the same timestamp, but with different values.

Workaround:

One possible fix is to add a label that identifies the Lambda instance so that each runtime instance gets its own distinct time series.
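
For example, assuming the Lambda layer populates the faas.instance resource attribute (or that we set an equivalent attribute ourselves, e.g. from AWS_LAMBDA_LOG_STREAM_NAME), a sketch of how the central Alloy could copy it onto every data point so it survives the Prometheus conversion as a label:

  otelcol.processor.transform "instance_label" {
    error_mode = "ignore"
    metric_statements {
      context = "datapoint"
      statements = [
        // Copy the per-runtime resource attribute onto each data point,
        // so every Lambda runtime instance gets its own series.
        `set(attributes["faas_instance"], resource.attributes["faas.instance"])`
      ]
    }
    output {
      metrics = [otelcol.exporter.prometheus.default.input]
    }
  }

This would be wired between the OTLP receiver and the Prometheus exporter in the pipeline above.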

Cardinality concern:

  • Lambda instances are short-lived
  • Instance identities change constantly
  • This may be manageable for a small number of functions, but it becomes much more concerning when you have thousands of Lambdas emitting metrics

So the tradeoff seems to be:

  • No instance label → duplicate timestamp conflicts
  • Add instance label → potentially very high cardinality

Main question:

What is the current best practice for handling OTel metrics with Mimir in Lambda-based architectures, especially at scale?

Any guidance would be appreciated.

I think this is really the only solution. As for your concern with cardinality, how many runtime instances do you have at any given time?

At any given time, I’ve seen up to 3 runtime instances for a single Lambda. However, that is for a Lambda that is lightly used compared with how we expect other production Lambdas to behave.

The bigger concern is that the instances are short-lived. A given instance ID may only be active for around 6 to 10 minutes, so with up to 3 concurrent instances churning at that rate, after an hour you could already have 20 to 30 distinct time series for a single metric.

Those can be aggregated at query time (example below), but even as a starting point, that feels concerning when thinking about supporting a large number of Lambdas. It seems especially concerning for use cases where someone wants to query data over a span of weeks.

I would expect that to result in slower queries, along with increased resource utilization.
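
To be clear, by query-time aggregation I mean something like this (hypothetical metric and label names):

  sum without (faas_instance) (
    rate(my_lambda_requests_total[5m])
  )

This collapses the per-instance series back into one series per label set, but Mimir still has to store and scan all of the underlying series.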

I think that might be OK. Mimir has a configuration option that enables cardinality monitoring; it might be a good idea to turn it on.
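
If I remember correctly, it is exposed as a per-tenant limit; with the Helm chart, something along these lines should enable it (a sketch — check the exact option name against your Mimir version):

  mimir:
    structuredConfig:
      limits:
        cardinality_analysis_enabled: true

That turns on Mimir's cardinality analysis API endpoints, which you can use to see which metrics and labels contribute the most series.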

If it becomes a problem, what you can do then is use a recording rule to aggregate away the unique runtime ID label. You might lose some resolution, though.
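
Something along these lines (metric, label, and rule names are made up):

  groups:
    - name: lambda_aggregations
      rules:
        - record: function:my_lambda_requests:rate5m
          expr: sum without (faas_instance) (rate(my_lambda_requests_total[5m]))

Dashboards and long-range queries then read the recorded series instead of the raw per-instance ones.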