Hi,
I’m looking for guidance on best practices for collecting custom OpenTelemetry metrics from AWS Lambda and sending them to Grafana Mimir.
Setup:
- Grafana Mimir deployed in distributed mode
- Installed via Helm chart 5.8.0
- Running on AWS EKS
- Using AWS Fargate for compute
- Using S3 for block storage
The current data flow looks like this:
- Lambda → ADOT layer / Alloy Lambda layer → Central Alloy → Mimir
In other words, metrics are emitted from Lambda, passed through an OpenTelemetry-based Lambda layer, forwarded to a centralized Alloy deployment, and then remote-written to Mimir.
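For context, the central Alloy pipeline is roughly this shape (endpoint URLs and component names are simplified placeholders, auth omitted):

```alloy
// OTLP in from the Lambda layers
otelcol.receiver.otlp "lambda" {
  grpc {
    endpoint = "0.0.0.0:4317"
  }
  output {
    metrics = [otelcol.exporter.prometheus.to_mimir.input]
  }
}

// Convert OTLP metrics to Prometheus samples
otelcol.exporter.prometheus "to_mimir" {
  forward_to = [prometheus.remote_write.mimir.receiver]
}

// Remote-write to Mimir (URL is a placeholder)
prometheus.remote_write "mimir" {
  endpoint {
    url = "http://mimir-nginx.mimir.svc/api/v1/push"
  }
}
```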
Problem:
We have a Lambda-based system where multiple instances of the same Lambda function may be provisioned at the same time, depending on load.
Under higher load, multiple Lambda instances often emit the same metric at the same timestamp but with different cumulative values, since each sample originates from a different Lambda runtime instance.
This causes Mimir to reject samples with the error:
- err-mimir-sample-duplicate-timestamp
Why this happens:
From what I understand, the issue is that multiple short-lived Lambda instances are writing samples for what Mimir sees as the same time series, at the same timestamp, but with different values.
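To make the collision concrete, here is a toy model of what I think is happening (metric name, labels, and values are invented):

```python
# Two concurrent Lambda instances report the same cumulative counter.
# Mimir keys an incoming sample by (series label set, timestamp); both
# samples below share that key but carry different values, so the
# second one is rejected with err-mimir-sample-duplicate-timestamp.
samples = [
    # (metric name, label set, timestamp in ms, cumulative value)
    ("lambda_invocations_total", (("function", "checkout"),), 1_700_000_000_000, 5.0),  # instance A
    ("lambda_invocations_total", (("function", "checkout"),), 1_700_000_000_000, 3.0),  # instance B
]

seen = {}
conflicts = []
for name, labels, ts, value in samples:
    key = (name, labels, ts)
    if key in seen and seen[key] != value:
        # Same series, same timestamp, different value: a duplicate-timestamp conflict
        conflicts.append(key)
    seen[key] = value

print(conflicts)  # one conflicting (series, timestamp) key
```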
Workaround:
One possible fix is to add a label that identifies the Lambda instance so that each runtime instance gets its own distinct time series.
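If I understand the Alloy components correctly, one way to get such a label without touching Lambda code would be `resource_to_telemetry_conversion` on `otelcol.exporter.prometheus`, which copies OTLP resource attributes onto each series as labels. The Lambda resource detectors already set `faas.instance` per execution environment, so that attribute would become the distinguishing label (note this copies all resource attributes, not just `faas.instance`):

```alloy
otelcol.exporter.prometheus "to_mimir" {
  // Copies resource attributes (including faas.instance, set by the
  // Lambda resource detector) onto each series as labels.
  resource_to_telemetry_conversion = true
  forward_to = [prometheus.remote_write.mimir.receiver]
}
```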
Cardinality concern:
- Lambda instances are short-lived
- Instance identities change constantly
- This may be manageable for a small number of functions, but it becomes much more concerning when you have thousands of Lambdas emitting metrics
So the tradeoff seems to be:
- No instance label → duplicate timestamp conflicts
- Add instance label → potentially very high cardinality
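To put rough numbers on the cardinality side, here is a back-of-envelope estimate; every input below is an assumption, not a measurement from our environment:

```python
# Rough active-series and churn estimate if every Lambda execution
# environment gets its own instance label. All inputs are assumptions.
functions = 2000            # assumed number of Lambda functions
avg_concurrency = 10        # assumed concurrent instances per function
series_per_instance = 15    # assumed distinct metric series per instance
churn_per_day = 50          # assumed fresh instance identities per slot per day

active_series = functions * avg_concurrency * series_per_instance
daily_created = active_series * churn_per_day

print(active_series)   # series live at any moment
print(daily_created)   # new series created per day from instance churn
```

Even with modest per-function numbers, the churn term dominates: short-lived instances keep creating new series that Mimir must index until they go stale.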
Main question:
What is the current best practice for handling OTel metrics with Mimir in Lambda-based architectures, especially at scale?
Any guidance would be appreciated.