Hi all,
We are having an issue with our Mimir deployment in production. We've set out_of_order_time_window to 20 minutes, but some Alloy agents running on different clusters have started sending outdated metrics. As a result, we are seeing a high volume of errors in both the ingester and the distributor.
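For context, this is roughly how the out-of-order window is configured on our side (a minimal sketch of the per-tenant limits section; only the window value is from our setup):

```yaml
# Mimir per-tenant limits: allow out-of-order samples up to 20m old
limits:
  out_of_order_time_window: 20m
```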
The errors on the ingesters look like: "the sample has been rejected because another sample with a more recent timestamp has already been ingested and this sample is beyond the out-of-order time window of 20m (err-mimir-sample-timestamp-too-old)".
These errors are causing the distributor pods to OOM and restart continuously. Increasing the memory limit doesn't resolve the problem; memory eventually fills up again.
I've tried increasing out_of_order_time_window, but I'm still receiving metrics older than the threshold. Could you advise on how to resolve this issue and prevent it from happening again? Also, how can we protect the distributor from being overwhelmed by outdated data?
P.S.: I want to make sure I am safeguarding my distributor deployment. I know I can set the Alloy agent's sample_age_limit so it doesn't send outdated data.
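If I'm reading the Alloy docs correctly, sample_age_limit is set per remote_write endpoint inside the queue_config block; something like this (the endpoint URL is a placeholder for our actual Mimir push endpoint):

```alloy
prometheus.remote_write "mimir" {
  endpoint {
    url = "http://mimir-distributor:8080/api/v1/push"

    queue_config {
      // Drop samples older than this before sending; "0s" disables the limit.
      sample_age_limit = "20m"
    }
  }
}
```

Is aligning this with the server-side out_of_order_time_window the recommended way to keep stale samples from reaching the distributor in the first place?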