Scaling Mimir Alertmanager in single-tenant distributed deployment on EKS/Fargate with high alert volume

Hi all,

I’m looking for guidance on scaling Grafana Mimir Alertmanager in a distributed-mode deployment running on AWS EKS with AWS Fargate.

Our setup is currently single-tenant. I’m trying to understand the practical scaling limits and the recommended architecture for handling thousands of alerts.

Current understanding:

From what I understand, Mimir’s Alertmanager shards its work per tenant. In a single-tenant setup, that seems to imply that Alertmanager does not scale horizontally the way the ruler does, because there is only one tenant’s alert workload to distribute, so adding replicas would not spread it any further.

By contrast, the ruler component appears to scale horizontally and divide the load as more rules are created and more pods are spun up.
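
To make my mental model concrete, here is a toy sketch of per-tenant sharding. It is not Mimir's actual ring code: the `owners` and `busy_replicas` helpers are hypothetical, the hashing scheme is a simple stand-in, and I am assuming a replication factor of 3 and a tenant ID of `anonymous`. It only illustrates why I think a single tenant ends up on a fixed number of replicas no matter how many pods exist.

```python
import hashlib


def owners(tenant, replicas, replication_factor=3):
    """Pick the replicas that own a tenant.

    Toy rendezvous-style hashing, NOT Mimir's real hash ring.
    replication_factor=3 is an assumption for illustration.
    """
    def score(replica):
        return hashlib.sha256(f"{tenant}/{replica}".encode()).hexdigest()

    # Rank replicas by a per-tenant hash and keep the first few.
    return sorted(replicas, key=score)[:replication_factor]


def busy_replicas(tenants, replicas, replication_factor=3):
    """Return the set of replicas that own at least one tenant."""
    busy = set()
    for tenant in tenants:
        busy.update(owners(tenant, replicas, replication_factor))
    return busy


# Hypothetical pod names and tenant IDs.
replicas = [f"alertmanager-{i}" for i in range(10)]

single = busy_replicas(["anonymous"], replicas)
many = busy_replicas([f"tenant-{i}" for i in range(200)], replicas)

print(f"single tenant: {len(single)} of {len(replicas)} replicas do work")
print(f"200 tenants:   {len(many)} of {len(replicas)} replicas do work")
```

If this model is right, going from 10 to 50 Alertmanager pods would leave the single-tenant case unchanged, which is the behavior I would like to confirm or correct.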

Problem we’re trying to solve:

We have a system that may need to handle thousands of alerts, and I’m trying to figure out the best way to scale Alertmanager capacity in that kind of setup. Ideally, it would scale horizontally on its own and divide the load across replicas. Any guidance would be appreciated.

Environment:

  • Helm Chart: 5.8.0
  • Type: Distributed mode
  • Cluster type: AWS EKS
  • Compute: AWS Fargate
  • Tenant: single-tenant