Ensuring Prometheus High Availability for Latency-based HPA

Service latency is a critical metric that directly impacts user experience, SLA adherence, and ultimately business revenue.
While CPU or memory usage shows the internal state of the system, latency reflects how users actually experience the service.

For this reason, we built latency-based HPA (Horizontal Pod Autoscaling) on top of the probe_http_duration_seconds metric that Prometheus scrapes from the Blackbox exporter.
Instead of scaling on CPU usage alone, the system reacts directly to increases in service latency by scaling out Pods.
This allows us to handle sudden traffic surges more effectively while also optimizing resource utilization.
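
As a rough illustration, the HPA object itself can remain a standard autoscaling/v2 resource; the sketch below assumes the metric is exposed through the external metrics API by an adapter such as prometheus-adapter, and the resource names, label selector, and 300ms threshold are placeholders rather than our production values.

```yaml
# Minimal latency-based HPA sketch. Assumes probe_http_duration_seconds is served
# through the external metrics API (e.g. by prometheus-adapter); all names and the
# 300ms threshold are illustrative.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-latency-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 3
  maxReplicas: 30
  metrics:
    - type: External
      external:
        metric:
          name: probe_http_duration_seconds
          selector:
            matchLabels:
              job: blackbox-http     # placeholder selector for the probe series
        target:
          type: Value
          value: 300m                # scale out once probe latency exceeds ~0.3s
```

The adapter side still needs a rule that maps the Prometheus series onto this external metric name; that mapping is specific to the adapter in use and is omitted here.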


The Hidden Risk: Prometheus as a Single Point of Failure (SPOF)

The core assumption behind latency-based HPA is that Prometheus must always be available.
However, with its single binary and local TSDB (Time Series Database) design, Prometheus faces some serious limitations:

  • Single Point of Failure (SPOF): If the Prometheus instance crashes, metric collection stops and HPA loses its scaling signal.
  • High Memory Consumption: High-cardinality labels quickly increase memory usage and may lead to OOM (Out of Memory) errors.
  • Scalability Challenges: Prometheus is optimized for single-node setups, which makes managing multiple instances and centralizing data difficult at scale.
  • Lack of Disaster Recovery: Without remote storage, long-term retention and recovery are nearly impossible.

In such a setup, a Prometheus outage translates directly into a service outage.


Real-world Failure Scenario

Consider the following situation:

  1. A sudden traffic surge pushes service latency beyond the defined threshold.
  2. At the same moment, Prometheus crashes due to memory overload or an unexpected error.
  3. HPA no longer receives scaling signals.
  4. Pod count remains static while traffic keeps growing → leading to a service outage.

This scenario clearly shows why Prometheus cannot remain as a single point of failure.


Memory Optimization Strategies

Prometheus memory usage is primarily driven by:

  • Number of active series
  • Label cardinality
  • Retention period
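
Of these, the active series count is usually the dominant factor, and it is worth watching before it turns into an OOM kill. A minimal sketch, assuming the Prometheus Operator (as deployed by kube-prometheus-stack) is in use, is a PrometheusRule that alerts on Prometheus' own prometheus_tsdb_head_series self-monitoring metric; the two-million threshold and all names below are illustrative.

```yaml
# Hypothetical PrometheusRule that warns when the active (head) series count
# approaches the memory budget. Threshold and names are placeholders.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: prometheus-cardinality-budget
  namespace: monitoring
spec:
  groups:
    - name: prometheus-cardinality
      rules:
        - alert: PrometheusHighActiveSeries
          expr: prometheus_tsdb_head_series > 2e6
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Active series are approaching the memory budget; review label cardinality."
```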

Effective strategies include the following (a configuration sketch follows this list):

  • Adjust TSDB retention (prometheusSpec.retention): Keep only the data you truly need locally, reducing memory and disk overhead.
    • For further gains, pair the short local retention with a solution like Mimir that provides high availability and long-term storage.
  • Refine scrape configurations (prometheusSpec.additionalScrapeConfigs): Avoid collecting unnecessary or overly granular metrics to reduce series cardinality at the source.
  • Use recording rules selectively: Each rule creates new series, so precompute only the computationally expensive, frequently evaluated queries.
  • Integrate Remote Write: Ship data to external long-term storage so local retention can stay short, easing Prometheus memory and disk pressure.
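
Mapped onto the kube-prometheus-stack Helm values the post already references, the first, second and fourth strategies could look roughly like the sketch below; the retention window, the Blackbox target, the dropped-metric pattern and the remote-write endpoint are all placeholders for whatever fits your environment.

```yaml
# Sketch of kube-prometheus-stack values covering retention, scrape refinement
# and remote write. All concrete values are placeholders.
prometheus:
  prometheusSpec:
    retention: 24h                          # keep only what HPA and dashboards need locally
    remoteWrite:
      - url: http://mimir-gateway.monitoring.svc:8080/api/v1/push   # long-term storage
    additionalScrapeConfigs:
      - job_name: blackbox-http
        metrics_path: /probe
        params:
          module: [http_2xx]
        static_configs:
          - targets:
              - https://web.example.com/healthz
        relabel_configs:
          - source_labels: [__address__]
            target_label: __param_target
          - source_labels: [__param_target]
            target_label: instance
          - target_label: __address__
            replacement: blackbox-exporter.monitoring.svc:9115
        metric_relabel_configs:
          - source_labels: [__name__]
            regex: probe_ip_.*              # placeholder: drop series nothing ever queries
            action: drop
```

For the recording-rule item, one natural candidate in this setup is pre-aggregating the per-phase probe_http_duration_seconds gauge so that the expression the HPA polls stays cheap; the record name is hypothetical, and the group would live inside a PrometheusRule like the one shown earlier.

```yaml
groups:
  - name: latency-hpa-recording
    rules:
      - record: probe:http_duration_seconds:sum   # hypothetical record name
        expr: sum without (phase) (probe_http_duration_seconds)
```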

Prometheus Extension Solutions: Achieving High Availability

To overcome these limitations and ensure HA, long-term retention, and disaster recovery, several extension solutions can be adopted:

  • Thanos: A mature solution that pairs a sidecar on each Prometheus instance with object storage to provide HA and long-term retention (a configuration sketch follows this list).
  • Mimir: An open-source project by Grafana Labs, designed to handle billions of time series with extreme scalability and high availability.
  • Cortex: A distributed metric storage system focused on multi-tenancy, still used by many enterprises.
  • VictoriaMetrics: A lightweight, high-performance time series database that can serve as Prometheus remote storage or a standalone alternative, with a low resource footprint.
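
To make the Thanos option concrete, the values below sketch what the HA side could look like with kube-prometheus-stack: two Prometheus replicas scrape the same targets (removing the SPOF), and the Thanos sidecar ships blocks to object storage for long-term retention. The secret, bucket configuration and labels are placeholders, and the Thanos query/store/compactor components are assumed to be deployed separately.

```yaml
# Sketch of kube-prometheus-stack values for a Thanos-based HA setup.
# Secret names, labels and replica count are placeholders; Thanos Query,
# Store Gateway and Compactor are assumed to run as separate deployments.
prometheus:
  prometheusSpec:
    replicas: 2                      # identical replicas scraping the same targets
    externalLabels:
      cluster: prod
    thanos:
      objectStorageConfig:           # Secret holding the S3/GCS bucket definition
        name: thanos-objstore
        key: objstore.yml
```

With this layout, Thanos Query deduplicates the replica pair using the replica external label injected by the Prometheus Operator, so the HPA keeps receiving its scaling signal even if one Prometheus instance goes down, provided the adapter feeding the HPA queries Thanos Query rather than a single Prometheus.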

Conclusion

Latency-based HPA is a powerful approach — but it only works if Prometheus itself remains reliable.
Without HA and memory optimization, Prometheus can become the very source of service outages.

When running Prometheus at scale, what was the first major issue you encountered?

  • Memory pressure (OOM)
  • Federation complexity
  • SPOF risk
  • Long-term data retention