Hi,
Prometheus container image+version: prom/prometheus:v3.0.0
We’re running an AWS EKS cluster with 3 worker nodes and an APM setup (OTEL, Grafana, Jaeger & CloudWatch Logs).
We’ve managed to make all services highly available except for Prometheus, which is only running as a single pod/container. Whenever we ran multiple pods, we routinely hit issues where the pods were unable to take hold of Prometheus’s time series database and would shut down.
To combat the problem, we configured Thanos so that metrics older than 2 hours in Prometheus are pushed into an AWS S3 bucket automagically, and those metrics are retrieved through the Thanos Query pods (no issues here, it works perfectly).
However, if Prometheus has issues and stops, or if the node it’s running on falls over, there’ll be a brief period where Prom is down, as it’s currently only running as a single pod/container.
Is anyone able to advise if there’s a certain AWS service or backend storage that allows Prometheus to run on multiple pods without issues? We currently host the Prometheus TSDB on AWS EFS (I know it’s not recommended).
Cheers!