Prometheus and Grafana Best Practice

We have a number of different services in Kubernetes that we would like to set up monitoring for. I’ve found that a lot of Helm charts have options to install Prometheus and Grafana pre-configured for that service, which makes setting up monitoring easier. However, this approach also means you end up with multiple copies of Prometheus and Grafana running.

My question is: within a Kubernetes cluster, what is the best practice for installing Prometheus and Grafana?

  1. Should there be a single installation of Prometheus and Grafana that all other services register with, or

  2. Should each service be allowed to install its own instance of Prometheus and Grafana?

My initial thought is to use a single installation that all other services register their metrics endpoints with, but I’m not sure if this is correct. Are there limitations of a single installation that I need to be aware of?
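For the "services register with a central Prometheus" model, one common mechanism is the Prometheus Operator's `ServiceMonitor` resource: each service ships a small CRD that tells the one cluster-wide Prometheus what to scrape. A minimal sketch, assuming the Operator (e.g. via kube-prometheus-stack) is installed; all names and labels below are illustrative:

```yaml
# Hypothetical ServiceMonitor: points a single, cluster-wide Prometheus
# (managed by the Prometheus Operator) at a service's metrics endpoint.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-service
  namespace: monitoring
  labels:
    release: prometheus     # must match the operator's serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: my-service       # matches the Service's labels (not the Pods')
  namespaceSelector:
    matchNames:
      - default
  endpoints:
    - port: http-metrics    # a named port on the Service
      path: /metrics
      interval: 30s
```

Each team can ship a `ServiceMonitor` alongside their chart while the cluster still runs only one Prometheus.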

  1. For a small system, a single installation is enough. If you need redundancy, check out Thanos for the Prometheus installation, and likewise for Grafana. In my experience, a single Grafana installation is enough for more than 500 physical machines in my cluster; I also take a daily dump of the Grafana MySQL database as a backup.

  2. Please see #1:grinning:

The main limitation is on the Prometheus side: you need to keep an eye on disk usage, and you can build a dashboard to monitor it. Check the default storage location, /var/lib/prometheus, or whatever location is set in the config.
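For such a disk-usage dashboard, a couple of queries along these lines could work (the mountpoint is an assumption; adjust it to wherever your Prometheus data directory actually lives):

```promql
# Bytes currently used by the Prometheus TSDB (Prometheus's own metric)
prometheus_tsdb_storage_blocks_bytes

# Fraction of the backing volume in use, via node_exporter;
# the mountpoint label is illustrative
1 - node_filesystem_avail_bytes{mountpoint="/var/lib/prometheus"}
      / node_filesystem_size_bytes{mountpoint="/var/lib/prometheus"}
```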
For Grafana, if fewer than 100 concurrent users connect to it, a single instance with adequate memory and CPU is all you need. I suggest using MySQL as the Grafana backend.
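Switching Grafana from its default embedded SQLite to MySQL is a small change in the `[database]` section of grafana.ini (or the equivalent environment variables). A sketch, with placeholder host and credentials:

```ini
; grafana.ini — point Grafana at an external MySQL database instead of
; the default embedded SQLite. Host, name, user, and password are placeholders.
[database]
type = mysql
host = mysql.monitoring.svc:3306
name = grafana
user = grafana
password = changeme
```

With the dashboard state in MySQL, the daily backup mentioned above is just a regular `mysqldump` of that database.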

Regards,
Fadjar Tandabawana

Thanks Fadjar for the feedback.

Our K8s cluster is locally hosted and I can’t see us having more than 10 worker nodes in the near future, so it sounds like a single instance of Prometheus and Grafana will be adequate.

I’ll look into the Prometheus storage - I assume the disk space you need is a function of how many metrics you plan to scrape?

Indeed…
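For a rough estimate, the Prometheus storage documentation gives a sizing formula: needed disk ≈ retention time × ingested samples per second × bytes per sample, with Prometheus averaging roughly one to two bytes per sample on disk. A small sketch with illustrative numbers (the per-node series count and scrape interval below are assumptions, not measurements):

```python
# Rough Prometheus TSDB sizing, following the formula from the
# Prometheus storage docs:
#   needed_disk = retention_seconds * ingested_samples_per_second * bytes_per_sample

def estimate_disk_bytes(retention_days: int,
                        samples_per_second: float,
                        bytes_per_sample: float = 2.0) -> float:
    """Estimate disk needed; Prometheus averages ~1-2 bytes per sample."""
    return retention_days * 24 * 3600 * samples_per_second * bytes_per_sample

# Example: ~10 nodes exposing ~1,000 series each, scraped every 15s
samples_per_second = 10 * 1000 / 15      # ~667 samples/s
needed = estimate_disk_bytes(90, samples_per_second)
print(f"~{needed / 1e9:.0f} GB for 90 days of retention")
```

So a small cluster like this lands in the tens of gigabytes for 90 days, which matches the dashboard below.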

Below is my Prometheus dashboard for 90 days in the small cluster.

Thanks Fadjar - that gives me a good place to start with the initial storage size :+1: