Architecture advice: how to gather metrics from many places

Hey folks, I’m struggling with analysis paralysis, and I’m hoping someone more experienced can kick me through it.

I have a Grafana instance (v10.4.0) hosted in Kubernetes cluster K0. Using it, I would like to be able to view and alert on metrics (Prometheus metrics, mostly) from several different places:

  • Kubernetes cluster K1
  • The hardware underlying cluster K1
  • Kubernetes cluster K2
  • The hardware underlying cluster K2
  • Various other pieces of hardware that aren’t hosting Kubernetes clusters (let’s call these Other)

From my research, there appear to be several different ways to approach this:

  1. Set up a single Prometheus server alongside Grafana in K0. Configure metrics exporters in K1 and K2, protected by kube-rbac-proxy. Configure the exporters on Other with a similar authenticated proxy. Have that single Prometheus instance authenticate through kube-rbac-proxy to scrape K1 and K2, as well as Other (rough scrape config sketched just after this list).
  2. Set up a Prometheus server within each of K1 and K2, each also responsible for gathering metrics from that cluster’s underlying hosts. Put basic auth in front of them and configure each as a data source in Grafana (data-source sketch after this list, too). What do I do about Other, though? Do I need a third Prometheus instance somewhere? I’m drowning in network boundaries.
  3. Set up a single Prometheus server alongside Grafana in K0. Set up a pushgateway in K1 and K2, with jobs in each cluster pushing metrics from the cluster and its underlying hosts to it. Other is set up with an authenticated proxy as in (1). The pushgateways and Other are then scraped as targets by the main Prometheus server.
  4. Set up a single Prometheus server alongside Grafana in K0. Set up Prometheus instances in agent mode within K1 and K2, each scraping its cluster and the underlying hardware and forwarding everything to K0 via remote_write. The same approach is used for Other.
  5. Set up Grafana Agent in K1, K2, and Other. Set up Mimir alongside Grafana in K0. Send metrics from the agents to Mimir; no Prometheus necessary… I think? Gah, my head is hurting now. What is Mimir?
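To make (1) a bit more concrete, here’s roughly the scrape config I’m imagining for the central Prometheus in K0. Every hostname, port, path, and job name below is a placeholder I made up; the assumption is that each cluster exposes its exporters externally behind kube-rbac-proxy, and that the central Prometheus holds a token the proxy will accept:

```yaml
# Hypothetical scrape config for the central Prometheus in K0 (option 1).
# All hostnames, ports, file paths, and job names are placeholders.
scrape_configs:
  - job_name: "k1-node-exporter"
    scheme: https
    # Token for a ServiceAccount that kube-rbac-proxy in K1 is set up to authorize.
    authorization:
      type: Bearer
      credentials_file: /etc/prometheus/secrets/k1-scrape-token
    tls_config:
      ca_file: /etc/prometheus/secrets/k1-ca.crt
    static_configs:
      - targets: ["node-exporter.k1.example.internal:9100"]

  - job_name: "other-hardware"
    scheme: https
    # The non-Kubernetes hosts, behind whatever authenticated proxy I put in front of them.
    basic_auth:
      username: prometheus
      password_file: /etc/prometheus/secrets/other-password
    static_configs:
      - targets: ["host-a.other.example.internal:9100"]
```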
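And for (2), the Grafana side is the part I think I understand: provisioning each per-cluster Prometheus as its own data source, something like this (names, URLs, and environment variables invented for illustration):

```yaml
# Hypothetical Grafana data source provisioning for option 2.
# Names, URLs, and the env vars are placeholders.
apiVersion: 1
datasources:
  - name: Prometheus-K1
    type: prometheus
    access: proxy
    url: https://prometheus.k1.example.internal
    basicAuth: true
    basicAuthUser: grafana
    secureJsonData:
      basicAuthPassword: ${K1_PROM_PASSWORD}
  - name: Prometheus-K2
    type: prometheus
    access: proxy
    url: https://prometheus.k2.example.internal
    basicAuth: true
    basicAuthUser: grafana
    secureJsonData:
      basicAuthPassword: ${K2_PROM_PASSWORD}
```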

Hopefully you get an idea of why I’m paralyzed, here.

(1) sounds okay on the surface, but the Prometheus blog post introducing agent mode says this:

Scraping across network boundaries can be a challenge if it adds new unknowns in a monitoring pipeline. The local pull model allows Prometheus to know why exactly the metric target has problems and when. Maybe it’s down, misconfigured, restarted, too slow to give us metrics (e.g. CPU saturated), not discoverable by service discovery, we don’t have credentials to access or just DNS, network, or the whole cluster is down. By putting our scraper outside of the network, we risk losing some of this information by introducing unreliability into scrapes that is unrelated to an individual target. On top of that, we risk losing important visibility completely if the network is temporarily down. Please don’t do it. It’s not worth it. (:

That seems to shoot this idea down.

(2) feels ungainly and heavy.

I discounted (3) pretty quickly because the Prometheus docs make it clear that the pushgateway isn’t really intended for this kind of use case; it’s meant for capturing metrics from short-lived batch jobs.

Which leads me toward (4) and (5), but I’m having trouble contrasting those two approaches. The idea of keeping everything in the Grafana stack has some appeal, assuming I’m even right about (5) not needing a Prometheus instance. That said, Grafana Agent is pre-1.0, which makes me nervous. On the other hand, I think agent mode in Prometheus is still behind a feature flag. Which sends me back to the beginning of my list, saying “surely this architecture isn’t so outlandish that only brand-new, immature tools support it; I must be missing something.”
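For what it’s worth, here’s roughly how I picture (4): a stripped-down Prometheus in agent mode inside K1 (and another in K2), scraping locally and forwarding everything to the central Prometheus over remote_write. Again, every URL, path, and credential here is made up:

```yaml
# Hypothetical agent-mode config for K1 (option 4); URLs, paths, and credentials are placeholders.
# Run with: prometheus --enable-feature=agent --config.file=prometheus-agent.yml
scrape_configs:
  # Scrape the kubelets in K1 via Kubernetes service discovery.
  - job_name: "kubernetes-nodes"
    scheme: https
    kubernetes_sd_configs:
      - role: node
    authorization:
      credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt

  # Scrape node-exporter on K1's underlying hosts.
  - job_name: "k1-hardware"
    static_configs:
      - targets: ["node-a.k1.example.internal:9100", "node-b.k1.example.internal:9100"]

# Forward everything to the central Prometheus in K0.
# I believe that Prometheus needs --web.enable-remote-write-receiver to accept this.
remote_write:
  - url: "https://prometheus.k0.example.internal/api/v1/write"
    basic_auth:
      username: k1-agent
      password_file: /etc/prometheus/secrets/remote-write-password
```

From what I can tell, (5) would look structurally similar, except the agent is Grafana Agent (whose metrics config also boils down to scrape configs plus a remote_write block) and the remote_write URL points at Mimir’s push endpoint instead of a Prometheus server. Which is part of why I’m struggling to see the real difference.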

Can someone please knock some sense into me? Thank you!