Grafana dashboard best practice for large scale monitoring

We have spark clusters with 100-200 nodes and we plot several metrics of executors, driver

We are not sure of the best way to create a dashboard at this scale. Visualizing all 100-200 nodes and their executor stats doesn't surface problems because there is too much noise, and it also slows the dashboard down tremendously.

What are some good practices around grafana dashboards?

  1. Visualize using top K
  2. Plot only anomalies? How do we detect anomalies?
  3. How to reduce noise?
  4. How to make the dashboard more performant?

We use Prometheus as the backend.
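For points 1 and 2 above, PromQL itself can do a lot of the filtering before anything reaches the panel. A rough sketch — the metric names and labels here are placeholders, not your actual Spark metric names:

```promql
# Top-K: plot only the 10 executors with the highest heap usage
topk(10, jvm_memory_bytes_used{job="spark-executor", area="heap"})

# Crude anomaly filter: keep only series that are more than
# 3 standard deviations away from their own 1h rolling average
(
  executor_cpu_usage - avg_over_time(executor_cpu_usage[1h])
) / stddev_over_time(executor_cpu_usage[1h]) > 3
```

Both approaches cut the number of series Grafana has to render, which helps with the noise and the dashboard slowness at the same time.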

welcome to the :grafana: forum, @fhalde

One way to make dashboards more performant is to use the `-- Dashboard --` special data source:

This lets you reuse the results from another panel in a new panel, which avoids duplicate queries when, say, you are visualizing the same data two different ways.


This one is amazing! Beautiful tip!! Thank you!

Apart from this, do you think the general recommendation would be to reduce max data points? We sometimes look at dashboards with 24h+ time ranges. I don't know how best to pick a value that adapts dynamically when the time range changes without hiding spikes in the graphs.

One way to always see spikes would be to use `max` aggregation everywhere.
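Concretely, one way to do that in PromQL is to wrap the query in `max_over_time` with a window tied to Grafana's `$__interval` variable, so each rendered point carries the maximum of its step window rather than a single sample. A sketch, assuming a hypothetical metric name:

```promql
# Each plotted point shows the max within its step window,
# so short spikes survive downsampling over 24h+ ranges
max_over_time(executor_gc_time_seconds[$__interval])
```

Because `$__interval` grows as the time range widens, this stays consistent with a reduced max data points setting: the resolution drops, but the spikes are folded into the coarser points instead of being skipped.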