Grafana Docker container has CPU spikes and fails to load huge dashboards from multiple datasources

Hello,

We are running Grafana and Prometheus in a clustered environment.

Lately we have had issues with huge dashboards that either do not load at all or sometimes load only after a few minutes. In cAdvisor we see the Grafana container's CPU usage spiking above 100%, but the host server itself is not CPU-saturated.

In the logs, we see these errors:

Failed creating data source proxy error="validation of data source URL \"\" failed: empty URL string" traceID=

level=error msg="Request Completed" method=GET path=/api/datasources/proxy/uid/...../api/v1/status/buildinfo status=500 remote_addr=ip time_ms=1 duration=1.923898ms size=61 referer="https://url/dashboard?from=now-1m&orgId=1&to=now" handler=/api/datasources/proxy/uid/:uid/*
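The first error appears to point at a data source whose URL is empty. One way to check is to list the data sources through Grafana's HTTP API (GET /api/datasources) and flag the ones without a URL. This is a minimal sketch, not a definitive tool: the base URL and token passed to fetch_datasources are placeholders, and some data source types may legitimately have no URL.

```python
import json
import urllib.request


def find_empty_url_datasources(datasources):
    """Return (name, uid, type) for every data source whose URL is missing or blank."""
    return [
        (ds.get("name"), ds.get("uid"), ds.get("type"))
        for ds in datasources
        if not (ds.get("url") or "").strip()
    ]


def fetch_datasources(base_url, token):
    """Fetch the data source list from Grafana (GET /api/datasources).

    base_url and token are placeholders: use your own Grafana address and
    a token that is allowed to read data sources.
    """
    req = urllib.request.Request(
        base_url.rstrip("/") + "/api/datasources",
        headers={"Authorization": "Bearer " + token},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)
```

Calling find_empty_url_datasources(fetch_datasources(base_url, token)) returns the suspects, and their uid can be compared with the uid in the failing /api/datasources/proxy/uid/... request.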

These loading issues do not occur on all dashboards.

What can we do to debug or increase performance?

Grafana version: v10.0.3 (eb8dd72637)

Thank you!

Why is it a huge dashboard?

The problem is usually an inefficient panel query, e.g. one that pulls a lot of data into Grafana, chews through it, and then visualizes the result as a single number. So what are your queries doing?

I wouldn’t say it contains complex queries. Most panels display the count of a query result over the last minute, with no refresh. The dashboard loads in 3-4 minutes; it looks like it tries multiple times to display the values and only succeeds after a while. Otherwise, it never loads.

Be exact, pls: Datasource type and queries.

Datasource type: Prometheus. The dashboard loads data from 3 datasources.

Example of queries:

  • count(sum by (cluster) (up{job="sso-exporter", env="prd", cluster!=""}) == 0) or on() vector(0)

  • max by (domain) (abs(agroal_active_count{env="prd", job="quarkus-exporter", domain="domain"}))

  • count by () (
      (max by (application) (ibm_mq_queue_manager_status{env="prd"}) != 1)
      or
      (max by (application) (up{job="IBMMQ", env="prd"}) != 1)
    )
    or on() vector(0)

I have 17 visualizations that contain queries similar to the ones above.

The queries don’t look heavy. Any transformations? Dashboard autorefresh? How does your Prometheus behave when there are 17 parallel queries? Consider enabling lazy loading (you have quite an old version, so check the docs for how).
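To see how Prometheus copes with the parallel load without Grafana in the path, you could fire the panel queries at the Prometheus HTTP API (GET /api/v1/query) at the same time and time each one. This is a rough sketch under assumptions: the Prometheus address is a placeholder, no authentication is handled, and time_queries is a hypothetical helper name.

```python
import json
import time
import urllib.parse
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def instant_query(base_url, expr, timeout=30):
    """Run one instant query against Prometheus (GET /api/v1/query)."""
    params = urllib.parse.urlencode({"query": expr})
    url = base_url.rstrip("/") + "/api/v1/query?" + params
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def time_queries(exprs, run_query, max_workers=17):
    """Run all expressions in parallel and return {expr: seconds taken}.

    run_query is any callable that takes a PromQL string, so the timing
    logic can be tried without a live Prometheus.
    """
    def timed(expr):
        start = time.perf_counter()
        run_query(expr)
        return expr, time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(timed, exprs))
```

For example, time_queries(queries, lambda q: instant_query("http://prometheus:9090", q)), where the address is hypothetical. If every query returns quickly here while the dashboard still takes minutes, the bottleneck is more likely in Grafana than in Prometheus.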

Also, what backend database is Grafana using? SQLite? MySQL?

We use MariaDB with Galera clustering as the backend.

I have a Reduce transformation on almost all visualizations (Reduce → Series to Rows / Calculations: Total).

I set dashboard autorefresh to OFF, otherwise it doesn’t load…

We noticed that the issue in the dashboards was caused by multiple panels that filter alerts (Alert list panels). But we still encounter slowness sometimes, on some dashboards, even after removing some of the Alert list panels. What performance issues can Alert list panels cause?
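To find which dashboards still carry many of these panels, a small script over the exported dashboard JSON can count panels by type and transformations by id. This is only a sketch: it assumes the Alert list panel is stored with type "alertlist" (worth confirming in your own export), and summarize_dashboard is a hypothetical helper name.

```python
from collections import Counter


def iter_panels(panels):
    """Yield every panel, including panels nested inside rows."""
    for panel in panels or []:
        yield panel
        yield from iter_panels(panel.get("panels"))


def summarize_dashboard(dashboard):
    """Count panels per type and transformations per id in a dashboard JSON model."""
    types = Counter()
    transformations = Counter()
    for panel in iter_panels(dashboard.get("panels")):
        types[panel.get("type", "unknown")] += 1
        for transformation in panel.get("transformations") or []:
            transformations[transformation.get("id", "unknown")] += 1
    return types, transformations
```

Load the export with json.load and pass the resulting dict in. If the JSON comes from the HTTP API rather than the UI export, the model may be wrapped in a "dashboard" key, so pass that inner object instead.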