We are running Grafana and Prometheus in a clustered environment.
Lately we have had issues with large dashboards that fail to load, or sometimes load only after a few minutes. In cAdvisor we see the Grafana container's CPU usage spike above 100%, but the host server itself does not show CPU saturation.
In the logs, we see these errors:
Failed creating data source proxy error="validation of data source URL \"\" failed: empty URL string" traceID=
level=error msg="Request Completed" method=GET path=/api/datasources/proxy/uid/...../api/v1/status/buildinfo status=500 remote_addr=ip time_ms=1 duration=1.923898ms size=61 referer="https://url/dashboard?from=now-1m&orgId=1&to=now" handler=/api/datasources/proxy/uid/:uid/*
These loading issues do not affect all dashboards.
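One way to follow up on the "empty URL string" error is to list the data sources and look for any whose URL is blank. The sketch below is only an illustration: it assumes Grafana's `GET /api/datasources` endpoint is reachable with an API token, and the function names, base URL and token are placeholders, not anything from the setup described above. Some data source types legitimately have no URL, so by default it only checks those of type `prometheus`.

```python
import json
import urllib.request


def find_empty_url_datasources(datasources, ds_type="prometheus"):
    """Return (name, uid) pairs for data sources whose URL is missing or blank.

    Pass ds_type=None to check every data source regardless of type.
    """
    return [
        (ds.get("name"), ds.get("uid"))
        for ds in datasources
        if (ds_type is None or ds.get("type") == ds_type)
        and not (ds.get("url") or "").strip()
    ]


def fetch_datasources(base_url, token):
    """Fetch the data source list from Grafana's GET /api/datasources endpoint."""
    req = urllib.request.Request(
        f"{base_url}/api/datasources",
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)


# Example usage (hypothetical URL and token, replace with your own):
#   sources = fetch_datasources("http://localhost:3000", "YOUR_API_TOKEN")
#   for name, uid in find_empty_url_datasources(sources):
#       print(f"data source {name!r} (uid={uid}) has no URL")
```

If this reports a data source, comparing its uid with the one in the failing `/api/datasources/proxy/uid/...` request path may show whether the affected dashboards point at it.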
The problem is usually an inefficient panel query, e.g. one that pulls a lot of data into Grafana, processes it there, and then visualizes the result as a single number. So what are your queries doing?
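To check whether a panel query is of that kind, it can help to see how much data it actually returns. The sketch below is a rough illustration: it assumes you have the JSON body of a Prometheus `/api/v1/query` or `/api/v1/query_range` response (for example copied from the panel's query inspector), and `result_size` is a hypothetical helper name. The sample responses are made up for the example.

```python
def result_size(response):
    """Return (series_count, sample_count) for a Prometheus query response dict."""
    data = response["data"]
    result = data["result"]
    if data["resultType"] == "matrix":
        # Range query: each series carries a list of [timestamp, value] pairs.
        return len(result), sum(len(series["values"]) for series in result)
    if data["resultType"] == "vector":
        # Instant query: each series carries exactly one sample.
        return len(result), len(result)
    # Scalar or string results hold a single value.
    return 1, 1


# Made-up responses: a raw selector returning three series versus the same
# data aggregated server-side into one series.
raw = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"instance": "a"}, "value": [1700000000, "1"]},
            {"metric": {"instance": "b"}, "value": [1700000000, "1"]},
            {"metric": {"instance": "c"}, "value": [1700000000, "0"]},
        ],
    },
}
aggregated = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [{"metric": {}, "value": [1700000000, "3"]}],
    },
}

print(result_size(raw))         # (3, 3)
print(result_size(aggregated))  # (1, 1)
```

If a single-number panel turns out to receive thousands of series, moving the aggregation into the query itself (for example with `count(...)` in PromQL) keeps that work on the Prometheus side.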
I wouldn't say the dashboard contains complex queries. I am mostly displaying the count of a query, over the last minute, with no refresh. It loads in 3-4 minutes: it looks like it tries multiple times to display the values in the dashboard and only succeeds after a while. Otherwise, it never loads.
The queries don't look heavy. Any transformations? Dashboard auto-refresh? How does your Prometheus behave when there are 17 parallel queries? Consider enabling lazy loading (you have quite an old version, so check the docs for how).
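A quick way to see how Prometheus copes with that many simultaneous queries is to send them yourself, bypassing Grafana, and time each one. This is a minimal sketch under some assumptions: Prometheus is reachable without authentication at a base URL you supply, `instant_query` and `time_parallel` are hypothetical helper names, and the queries should be the ones taken from your own panels.

```python
import json
import time
import urllib.parse
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def instant_query(base_url, promql, timeout=30):
    """Run one instant query against Prometheus' GET /api/v1/query endpoint."""
    url = f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def time_parallel(queries, run_query, max_workers=17):
    """Run all queries concurrently; return (query, seconds) pairs in input order."""
    def timed(query):
        start = time.perf_counter()
        run_query(query)
        return query, time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(timed, queries))


# Example usage (hypothetical URL, replace the list with your panel queries):
#   base = "http://localhost:9090"
#   panel_queries = ["count(up)"] * 17
#   for query, seconds in time_parallel(panel_queries, lambda q: instant_query(base, q)):
#       print(f"{seconds:6.2f}s  {query}")
```

If every query returns quickly here while the dashboard still takes minutes, the bottleneck is more likely in Grafana itself (the data source proxy, the panels, or the container's CPU limit) than in Prometheus.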
We noticed that the issue in the dashboards was caused by multiple alert list panels that filter alerts. But we still encounter slowness sometimes, on some dashboards, even after removing some of the alert list panels. What performance issues can alert list panels cause?