Can we reduce the load time of Grafana pages with 25k datasources?

  • What Grafana version and what operating system are you using?
    Grafana version : 9.5.12
    We are deploying via helm chart
    helm chart version: 7.0.21

  • What are you trying to achieve?
    We have a requirement to scale up to around 25k datasources. After scaling up, we see a lag of around 20-25 secs on initial page load. Switching between pages is not an issue, but refreshing any page takes 20-25 secs. This is because Grafana tries to load all the data on the initial load.

  • Also, when we open the datasource dropdown, it takes 3-4 secs to open. Here the time goes into rendering and scripting.
    [Screenshot 2024-01-19 at 2.35.25 PM]

We want to reduce this time to a minimum.

  • What optimisations have we tried?

  • Enabled the gzip option to reduce the download size => enable_gzip : true

  • Enabled client-side caching by setting the Cache-Control header on responses

  • Tried adding an nginx sidecar alongside Grafana to reduce the chunk download time of files

  • Disabled the Grafana analytics options

After trying all of these, we are at 20 secs of load time. If anyone has insights or recommendations for alternative solutions that could further optimize the load time, your suggestions would be greatly appreciated.

I think Grafana was never designed for 25k data sources. What is the use case for it?

This is because grafana tries to load all the data on the initial loading.

That’s not proven by your waterfall diagram. What is visible, though, is that the page has ~100MB of resources (!!! Just for comparison, a random Grafana of mine loads 10MB of resources for the Explore page).

Generally, first analyze what is causing the delay and which resource is the biggest. Enable tracing and you can analyze it from the traces.
Only when you know what is causing the problem will you have an idea of how to improve it. Any answer based on your current information will be just a guess.
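
One way to get the "biggest resource" answer without guessing is to export a HAR file from the browser's network tab and sort the entries by size. The following is only a rough sketch: the `largest_resources` helper, the sample URLs, and the timings are made up for illustration (the 83.6MB figure is the one reported later in this thread), and it relies only on the standard HAR fields `log.entries[].request.url`, `response.content.size`, and `time`.

```python
from typing import Any


def largest_resources(har: dict[str, Any], top: int = 5) -> list[tuple[str, int, float]]:
    """Return (url, body_bytes, total_ms) for the biggest responses in a HAR capture."""
    rows = []
    for entry in har["log"]["entries"]:
        content = entry["response"]["content"]
        # content.size is the uncompressed body length; treat unknown (-1) as 0.
        size = max(int(content.get("size", 0)), 0)
        rows.append((entry["request"]["url"], size, float(entry.get("time", 0.0))))
    # Largest bodies first, keep only the top N.
    return sorted(rows, key=lambda row: row[1], reverse=True)[:top]


# Tiny hand-made capture (hypothetical URLs and timings) so the sketch runs as-is.
# With a real export you would load the file with json.load() instead.
sample_har = {
    "log": {
        "entries": [
            {"request": {"url": "https://grafana.example/explore"},
             "response": {"content": {"size": 83_600_000}}, "time": 14000.0},
            {"request": {"url": "https://grafana.example/public/build/app.js"},
             "response": {"content": {"size": 4_200_000}}, "time": 900.0},
            {"request": {"url": "https://grafana.example/api/user/orgs"},
             "response": {"content": {"size": -1}}, "time": 120.0},
        ]
    }
}

for url, size, ms in largest_resources(sample_har):
    print(f"{size / 1e6:8.1f} MB  {ms:8.0f} ms  {url}")
```

Sorting by the third tuple element instead of the second gives the "slowest first" view, which is useful when the biggest resource is not the slowest one.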

In our Grafana we have a requirement to create a datasource for each tenant and each zone, so we have a total of 25k. Can you share the doc link for how to enable tracing through the helm chart? I am not able to find it in the Grafana documentation.

What is the biggest resource hog among these 25k datasources?

For example, you mention a drop-down that lags. What data source is it?

We need more insight into the configuration of your dashboards: what do they show, how are they accessed, and by whom? Does each tenant view only their own stuff? Is it slow for everyone, or just for superusers who see all tenants?

One team can have one or more application IDs (an appId is just a unique string). Under each appId they manage their resources such as VMs, K8s pods, etc. We identify metrics from different teams based on this appId. So users can select their appId from the datasource list and see or query all metrics emitted from that appId. We can’t reduce the number of datasources, which is why we need help reducing Grafana's load time.

The biggest resource is the HTML page, shown at the top of the image; its size is 83.6MB without gzip.

Why is it this big?
Because it contains the data for all 25k datasources inside window.grafanaBootData. Here is an example of one such datasource JSON object; the page contains 25k of them.

{
    "id": 12835,
    "uid": "dbc7490b-43fa-40dd-ab68-2e30828865c1",
    "type": "prometheus",
    "name": "appId",
    "meta":
    {
        "id": "prometheus",
        "type": "datasource",
        "name": "Prometheus",
        "info":
        {
            "author":
            {
                "name": "Grafana Labs",
                "url": "https://grafana.com"
            },
            "description": "Open source time series database & alerting",
            "links":
            [
                {
                    "name": "Learn more",
                    "url": "https://prometheus.io/"
                }
            ],
            "logos":
            {
                "small": "public/app/plugins/datasource/prometheus/img/prometheus_logo.svg",
                "large": "public/app/plugins/datasource/prometheus/img/prometheus_logo.svg"
            },
            "build":
            {},
            "screenshots": null,
            "version": "",
            "updated": ""
        },
        "dependencies":
        {
            "grafanaDependency": "",
            "grafanaVersion": "*",
            "plugins":
            []
        },
        "includes":
        [
            {
                "name": "Prometheus Stats",
                "path": "dashboards/prometheus_stats.json",
                "type": "dashboard",
                "component": "",
                "role": "Viewer",
                "addToNav": false,
                "defaultNav": false,
                "slug": "",
                "icon": "",
                "uid": ""
            },
            {
                "name": "Prometheus 2.0 Stats",
                "path": "dashboards/prometheus_2_stats.json",
                "type": "dashboard",
                "component": "",
                "role": "Viewer",
                "addToNav": false,
                "defaultNav": false,
                "slug": "",
                "icon": "",
                "uid": ""
            },
            {
                "name": "Grafana Stats",
                "path": "dashboards/grafana_stats.json",
                "type": "dashboard",
                "component": "",
                "role": "Viewer",
                "addToNav": false,
                "defaultNav": false,
                "slug": "",
                "icon": "",
                "uid": ""
            }
        ],
        "category": "tsdb",
        "preload": false,
        "backend": true,
        "routes":
        [
            {
                "path": "api/v1/query",
                "method": "POST",
                "reqRole": "Viewer",
                "url": "",
                "urlParams": null,
                "headers": null,
                "authType": "",
                "tokenAuth": null,
                "jwtTokenAuth": null,
                "body": null
            },
            {
                "path": "api/v1/query_range",
                "method": "POST",
                "reqRole": "Viewer",
                "url": "",
                "urlParams": null,
                "headers": null,
                "authType": "",
                "tokenAuth": null,
                "jwtTokenAuth": null,
                "body": null
            },
            {
                "path": "api/v1/series",
                "method": "POST",
                "reqRole": "Viewer",
                "url": "",
                "urlParams": null,
                "headers": null,
                "authType": "",
                "tokenAuth": null,
                "jwtTokenAuth": null,
                "body": null
            },
            {
                "path": "api/v1/labels",
                "method": "POST",
                "reqRole": "Viewer",
                "url": "",
                "urlParams": null,
                "headers": null,
                "authType": "",
                "tokenAuth": null,
                "jwtTokenAuth": null,
                "body": null
            },
            {
                "path": "api/v1/query_exemplars",
                "method": "POST",
                "reqRole": "Viewer",
                "url": "",
                "urlParams": null,
                "headers": null,
                "authType": "",
                "tokenAuth": null,
                "jwtTokenAuth": null,
                "body": null
            },
            {
                "path": "/rules",
                "method": "GET",
                "reqRole": "Viewer",
                "url": "",
                "urlParams": null,
                "headers": null,
                "authType": "",
                "tokenAuth": null,
                "jwtTokenAuth": null,
                "body": null
            },
            {
                "path": "/rules",
                "method": "POST",
                "reqRole": "Editor",
                "url": "",
                "urlParams": null,
                "headers": null,
                "authType": "",
                "tokenAuth": null,
                "jwtTokenAuth": null,
                "body": null
            },
            {
                "path": "/rules",
                "method": "DELETE",
                "reqRole": "Editor",
                "url": "",
                "urlParams": null,
                "headers": null,
                "authType": "",
                "tokenAuth": null,
                "jwtTokenAuth": null,
                "body": null
            },
            {
                "path": "/config/v1/rules",
                "method": "DELETE",
                "reqRole": "Editor",
                "url": "",
                "urlParams": null,
                "headers": null,
                "authType": "",
                "tokenAuth": null,
                "jwtTokenAuth": null,
                "body": null
            },
            {
                "path": "/config/v1/rules",
                "method": "POST",
                "reqRole": "Editor",
                "url": "",
                "urlParams": null,
                "headers": null,
                "authType": "",
                "tokenAuth": null,
                "jwtTokenAuth": null,
                "body": null
            }
        ],
        "skipDataQuery": false,
        "autoEnabled": false,
        "annotations": true,
        "metrics": true,
        "alerting": true,
        "explore": false,
        "tables": false,
        "logs": false,
        "tracing": false,
        "queryOptions":
        {
            "minInterval": true
        },
        "streaming": false,
        "signature": "internal",
        "module": "app/plugins/datasource/prometheus/module",
        "baseUrl": "public/app/plugins/datasource/prometheus"
    },
    "url": "/api/datasources/proxy/uid/dbc7490b-43fa-40dd-ab68-2e30828865c1",
    "isDefault": false,
    "access": "proxy",
    "preload": false,
    "module": "app/plugins/datasource/prometheus/module",
    "jsonData":
    {
        "directUrl": "http://10.83.23.34/select/4855/prometheus"
    },
    "readOnly": false,
    "cachingConfig":
    {
        "enabled": false,
        "TTLMs": 0
    }
}
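
As a sanity check on those numbers, here is a small back-of-the-envelope sketch. The totals are the ones reported in this thread; the `sample` object is a heavily abridged copy of the JSON above (most of `meta` is left out), and `compact_size` is just an illustrative helper. The `meta` block looks like it describes the Prometheus plugin rather than the individual datasource, so the sketch also measures how much of the object it accounts for. That reading of `meta` is an assumption based on the fields shown.

```python
import json

TOTAL_BYTES = 83.6e6   # uncompressed HTML size reported in this thread
DATASOURCES = 25_000   # approximate datasource count reported in this thread

per_datasource = TOTAL_BYTES / DATASOURCES
print(f"~{per_datasource:.0f} bytes of boot data per datasource")

# Abridged copy of the object above; the real "meta" block is far larger.
sample = {
    "id": 12835,
    "uid": "dbc7490b-43fa-40dd-ab68-2e30828865c1",
    "type": "prometheus",
    "name": "appId",
    "meta": {
        "id": "prometheus",
        "type": "datasource",
        "name": "Prometheus",
        "info": {"description": "Open source time series database & alerting"},
        "routes": [
            {"path": "api/v1/query", "method": "POST", "reqRole": "Viewer"},
            {"path": "api/v1/query_range", "method": "POST", "reqRole": "Viewer"},
        ],
    },
    "url": "/api/datasources/proxy/uid/dbc7490b-43fa-40dd-ab68-2e30828865c1",
    "jsonData": {"directUrl": "http://10.83.23.34/select/4855/prometheus"},
}


def compact_size(obj) -> int:
    """Size in bytes of the object serialized as compact JSON."""
    return len(json.dumps(obj, separators=(",", ":")).encode("utf-8"))


full = compact_size(sample)
without_meta = compact_size({k: v for k, v in sample.items() if k != "meta"})
print(f"meta share of this abridged object: {1 - without_meta / full:.0%}")
```

Even in the abridged copy, `meta` is a large part of the object; in the full object above, with ten routes and three includes, the share would be higher still.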

So cache/gzip that 83.6MB resource/response, not with Grafana itself but with a real cache service (e.g. nginx, varnish, …) in front of Grafana. But once you solve this problem, there can be a million other problems. As you can see, Grafana just doesn’t scale for this use case.
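
To get a feel for what gzip can and cannot do here, this is a minimal sketch on synthetic data. `fake_datasource` and its field values are invented for illustration and are much smaller than the real objects, so the ratio it prints is not a prediction for the real page. It only shows that highly repetitive JSON compresses well on the wire; the browser still has to decompress and parse the full payload, which may be part of why enabling gzip alone did not bring the load time down much.

```python
import gzip
import json
import uuid


def fake_datasource(i: int) -> dict:
    """Build a made-up, trimmed-down datasource entry with a deterministic uid."""
    uid = str(uuid.uuid5(uuid.NAMESPACE_DNS, f"ds-{i}"))
    return {
        "id": i,
        "uid": uid,
        "type": "prometheus",
        "name": f"app-{i}",
        # Same plugin metadata repeated in every entry, as in the object above.
        "meta": {
            "id": "prometheus",
            "type": "datasource",
            "name": "Prometheus",
            "module": "app/plugins/datasource/prometheus/module",
        },
        "url": f"/api/datasources/proxy/uid/{uid}",
        "access": "proxy",
    }


payload = json.dumps([fake_datasource(i) for i in range(1000)]).encode("utf-8")
compressed = gzip.compress(payload)

print(f"raw:  {len(payload)} bytes")
print(f"gzip: {len(compressed)} bytes")
print(f"ratio: {len(payload) / len(compressed):.1f}x")

# The client still ends up with the full payload after decompression.
assert gzip.decompress(compressed) == payload
```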

Your initial image doesn’t show that resource, so your initial analysis gives no clue about what else is causing a problem:

These are multiple screenshots with resources sorted by time taken