I am unable to reduce the load of Grafana

I am using Grafana in a Proof of Concept project. I feed the network traffic summary and the latency data of about 500 networks and their 2 ISP connections into InnoDB. I created one dashboard for each network in Grafana, using the provisioning directory. Both InnoDB and Grafana are running as containers. A small Python script feeds the data into InnoDB, and another one configures Grafana. In total I have about 500 dashboards.

My goal is to embed these dashboards into a web application, so Grafana would be used only on demand, when it needs to render a graph (or a dashboard). My problem is that even when no one is accessing Grafana, there is quite a substantial load on the server. The uptime command shows numbers like 300 or even more. The server is still quite responsive, so the overall result is not very bad. Looking with htop I can see that the load jumps up and down: for some seconds all cores are idling, then usage goes up to 100%, then back to 0%. Grafana has hundreds of threads running.

My initial thought was that if I switched off alerting, Grafana would stop checking and I could get to a near-zero load. Unfortunately, that did not do the trick. I tried to search, but I could not find anything relevant. Could anyone please point me in the right direction? Is this even possible with Grafana?

Thanks,

A.

Why is it bad? I guess it is just load average - it only says that ~300 processes are waiting for something. That can be storage (slow storage, low IOPS), network (slow network), …
It doesn’t indicate that Grafana is doing something; it can be that InnoDB is just writing (processing, compacting, indexing, …) something. You need to find which processes are waiting and why. Generally, high load doesn’t mean anything is overloaded.
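One way to see which tasks are actually contributing to that number is to list the ones in running (R) or uninterruptible-sleep (D) state - a rough sketch, and the exact columns can vary between ps versions:

# show tasks that are runnable or blocked on I/O, i.e. the ones counted in the load average
ps -eo state,pid,comm,wchan | awk '$1 == "R" || $1 == "D"'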

Well, it is Grafana, for a fact, as I checked that. Otherwise I would not be posting here… :slight_smile: My point is that Grafana should not be doing anything. Normally only the InnoDB part should be working, as data is fetched from the source and pushed into InnoDB. The Grafana part should only come into play when I want to see the graphs. This is what I am not able to achieve, as Grafana is doing something, and I do not know what. Maybe this is how it works and it cannot be changed?

Stop Grafana and show the current 1-minute load from uptime. If the load continues, then provide a fully reproducible Grafana setup (not just “I have some Grafana which has high load”) and iostat, vmstat, netstat outputs.
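For reference, something along these lines would capture the numbers being asked for (the flags and intervals are only suggestions):

# 1-minute / 5-minute / 15-minute load averages
uptime
# per-device I/O utilisation, 3 samples, 5 seconds apart
iostat -x 5 3
# run queue, memory, swap and CPU, 3 samples, 5 seconds apart
vmstat 5 3
# established TCP connections (add -p as root to see the owning processes)
netstat -tn
# one-shot per-container CPU/memory/IO figures
docker stats --no-stream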

that sounds like you are past POC :wink:

What backend are you using for Grafana: SQLite or something else?

Haha, you might ask: if this is the PoC, what will production be like, right? InnoDB - I mentioned it.

:scream:

I skipped coffee, sorry. InnoDB it is.

Have you looked at the fine-tuning best practices for InnoDB, just in case the issue is there?

https://dev.mysql.com/doc/refman/8.0/en/innodb-configuring-io-capacity.html

Like @jangaraj said, the burden of proof is on you :slight_smile:

You would be surprised how many people post performance issues here that end up being something other than Grafana: badly designed tables being queried, badly designed InfluxDB fields and tags, etc.

Hi Jangaraj, thanks for your help!

Here it is:

root@docker:~# uptime
15:26:03 up 1 day, 23:15, 1 user, load average: 245.54, 274.75, 288.42
root@docker:~# docker stop grafana
root@docker:~# uptime
16:05:20 up 1 day, 23:54, 1 user, load average: 0.89, 1.01, 23.51

The (virtual) server has 8 cores assigned to it.

This is how I started Grafana:
docker run \
  -d \
  --name grafana \
  -p 3000:3000 \
  -e GF_SECURITY_ALLOW_EMBEDDING=true \
  -e GF_ALERTING_ENABLED=false \
  --volume /opt/grafana:/var/lib/grafana \
  --volume /opt/grafana_provisioning:/etc/grafana/provisioning/ \
  --restart always \
  docker.io/grafana/grafana-oss

Data source:
/opt/grafana_provisioning/datasources/influxdb.yaml

apiVersion: 1
datasources:

  - name: influxdb
    type: influxdb
    access: proxy
    url: http://redacted:8086
    secureJsonData:
      token: redacted
    jsonData:
      version: Flux
      organization: ca4897a5012af79b
      defaultBucket: 1ee0a58969037b94

There are about 500 dashboards like this:
/opt/grafana_provisioning/dashboards/L_678.yaml
apiVersion: 1

providers:

  - name: 'Country - City'
    orgId: 1
    folder: ''
    folderUid: ''
    type: file
    options:
      path: /etc/grafana/provisioning/dashboards/L_678.json

The JSON is a bit long to post here. Let me know if you see anything wrong in what I have already posted.

It has about 10 panels with queries like this:

from(bucket: "KPI")
|> range(start: v.timeRangeStart, stop: v.timeRangeStop)
|> filter(fn: (r) => r["_measurement"] == "traffic")
|> filter(fn: (r) => r["_field"] == "received")
|> filter(fn: (r) => r["network"] == "L_678")
|> filter(fn: (r) => r["line"] == "wan1")
|> filter(fn: (r) => r["role"] == "primary")
|> aggregateWindow(every: v.windowPeriod, fn: mean, createEmpty: false)
|> map(fn: (r) => ({r with _value: r._value / 1.0 }))
|> yield(name: "received")

from(bucket: "KPI")
|> range(start: v.timeRangeStart, stop: v.timeRangeStop)
|> filter(fn: (r) => r["_measurement"] == "traffic")
|> filter(fn: (r) => r["_field"] == "sent")
|> filter(fn: (r) => r["network"] == "L_678")
|> filter(fn: (r) => r["line"] == "wan1")
|> filter(fn: (r) => r["role"] == "primary")
|> aggregateWindow(every: v.windowPeriod, fn: mean, createEmpty: false)
|> map(fn: (r) => ({r with _value: r._value / -1.0 }))
|> yield(name: "sent")

OK, so InnoDB is actually InfluxDB. We still don’t know the Grafana version (I will “love” you if you say that it’s “latest”, because “docker.io/grafana/grafana-oss” is the latest image, of course).
iostat, vmstat, netstat, docker stats outputs? You provided nothing but a single number (load), which can mean nothing. Enable tracing for your Grafana and check the traces - you will see what Grafana is doing.
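Pinning an explicit image tag in the run command removes that ambiguity - a sketch, where 10.1.0 is only an example tag:

# assumption: 10.1.0 is just an example tag; pick the release you actually intend to run
docker run -d --name grafana -p 3000:3000 \
  -e GF_SECURITY_ALLOW_EMBEDDING=true \
  --volume /opt/grafana:/var/lib/grafana \
  --volume /opt/grafana_provisioning:/etc/grafana/provisioning/ \
  --restart always \
  docker.io/grafana/grafana-oss:10.1.0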


Yep, InfluxDB. My mistake.

The Grafana version is a bit hidden, but here it is: Version 10.1.0 (commit: ff85ec33c5, branch: HEAD)

iostat

Linux 5.15.0-92-generic (docker)        01/31/2024      _x86_64_        (8 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          31.88    0.00   23.13    0.17    0.00   44.82

Device             tps    kB_read/s    kB_wrtn/s    kB_dscd/s    kB_read    kB_wrtn    kB_dscd
loop0             0.00         0.01         0.00         0.00       1224          0          0
loop1             0.00         0.02         0.00         0.00       3056          0          0
loop2             0.00         0.01         0.00         0.00       2223          0          0
loop3             0.00         0.01         0.00         0.00       1140          0          0
loop4             0.00         0.00         0.00         0.00        372          0          0
loop5             0.01         0.56         0.00         0.00      99624          0          0
loop6             0.00         0.01         0.00         0.00       1201          0          0
loop7             0.00         0.01         0.00         0.00       2637          0          0
sda             572.70        13.47      3847.22         0.00    2413977  689421148          0
sr0               0.00         0.12         0.00         0.00      20922          0          0

vmstat

procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
117  0  67840 315304 255932 2025040    0    0     2   485   29    5 32 23 45  0  0

docker stats

ID            NAME        CPU %       MEM USAGE / LIMIT  MEM %       NET IO             BLOCK IO           PIDS        CPU TIME          AVG CPU %
704686452b57  grafana     600.90%     659.8MB / 4.101GB  16.09%      355.9kB / 222.5kB  11.85MB / 87.18MB  848         41m40.421414s     300.45%
bd23b0bc3e75  influxdb    0.67%       425.1MB / 4.101GB  10.37%      8.444GB / 20.47GB  598.4MB / 141.1GB  23          12h55m11.388838s  0.33%

docker ps

Emulate Docker CLI using podman. Create /etc/containers/nodocker to quiet msg.
CONTAINER ID  IMAGE                                 COMMAND     CREATED         STATUS             PORTS                   NAMES
bd23b0bc3e75  docker.io/library/influxdb:latest     influxd     4 months ago    Up 2 days ago      0.0.0.0:8086->8086/tcp  influxdb
704686452b57  docker.io/grafana/grafana-oss:latest              23 minutes ago  Up 12 minutes ago  0.0.0.0:3000->3000/tcp  grafana

netstat

Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 localhost:52346         localhost:8086          TIME_WAIT
tcp        0      0 docker:32914            158.115.147.239:https   TIME_WAIT
tcp        0      0 localhost:52342         localhost:8086          ESTABLISHED
tcp        0      0 localhost:57442         localhost:8086          TIME_WAIT
tcp        0      0 docker:ssh              10.10.0.61:57550        ESTABLISHED
tcp        0      0 docker:40942            158.115.147.239:https   TIME_WAIT
tcp        0      0 docker:43662            158.115.147.212:https   TIME_WAIT
tcp        0      0 docker:40928            158.115.147.239:https   ESTABLISHED
tcp        0    244 docker:ssh              10.10.0.61:60724        ESTABLISHED
tcp        0      0 localhost:46840         localhost:5000          TIME_WAIT
tcp        0      0 localhost:57446         localhost:8086          TIME_WAIT
Active UNIX domain sockets (w/o servers)
[cut for saving some space]

I am still struggling with the traces. I enabled tracing, have the file, have Go installed, and this is the result:
go tool trace /opt/grafana/trace.out
2024/01/31 18:20:17 Parsing trace…
failed to parse trace: no EvFrequency event

A.

Please format your output next time (don’t torture us, please). All those stats commands have parameters, but this is still better than nothing. You have a problem with IOPS. Use the standard debugging approach (which you should have used from the start): increase the log level and watch the Grafana server logs. Use app tracing, not Golang tracing: Configure Grafana | Grafana documentation
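A minimal sketch of what that could look like on the container from earlier - the GF_TRACING_OPENTELEMETRY_OTLP_ADDRESS variable and the tempo:4317 endpoint are assumptions based on the [tracing.opentelemetry.otlp] section; check the linked documentation for the exact keys:

# raise the log level and point Grafana's built-in OpenTelemetry tracing at a collector (assumed endpoint)
docker run -d --name grafana -p 3000:3000 \
  -e GF_LOG_LEVEL=debug \
  -e GF_TRACING_OPENTELEMETRY_OTLP_ADDRESS=tempo:4317 \
  --volume /opt/grafana:/var/lib/grafana \
  --volume /opt/grafana_provisioning:/etc/grafana/provisioning/ \
  --restart always \
  docker.io/grafana/grafana-oss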

Blind guess: you mixed up dashboards (dashboard UIDs, versions, names), so now provisioning is confused and is overwriting dashboards in the DB over and over and over …

I also wonder whether these settings are affecting things. Set log_queries for a short period to see what is happening; I am not sure whether it is set to true by default.
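If that refers to the log_queries option in the [database] section, one way to flip it on temporarily would be the usual GF_SECTION_KEY environment mapping - an assumption worth double-checking against the configuration docs:

# add to the docker run command shown earlier; assumption: maps to [database] log_queries in grafana.ini
-e GF_DATABASE_LOG_QUERIES=true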

Found what the issue was. I raised the log level by specifying -e GF_LOG_LEVEL=debug for the container. The log was full of msg="Start walking disk" messages. Apparently, by default the provisioned dashboards are checked for changes quite often. I added "updateIntervalSeconds: 600" to the dashboard YAML files and now the load sits nicely below one. I could raise this number even higher, since when I change the YAML files I can notify Grafana about the change. Thanks to all of you!
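For anyone landing here later, a sketch of what one of the provider files looks like with the interval raised - the fields mirror the provider posted above, and 600 is just the value that worked here:

apiVersion: 1

providers:
  - name: 'Country - City'
    orgId: 1
    folder: ''
    folderUid: ''
    type: file
    # rescan the provisioned dashboard files every 10 minutes instead of the much shorter default
    updateIntervalSeconds: 600
    options:
      path: /etc/grafana/provisioning/dashboards/L_678.json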