Grafana not able to handle 50 concurrent users

  • What Grafana version and what operating system are you using?
    Grafana: 7.3.4
    OS: running on Azure Kubernetes

  • What are you trying to achieve?
    Have Grafana display the results from the DataSource in a timely manner for 50+ concurrent users.

  • How are you trying to achieve it?
    Two Grafana server instances are deployed, with either Postgres or MySQL as the database (also deployed on Azure Kubernetes), and a single API acting as the DataSource (also on Azure Kubernetes).

  • What happened?
    Recreating the dashboard's requests in JMeter with a refresh time of 5 seconds and simulating 50 concurrent users led to the conclusion that Grafana cannot handle the load.
    Testing the API in isolation showed that it could handle the load.

The first problem was CPU usage: a single Grafana instance quickly maxed out its assigned 2-core node. Adding a second instance on a different node brought CPU usage down to reasonable levels, but response times were still too high.

After turning on several logging options in the config, the second problem appeared to be the database: as the response time for requests increased, so did the time taken to query the database. Both Postgres and MySQL were tried, but both produced similar results, with high response times. (The results of one test can be seen in image1.)

image1: test with load balancer > 2 Grafana instances > load balancer > 2 MySQL databases

  • What did you expect to happen?
    Grafana to handle the load of 50 concurrent users. This means serving the dashboard in use, which has ~10 panels and a refresh rate of 5 seconds, for each user.

  • Can you copy/paste the configuration(s) that you are having problems with?
    I do not believe the configuration itself is the problem. Everything works fine except that Grafana cannot handle 50 users at the same time.
    However, here are the changed configs (via YAML):
    env:
      - name: GF_LOG_LEVEL
        value: debug
      - name: GF_LOG_MODE
        value: file
      - name: GF_PATHS_LOGS
        value: /var/lib/grafana/log
      - name: GF_DATAPROXY_LOGGING
        value: 'true'
      - name: GF_LOG_FRONTEND_ENABLED
        value: 'true'
      - name: GF_SERVER_ROUTER_LOGGING
        value: 'true'
      - name: GF_DATABASE_LOG_QUERIES
        value: 'true'
      - name: GF_AUTH_TOKEN_ROTATION_INTERVAL_MINUTES
        value: '1000'
      - name: GF_SERVER_DOMAIN
        value: -domain-
      - name: GF_SERVER_ROOT_URL
        value: https://%(domain)s/grafana
      - name: GF_INSTALL_PLUGINS
        value: >-
          -custom plugin to communicate with API-;onecc-plugin,https://grafana.com/api/plugins/grafana-piechart-panel/versions/latest/download;grafana-piechart-panel
      - name: GF_AUTH_GENERIC_OAUTH_ENABLED
        value: 'true'
      - name: GF_AUTH_GENERIC_OAUTH_TLS_SKIP_VERIFY_INSECURE
        value: 'true'
      - name: GF_AUTH_GENERIC_OAUTH_CLIENT_ID
        value: -client id-
      - name: GF_AUTH_GENERIC_OAUTH_CLIENT_SECRET
        value: -client secret-
      - name: GF_AUTH_GENERIC_OAUTH_EMAIL_ATTRIBUTE_PATH
        value: sub
      - name: GF_AUTH_GENERIC_OAUTH_AUTH_URL
        value: -link-
      - name: GF_AUTH_GENERIC_OAUTH_TOKEN_URL
        value: -link-
      - name: GF_AUTH_GENERIC_OAUTH_API_URL
        value: -link-
      - name: GF_AUTH_GENERIC_OAUTH_AUTHORITY_URL
        value: -link-
      - name: GF_AUTH_GENERIC_OAUTH_SCOPES
        value: openid profile datastudio_api offline_access
      - name: GF_AUTH_SIGNOUT_REDIRECT_URL
        value: -link-
      - name: GF_AUTH_OAUTH_STATE_COOKIE_MAX_AGE
        value: '60000'
      - name: GF_AUTH_TOKEN_ROTATION_INTERVAL_MINUTES
        value: '1000'
      - name: GF_DATABASE_TYPE
        value: mysql
      - name: GF_DATABASE_HOST
        value: -ip-:-port-
      - name: GF_DATABASE_USER
        valueFrom:
          secretKeyRef:
            name: mysqlpw
            key: username
      - name: GF_DATABASE_PASSWORD
        valueFrom:
          secretKeyRef:
            name: mysqlpw
            key: password
      - name: GF_DATABASE_MAX_IDLE_CONN
        value: '70'
      - name: GF_DATABASE_MAX_OPEN_CONN
        value: '70'

  • Did you receive any errors in the Grafana UI or in related logs? If so, please tell us exactly what they were.
    Some errors came up during the experiments, but adjusting the config resolved them. (For example, Grafana was creating too many connections to the database, which was solved by setting “GF_DATABASE_MAX_OPEN_CONN”.)

  • Did you follow any online instructions? If so, what is the URL?
    Did not follow any online instructions in particular for configuring Grafana.

Enable tracing and use it to find the slow operations or components. Monitor CPU properly: CPU utilization has many subtypes, and writing logs to files also looks like a likely cause of high CPU/IO utilization.
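
As a rough illustration of the tracing suggestion, the fragment below is a minimal sketch of Jaeger tracing settings expressed as container environment variables. It assumes a Jaeger agent is already running in the cluster; the address `jaeger-agent.monitoring.svc:6831` is a hypothetical placeholder, and the variable names assume Grafana's usual mapping of the `[tracing.jaeger]` config section to `GF_TRACING_JAEGER_*`, so check them against the docs for your version.

```yaml
env:
  # Assumption: hypothetical address of a Jaeger agent reachable from the Grafana pod
  - name: GF_TRACING_JAEGER_ADDRESS
    value: jaeger-agent.monitoring.svc:6831
  # "const" with param 1 samples every request; reasonable for a short
  # load test, likely too expensive to leave on permanently
  - name: GF_TRACING_JAEGER_SAMPLER_TYPE
    value: const
  - name: GF_TRACING_JAEGER_SAMPLER_PARAM
    value: '1'
```

Sampling everything adds its own overhead, so it is probably best to compare a traced run against an untraced one before drawing conclusions.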

It is not clear what that test includes, but it looks like a high login rate, which seems suspicious to me.

I have not been able to use tracing, but disabling the extra logs improved performance.
Requests now stay below 2 seconds.
Thanks for the help.
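
For anyone finding this later, here is a sketch of what disabling the extra logs could look like in the deployment. Exactly which of the settings were turned off is not recorded above, so the selection below is an assumption: it simply reverts every logging-related variable from the posted config to a quieter value.

```yaml
env:
  # Assumption: back from "debug" to the less verbose "info" level
  - name: GF_LOG_LEVEL
    value: info
  # Assumption: all extra per-request and per-query logging switched off
  - name: GF_DATAPROXY_LOGGING
    value: 'false'
  - name: GF_SERVER_ROUTER_LOGGING
    value: 'false'
  - name: GF_DATABASE_LOG_QUERIES
    value: 'false'
  - name: GF_LOG_FRONTEND_ENABLED
    value: 'false'
```

Removing these entries from the `env` list altogether should have a similar effect, since Grafana then falls back to its defaults.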