Consistent response time spikes in Grafana

  • What Grafana version and what operating system are you using?
    Grafana: 9.3.8
    OS: Azure Kubernetes Service (AKS)

  • What are you trying to achieve?
    Load testing our Grafana Deployment.

  • How are you trying to achieve it?
    3 server instances of Grafana are deployed; the Grafana database is PostgreSQL (also deployed on Azure Kubernetes), and 3 APIs function as the data sources (also on Azure Kubernetes).
    Using JMeter to simulate 100 users accessing a dashboard each.
    Endpoint used for JMeter requests: https:// “grafana link” /api/ds/query

  • What happened?
    Observing the results of the load test, we noticed a consistent spike in request times every 10 minutes.
    This spike gets worse as more users are simulated in the test.

The JMeter graph above shows the result of a test with 100 users.
The test was configured so that at the 10-minute mark all users would be making 4 requests each (1 per panel; the dashboard has 4 panels).

The first consistent spike occurred at the ~4-minute mark; it was the smallest, since the fewest users were active at that time.

  • What did you expect to happen?
    Response times being somewhat consistent.
    Response times not having a consistent spike every x (in our case, 10) minutes.

  • Can you copy/paste the configuration(s) that you are having problems with?
    We are using the following configuration via yaml:

            value: username
            value: "true"
          - name: GF_SERVER_DOMAIN
            value: <our domain>
          - name: GF_SERVER_ROOT_URL
            value: https://%(domain)s/grafana/
          - name: GF_INSTALL_PLUGINS
            value: >-
              <our plugin>,grafana-piechart-panel
            value: <generic auth link>
            value: <generic auth link>
            value: <generic auth link>
            value: <generic auth link>
            value: <our redirect link>
            value: <our scopes>
            value: '300'
            value: '100'
            value: "true"
            value: "strict"
          - name: GF_DATABASE_TYPE
            value: postgres
          - name: GF_DATABASE_HOST
            value: <DB link>
          - name: GF_DATABASE_USER
            valueFrom:
              secretKeyRef:
                name: <secret>
                key: username
          - name: GF_DATABASE_PASSWORD
            valueFrom:
              secretKeyRef:
                name: <secret>
                key: password
          - name: GF_DATABASE_SSL_MODE
            value: require
          - name: GF_DATABASE_MAX_IDLE_CONN
            value: '30'
          - name: GF_DATABASE_MAX_OPEN_CONN
            value: '30'
  • Did you receive any errors in the Grafana UI or in related logs? If so, please tell us exactly what they were
    No errors

  • Did you follow any online instructions? If so, what is the URL?
    Did not follow any online instructions.

I would configure tracing for your Grafana instance, so you have recorded traces from the load test.

Then you can filter traces by duration and see what is causing the problem in each slow trace. My guess is that there is some problem with the DB, e.g. you reached max open connections, or some table maintenance ran. Maybe you reached some Azure limit. These are all speculations, so use tracing and you will be 100% sure. Grafana is just an application, and it already has built-in tracing instrumentation, so use that.
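For example, tracing can be switched on with env vars in the same deployment YAML you posted. This is just a sketch assuming a Jaeger collector is reachable inside the cluster; the service address below is an assumption, so point it at your own collector:

```yaml
# Sketch: enable Grafana's OpenTelemetry Jaeger exporter via env vars
# (maps to the [tracing.opentelemetry.jaeger] section of grafana.ini).
# The collector URL is a placeholder -- replace it with your Jaeger endpoint.
          - name: GF_TRACING_OPENTELEMETRY_JAEGER_ADDRESS
            value: http://jaeger-collector.observability.svc:14268/api/traces
```

After redeploying, every request should produce a trace you can filter by duration in the Jaeger UI.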


I obtained traces during a load test and picked some key traces:

average time trace 1.json (12.4 KB)
average time trace 2.json (12.4 KB)
high time trace.json (12.4 KB)
highest time trace.json (6.6 KB)
high authMiddleware 1.json (12.4 KB)

here is the JMeter response times graph:

The first 2 traces were obtained at times when request times were “normal”, ~200 milliseconds.
The next 2 are traces from points where the spike in response times occurred.
The last trace is from the end of one of the spikes, where there is an unusually large gap between “initContextWithToken” and the query execution in “pluginv2.Data/QueryData”.

Do you think you can give a more accurate diagnosis of the problem with this data?

Good day, does anyone have any insights on this problem?

I can’t load your traces. But use your trace backend, filter traces by span duration, and check the waterfall diagram, e.g.:

I am using Jaeger to view the traces.

here is the waterfall diagram for the “Highest time trace”:

and the waterfall for the “average time trace 1”:

The time difference seems to be in the “initContextWithToken” span.

I guess token rotation is causing a problem:

# How often should auth tokens be rotated for authenticated users when being active. The default is each 10 minutes.
token_rotation_interval_minutes = 10

I don’t know how it is implemented, but the OIDC token may also be refreshed, and each token refresh reaches your IdP (an external service), so it can be slow until all 100 test users have rotated/refreshed their tokens. Try to use local auth instead of OAuth to prove it.

Sorry for taking so long to answer here.

The datasource we use requires OAuth to answer requests, so the test was run using another datasource that works with both login methods.

A base test was done to verify that the problem still occurred with the common datasource and our OAuth:

The regular spikes are still present; however, since the requests are so fast, they are harder to see.
We managed to find them by plotting the maximum of the response times and discovered a spike every 10 minutes from ~17 minutes onwards.

This is the test with local auth:

With local auth we observe the same regular spike roughly every 10 minutes.

The following applies when using Grafana’s built in user authentication

So it looks like token_rotation_interval_minutes applies to all auth types, and the rotation is slow. You could increase it to get nice load-test graphs for management, and then lower it back to the default for standard security.
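In the Kubernetes deployment YAML from the first post, that could look like the following sketch. The env var maps to token_rotation_interval_minutes in the [auth] section via Grafana's standard GF_&lt;SECTION&gt;_&lt;KEY&gt; override; the 60-minute value is only an illustration, not a recommendation:

```yaml
# Sketch: raise auth token rotation from the default 10 minutes
# (maps to [auth] token_rotation_interval_minutes in grafana.ini).
# '60' is an arbitrary test value -- tune it to your load-test length.
          - name: GF_AUTH_TOKEN_ROTATION_INTERVAL_MINUTES
            value: '60'
```

Remember to revert this after the test, since longer rotation intervals keep auth tokens valid for longer.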