Consistent response time Spikes in Grafana

davidgoncalooliveira · November 13, 2023, 4:47pm

What Grafana version and what operating system are you using?
Grafana: 9.3.8
OS: on azure Kubernetes
What are you trying to achieve?
Load testing our Grafana Deployment.
How are you trying to achieve it?
Three Grafana server instances are deployed. The Grafana database is Postgres (also deployed on Azure Kubernetes), and three APIs act as the data source (also on Azure Kubernetes).
Using JMeter to simulate 100 users accessing a dashboard each.
Endpoint used for JMeter requests: https:// “grafana link” /api/ds/query
What happened?
Looking at the load test results, we noticed a consistent spike in request times every 10 minutes.
The spike gets worse as more users are simulated in the test.

The JMeter graph above shows the result of a test with 100 users.
The test was configured so that at the 10 minute mark, all users would be making the 4 requests (1 per panel, 4 Panels in the Dashboard).

The first consistent spike occurred at the ~4 minute mark. It was the smallest, since the fewest users were active at that time.

What did you expect to happen?
Response times should be roughly consistent.
Response times should not spike every 10 minutes.

Can you copy/paste the configuration(s) that you are having problems with?
We are using the following configuration via yaml:

      - name: GF_AUTH_GENERIC_OAUTH_EMAIL_ATTRIBUTE_PATH
        value: username
      - name: GF_SERVER_SERVE_FROM_SUB_PATH
        value: "true"
      - name: GF_SERVER_DOMAIN
        value: <our domain>
      - name: GF_SERVER_ROOT_URL
        value: https://%(domain)s/grafana/
      - name: GF_PLUGINS_ALLOW_LOADING_UNSIGNED_PLUGINS
        value: <our plugin>
      - name: GF_INSTALL_PLUGINS
        value: >-
          <our plugin>,https://grafana.com/api/plugins/grafana-piechart-panel/versions/latest/download;grafana-piechart-panel
      - name: GF_AUTH_GENERIC_OAUTH_AUTH_URL
        value: <generic auth link>
      - name: GF_AUTH_GENERIC_OAUTH_TOKEN_URL
        value: <generic auth link>
      - name: GF_AUTH_GENERIC_OAUTH_API_URL
        value: <generic auth link>
      - name: GF_AUTH_GENERIC_OAUTH_AUTHORITY_URL
        value: <generic auth link>
      - name: GF_AUTH_SIGNOUT_REDIRECT_URL
        value: <our redirect link>
      - name: GF_AUTH_GENERIC_OAUTH_SCOPES
        value: <our scopes>
      - name: GF_AUTH_OAUTH_STATE_COOKIE_MAX_AGE
        value: '300'
      - name: GF_AUTH_TOKEN_ROTATION_INTERVAL_MINUTES
        value: '100'
      - name: GF_SECURITY_COOKIE_SECURE
        value: "true"
      - name: GF_SECURITY_COOKIE_SAMESITE
        value: "strict"
      - name: GF_DATABASE_TYPE
        value: postgres
      - name: GF_DATABASE_HOST
        value: <DB link>
      - name: GF_DATABASE_USER
        valueFrom:
          secretKeyRef:
            name: <secret>
            key: username
      - name: GF_DATABASE_PASSWORD
        valueFrom:
          secretKeyRef:
            name: <secret>
            key: password
      - name: GF_DATABASE_SSL_MODE
        value: require
      - name: GF_DATABASE_MAX_IDLE_CONN
        value: '30'
      - name: GF_DATABASE_MAX_OPEN_CONN
        value: '30'

Did you receive any errors in the Grafana UI or in related logs? If so, please tell us exactly
No errors
Did you follow any online instructions? If so, what is the URL?
Did not follow any online instructions.

jangaraj · November 13, 2023, 5:07pm

I would configure tracing for your Grafana, so you will have recorded traces from the load test.

Then you can filter traces by duration and see what is causing the problem in a particular trace. My guess is some problem with the DB, e.g. you reached max open connections, or some table maintenance is running. Maybe you hit some Azure limits. These are all speculation, so use tracing and you will be 100% sure. Grafana is just an application and it already has built-in tracing instrumentation, so use that.

davidgoncalooliveira · November 16, 2023, 11:00am

I obtained traces during a load test and picked some key traces:

average time trace 1.json (12.4 KB)
average time trace 2.json (12.4 KB)
high time trace.json (12.4 KB)
highest time trace.json (6.6 KB)
high authMiddleware 1.json (12.4 KB)

Here is the JMeter response times graph:

The first 2 traces were taken when the request times were “normal”, ~200 milliseconds.
The next 2 are traces from points where the spike in response times occurred.
The last trace is from the end of one of the spikes, where there is an unusually large gap between “initContextWithToken” and the query execution in “pluginv2.Data/QueryData”.

Do you think you can give a more accurate diagnosis of the problem with this data?

davidgoncalooliveira · November 21, 2023, 10:25am

Good day, does anyone have any insights on this problem?

jangaraj · November 21, 2023, 12:32pm

I can’t load your traces. Use your trace backend instead: filter traces by span duration and check the waterfall diagram, e.g.:

davidgoncalooliveira · November 21, 2023, 1:24pm

I am using Jaeger to view the traces.

Here is the waterfall diagram for the “Highest time trace”:

And the waterfall for the “average time trace 1”:

The time difference seems to be in the “initContextWithToken” span.
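
To compare a slow and a normal trace without reading the waterfalls by eye, the exported JSON files can be diffed per operation. The following is only a sketch: it assumes the files use the Jaeger UI export layout (a top-level "data" list of traces, each with a "spans" list whose entries carry "operationName" and "duration" in microseconds). The function names are hypothetical helpers, not part of Grafana or Jaeger.

```python
import json
from collections import defaultdict


def span_time_by_operation(trace_json_text):
    """Sum span durations (microseconds) per operationName in one export.

    Assumption: Jaeger UI export layout, {"data": [{"spans": [...]}]}.
    """
    doc = json.loads(trace_json_text)
    totals = defaultdict(int)
    for trace in doc.get("data", []):
        for span in trace.get("spans", []):
            totals[span["operationName"]] += span["duration"]
    return dict(totals)


def biggest_differences(slow, fast, top=3):
    """Rank operations by how much more time the slow trace spent in them."""
    ops = set(slow) | set(fast)
    deltas = {op: slow.get(op, 0) - fast.get(op, 0) for op in ops}
    return sorted(deltas.items(), key=lambda kv: kv[1], reverse=True)[:top]
```

Reading "highest time trace.json" and "average time trace 1.json" into strings and passing both results to biggest_differences should put the span that grew the most first. Parent spans include the time of their children, so the totals are useful for ranking operations, not for adding up to the request time.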

jangaraj · November 21, 2023, 4:02pm

I guess token rotation is causing a problem:

# How often should auth tokens be rotated for authenticated users when being active. The default is each 10 minutes.
token_rotation_interval_minutes = 10

I don’t know how it is implemented, but the OIDC token may also be refreshed, and each token refresh reaches your IDP as well (an external service), so requests can be slow until all 100 test users have rotated/refreshed their tokens. Try local auth instead of OAuth to prove it.

davidgoncalooliveira · December 5, 2023, 11:08am

Sorry for taking so long to answer here.

Our datasource requires OAuth to answer requests, so the test was run with another datasource that works with both login methods.

A baseline test was done to verify that the problem still occurs with this common datasource and our OAuth:

The regular spikes are still present. However, because the requests are so fast, they are less visible.
We found them by plotting the maximum response time, which shows a spike every 10 minutes from roughly the 17-minute mark onwards.
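
The per-minute maximum described here can also be computed directly from the JMeter results instead of read off a plot. This is a minimal sketch, assuming the results were saved as CSV with JMeter's default `timeStamp` (epoch milliseconds) and `elapsed` (milliseconds) columns. The function names and the `factor` threshold are hypothetical choices for illustration.

```python
import csv
import io
from collections import defaultdict


def max_elapsed_per_minute(jtl_csv_text):
    """Return {minute offset from test start: max elapsed ms}.

    Assumption: CSV text with `timeStamp` and `elapsed` header columns.
    """
    rows = list(csv.DictReader(io.StringIO(jtl_csv_text)))
    if not rows:
        return {}
    start_ms = min(int(r["timeStamp"]) for r in rows)
    worst = defaultdict(int)
    for r in rows:
        minute = (int(r["timeStamp"]) - start_ms) // 60_000
        worst[minute] = max(worst[minute], int(r["elapsed"]))
    return dict(sorted(worst.items()))


def spike_minutes(per_minute_max, factor=3.0):
    """Minutes whose maximum exceeds `factor` times the median maximum."""
    values = sorted(per_minute_max.values())
    if not values:
        return []
    median = values[len(values) // 2]
    return [m for m, v in per_minute_max.items() if v > factor * median]
```

If the spikes are periodic, the minute offsets returned by spike_minutes should be spaced about 10 apart, which makes it easy to compare the OAuth and local auth runs.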

This is the test with local auth:

With local auth we observe the same regular spike roughly every 10 minutes.

jangaraj · December 5, 2023, 12:04pm

The following applies when using Grafana’s built in user authentication

So it looks like token_rotation_interval_minutes applies to all auth types, and the rotation is slow. You could increase it to get nice load test graphs for management, then lower it back to the default for standard security.
