K6_http_req_duration_$quantile_stat Metrics are the Same Across Quantiles for Certain APIs

I’m facing an issue with Grafana when using k6 to send trend metrics with the $quantile_stat method. We’re using the older method of sending metrics where the trend quantiles (like http_req_duration_$quantile_stat) are pre-calculated in k6 before being sent to Prometheus and displayed in Grafana.

When running a specific test case and switching the trend metric query to different quantile values in Grafana, the panels don’t update properly. In each iteration of the test, a login API is called, followed by one of several other APIs based on the challenge case selected, and then a cleanup API. The only API that seems to reflect changes when switching the quantile is the login API, while all other APIs remain static, showing no differences across the quantiles.

To troubleshoot, I used port forwarding to directly view the Prometheus (version 2.53.1) graphs for the k6_http_req_duration_$quantile_stat metrics. I plotted all the APIs on a single graph. Switching between quantile values did not cause any changes in the graphs except for the login API.

Here are screenshots of the graphs for api1 with the min and max quantile stats, and as you can see, they are identical:


Scenario Configuration

Here’s the scenario I’m running:

"scenarios": {
  "api_test": {
    "executor": "constant-arrival-rate",
    "exec": "testName",
    "duration": "30m",
    "rate": 20,
    "timeUnit": "60s",
    "preAllocatedVUs": 200
  }
}

In this setup, I’m using a constant-arrival-rate executor, targeting 20 iterations per minute for a total of 30 minutes, with 200 pre-allocated virtual users. Each iteration executes the exported testName function, which picks a challenge case at random and passes it to the run function.

Test case code:

// randomIntBetween comes from the k6-utils jslib
import { randomIntBetween } from 'https://jslib.k6.io/k6-utils/1.2.0/index.js';

function run(data, challengeCase) {
  // every iteration logs in first, then runs one challenge case, then cleans up
  login_api();
  switch (challengeCase) {
    case 1:
      api1(data);
      break;
    case 2:
      api2(data);
      break;
    ...
    case 8:
      api8(data);
      break;
    case 9:
      break;
  }

  cleanup_api(data);  
}

export function testName(data) {
  let caseNum = randomIntBetween(1, 8);
  run(data, caseNum);
}

In this setup, each case triggers a different API call, with a “cleanup” API running at the end of each iteration. In Prometheus, when graphing k6_http_req_duration_$quantile_stat for each API, the login API is the only one whose values change when $quantile_stat is modified; all the others stay the same regardless of the quantile. I initially thought this might be because the login API runs in every iteration, which could explain why it changes with the quantile. However, the cleanup API also runs at the end of every iteration, yet its metrics remain static regardless of the quantile.

Additional Tests

Since this test case is part of a larger codebase with many dependencies, I wanted to isolate the issue. To do so, I created a custom test case with dummy API calls, similar to this one, and when I reran the test, everything worked perfectly — the quantile metrics updated as expected across the board.

This leaves me wondering if there’s something specific about my original test case or APIs causing the min, p90, p95, p99, and max values to remain the same for an API, regardless of the quantile.

Has anyone experienced this before or have any ideas why the quantiles wouldn’t change for an API with this type of test case or executor? Could there be something I’m overlooking that causes the values to remain identical for different quantiles?

After further investigation, I was able to resolve the issue by using Counter and Gauge metrics combined with a name tag for each API. Initially, I was tagging the APIs we wanted to track with an api tag, like this, and grouping by api on the dashboard:

tags: {
    api: "api_name",
}

However, while the min and max values seemed to align between the dashboard and the console, the quantile values (such as p90, p95) were inconsistent. After reviewing the .csv output from one of the tests, I noticed that the “name” column for each API was identical to the “url” column. Since the URLs were dynamically generated for each request, this resulted in thousands of unique “names” for what should have been grouped metrics. Each request was treated as a separate entity, which I believe led to incorrect quantile calculations.
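To illustrate the problem, here’s a minimal sketch (the URL and variable are made up, not my actual code) of the kind of request that triggers this: with no explicit name tag, k6 sets name to the full request URL, so every unique ID becomes its own series in the output:

import http from 'k6/http';

export default function () {
  // hypothetical dynamic segment: different on every request
  const resourceId = Date.now();
  // no name tag, so k6 falls back to the full URL for the name label,
  // and each unique URL ends up as a separate entry in the results
  http.get(`https://example.com/api1/resources/${resourceId}`);
}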

Solution:

I added a name tag to the relevant APIs, grouping the metrics under a common label to prevent them from being split by dynamic URLs. Here’s how I updated the tag structure:

tags: {
    api: "api_name",
    name: "tracked_metric",
}

This change helped ensure that all requests related to the same API are aggregated together, and it fixed the quantile discrepancies. Now, when I switch between quantiles in Grafana, the metrics update as expected for all APIs.
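For completeness, here’s a rough sketch of how one of the tracked calls looks after the change (the URL, function name, and custom Counter are placeholders, but the tags block mirrors the structure above):

import http from 'k6/http';
import { Counter } from 'k6/metrics';

// hypothetical custom counter, tagged the same way as the requests
const apiCalls = new Counter('api_calls');

export function api1(data) {
  const res = http.get(`https://example.com/api1/resources/${data.id}`, {
    tags: {
      api: 'api1',            // used for grouping on the dashboard
      name: 'api1_resources', // overrides the default URL-based name
    },
  });
  apiCalls.add(1, { api: 'api1' });
  return res;
}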

Root Cause Theory:

I believe that the root cause was the dynamically generated URLs in the “name” column, which resulted in each request being treated as a unique entity in Prometheus. By adding a consistent name tag, I was able to aggregate the metrics properly.

Does this explanation make sense, or could there be something else at play that caused the initial behavior?


Hey @dylanpaulson,

Apologies for the delay! :pray:

Does this explanation make sense, or could there be something else at play that caused the initial behavior?

I think what you described makes a lot of sense; indeed, the workaround you found is what’s suggested in the docs:

“By default, tags have a name field that holds the value of the request URL. If your test has dynamic URL paths, you might not want this behavior, which could bring a large number of unique URLs into the metrics stream.”
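As a minimal sketch (not your exact code), the docs’ suggestion boils down to either setting the name tag explicitly or using the http.url helper, which sets name to the templated string instead of the resolved URL:

import http from 'k6/http';

export default function () {
  const id = Math.floor(Math.random() * 1000); // placeholder dynamic value

  // option 1: set the name tag explicitly
  http.get(`https://example.com/posts/${id}`, {
    tags: { name: 'PostsItemURL' },
  });

  // option 2: http.url sets name to the template string itself
  http.get(http.url`https://example.com/posts/${id}`);
}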

So I’m glad to know that, even though it took us some time to find the capacity to look at your case, you found the root cause and the solution that quickly. Next time we’ll try to be more diligent.

Thanks for sharing your case here, so anyone else in the community experiencing the same kind of issue can learn from it! :tada:

Cheers! :bowing_man: