Discrepancy Between k6 Console Output and Grafana Dashboard for p90/p95 Metrics

I am using the k6 Prometheus dashboard with the Prometheus remote write output to send trend stats for request duration (min, max, p90, p95, p99).

I’m having trouble matching the API request duration metrics from the local k6 console output with the Grafana dashboard. I have my test case set up with API tags, and I modified the “Requests by URL” panel to group by api instead of the default name, method, and status. I also swapped the p99 metric for p90 since that’s what is output in the console log.

Queries

avg by(name, method, status, api) (k6_http_req_duration_min{testid=~"$testid"})
avg by(name, method, status, api) (k6_http_req_duration_max{testid=~"$testid"})
avg by(name, method, status, api) (k6_http_req_duration_p90{testid=~"$testid"})
avg by(name, method, status, api) (k6_http_req_duration_p95{testid=~"$testid"})

Transformations

When comparing the dashboard to the k6 console output, the min and max values align perfectly, as expected. However, the dashboard panel uses the “Mean” calculation by default for the p90 and p95 columns, which does not match the console output.

I experimented with different calculation methods. Switching to “Last” for the p90 and p95 metrics made them match the console output exactly. I also tried using the “90th%” and “95th%” percentile options. While sometimes these percentiles were accurate, other times they were significantly off compared to the console output.

Console output:

Dashboard displaying p90/p95 calculations using different methods:

Given these discrepancies, I’m wondering if there might be a mistake in how the dashboard is configured. Should the quantile calculations (p90/p95) be set to “Last” instead of “Mean” for more accurate results?
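My working theory (an assumption on my part, not something I’ve confirmed in the k6 docs): each remote-write flush appears to carry the quantile computed over all samples seen so far, so the final data point equals the end-of-test value printed in the console summary. If that’s right, it would explain why “Last” matches exactly, while “Mean” averages the intermediate snapshots and “90th%” takes a percentile of the snapshot values rather than of the raw samples. A toy sketch in plain JavaScript (hypothetical data; simplified nearest-rank percentile, whereas k6 interpolates):

```javascript
// Toy model: the output periodically flushes the p90 of all samples seen
// so far; the console summary shows the *final* (end-of-test) p90.

function percentile(samples, p) {
  const sorted = [...samples].sort((a, b) => a - b);
  // Nearest-rank index (simplified; k6 uses linear interpolation)
  const idx = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, idx)];
}

// Simulated request durations arriving over the test run (ms)
const durations = [120, 95, 300, 110, 250, 500, 130, 90, 400, 105];

// Cumulative p90 snapshots, one per "flush interval" of 2 samples
const snapshots = [];
for (let i = 2; i <= durations.length; i += 2) {
  snapshots.push(percentile(durations.slice(0, i), 90));
}

const last = snapshots[snapshots.length - 1]; // what the console reports
const mean = snapshots.reduce((a, b) => a + b, 0) / snapshots.length;

console.log({ snapshots, last, mean });
```

In this toy run the mean of the snapshots differs from the final value, which mirrors the mismatch I’m seeing between the “Mean” column and the console summary.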

Additional Tests

I ran a few more test cases to confirm my findings. For some of these cases, neither “Mean” nor “Last” gave me results that matched the console output for the quantile columns. In one particular case, where the test scenario was more complex (with additional tags, APIs, and moving parts), I observed significantly different results.

Console output:

In the following dashboard image, I used “Mean,” “Last,” and the “90th%” and “95th%” percentile options for p90 and p95 metrics.

Dashboard p90/95 using “Mean,” “Last,” and “Percentile” calculations:

Here’s what I found:

  • The “Mean” calculation didn’t match the console output for the percentile values. It also showed the same value for p90, p95, etc., across most APIs. It seems that the k6_http_req_duration_$quantile_stat metric was returning identical data for different quantiles for many APIs. Any idea why this might happen?
  • The “Last” calculation gave slightly different values, but they were still nearly identical to “Mean” for most APIs, making it hard to tell the two apart.
  • The “90th%” and “95th%” percentiles came the closest to matching the console output, but they were still off for all API calls.
  • I noticed an extra row in the dashboard that seems to come from calls without an api tag. These calls don’t show up in the console output, and I’m wondering if they might be from untagged requests or something else entirely.

Questions

I would really appreciate some guidance on the following points:

  • What is the best way to calculate API request durations in Grafana? “Last” seems to work in some cases but not all. Is there a more reliable approach?
  • Why might the k6_http_req_duration_$quantile_stat metric return the same data for different quantiles in my more complex test cases?
  • Could the extra row of untagged requests explain some of the mismatches, or does it simply represent request durations that aren’t associated with an api tag?

Thanks in advance for any help!

I’ve figured out the issue and wanted to share my findings!

After reviewing my test setup and comparing it with a similar problem I experienced (which I detailed in this post), I found that the root cause was incorrectly set tags. Specifically, my API requests were being split into thousands of unique time series because of dynamic URLs (each unique URL became its own series), which led to incorrect quantile calculations. Once I corrected the tags by adding a consistent name tag for each API, the quantiles started working properly in Grafana.
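For reference, here’s a minimal sketch of the kind of normalization I applied (plain JavaScript; the helper name and the exact patterns are illustrative, assuming numeric IDs and UUIDs are the dynamic path segments in your URLs):

```javascript
// Sketch: collapse dynamic path segments so every call to the same API
// shares one `name` tag (otherwise each unique URL becomes its own series).
function nameTag(url) {
  return url
    .replace(/\/\d+(?=\/|$)/g, '/{id}')                           // numeric IDs
    .replace(/\/[0-9a-f]{8}-[0-9a-f-]{27}(?=\/|$)/gi, '/{uuid}'); // UUIDs
}

console.log(nameTag('https://api.example.com/users/12345/orders/999'));
// In a k6 script, the normalized value is passed via the request params:
//   http.get(url, { tags: { name: nameTag(url), api: 'orders' } });
```

With a stable name tag per API, the quantile series stop fragmenting and the dashboard values line up with the console output.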

Regarding the p90/p95 calculations, I found that the “Last” calculation method is the correct way to match the console output in the dashboard. The k6 Prometheus dashboard (available here) uses “Mean” by default for quantile values, which does not correlate with the quantiles in the console output. Is there a reason why “Mean” is used in this dashboard for quantile calculations, or could this be something that should be updated?

Hi @dylanpaulson!

Thanks for the update on the topic!

Is there a reason why Mean is used in this dashboard for quantile calculations, or could this be something that should be updated?

TBH, I don’t have the answer to that question, but I believe it could be raised as an issue or pull request (feel free to add a contribution if you want) against the dashboard. The code is located here: xk6-output-prometheus-remote/grafana/dashboards at main · grafana/xk6-output-prometheus-remote · GitHub

And if I’m not mistaken, these are PRs for adding this dashboard:

Once (and if) the changes are accepted there, we could update the dashboard published at k6 Prometheus | Grafana Labs.

Hope that explains a bit!

Cheers!

Thanks for the response!

I went ahead and created a pull request addressing the issue. Here is a link to the request.
