I am using the k6 Prometheus dashboard with the remote-write output, configured to send the trend stats min, max, p90, p95, and p99.
I’m having trouble matching the API request duration metrics from the local k6 console output with the Grafana dashboard. I have my test cases set up with api tags, and I modified the “Requests by URL” panel to group by api in addition to the default name, method, and status. I also swapped the p99 metric for p90, since that’s what the console summary outputs.
Queries
avg by(name, method, status, api) (k6_http_req_duration_min{testid=~"$testid"})
avg by(name, method, status, api) (k6_http_req_duration_max{testid=~"$testid"})
avg by(name, method, status, api) (k6_http_req_duration_p90{testid=~"$testid"})
avg by(name, method, status, api) (k6_http_req_duration_p95{testid=~"$testid"})
Transformations
When comparing the dashboard to the k6 console output, the min and max values align perfectly as expected. However, the Grafana dashboard uses “Mean” by default for the p90 and p95 metrics, which does not match the console output.
I experimented with different calculation methods. Switching to “Last” for the p90 and p95 metrics made them match the console output exactly. I also tried using the “90th%” and “95th%” percentile options. While sometimes these percentiles were accurate, other times they were significantly off compared to the console output.
Console output:
Dashboard displaying p90/p95 calculations using different methods:
Given these discrepancies, I’m wondering if there might be a mistake in how the dashboard is configured. Should the quantile calculations (p90/p95) be set to “Last” instead of “Mean” for more accurate results?
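As an alternative to changing the panel calculation, I was also considering moving the “last value” logic into the query itself, so the result doesn’t depend on the panel’s reduce step. A rough sketch, reusing the series names and labels from my queries above (I haven’t verified this is the right approach):
avg by(name, method, status, api) (last_over_time(k6_http_req_duration_p90{testid=~"$testid"}[$__range]))
The idea is that $__range covers the whole dashboard time range, so last_over_time should return the final sample k6 pushed for each series, which is what the end-of-test console summary reflects.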
Additional Tests
I ran a few more test cases to confirm my findings. For some of these cases, neither “Mean” nor “Last” gave me results that matched the console output for the quantile columns. In one particular case, where the test scenario was more complex (with additional tags, APIs, and moving parts), I observed significantly different results.
Console output:
In the following dashboard image, I used “Mean,” “Last,” and the “90th%” and “95th%” percentile options for p90 and p95 metrics.
Dashboard p90/95 using “Mean,” “Last,” and “Percentile” calculations:
Here’s what I found:
- The “Mean” calculation didn’t match the console output for the percentile values. It also showed the same value for p90, p95, etc., across most APIs. It seems that the k6_http_req_duration_$quantile_stat metric was returning identical data for different quantiles for many APIs (see the raw-series check sketched after this list). Any idea why this might happen?
- The “Last” calculation gave slightly different results but was still nearly identical to “Mean” for most APIs, which makes it hard to differentiate between the two.
- The “90th%” and “95th%” percentiles came the closest to matching the console output, but they were still off for all API calls.
- I noticed an extra row in the dashboard that seems to come from calls without an api tag. These calls don’t show up in the console output, and I’m wondering if they are from untagged requests or something else entirely (a query sketch for filtering them out is included after this list).
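For the identical-values observation above, one check would be to query the raw series side by side, without the avg by(...) aggregation, to see whether the stored samples themselves are identical or whether the panel reduction is producing that effect. Something like the selector below, where api="checkout" is just a placeholder for one of my tagged APIs:
{__name__=~"k6_http_req_duration_(p90|p95)", testid=~"$testid", api="checkout"}
If the two series really do carry the same samples, the issue would be in what k6 is pushing rather than in the dashboard.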
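For the extra row, assuming those series simply arrive with an empty api label, it looks like they could be filtered out directly in the queries, for example:
avg by(name, method, status, api) (k6_http_req_duration_p90{testid=~"$testid", api!=""})
I’d still like to understand what those untagged requests actually are before hiding them, though.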
Questions
I would really appreciate some guidance on the following points:
- What is the best way to calculate API request durations in Grafana? “Last” seems to work in some cases but not all. Is there a more reliable approach?
- Why might the k6_http_req_duration_$quantile_stat metric return the same data for different quantiles in my more complex test cases?
- Could the extra row with untagged requests explain some of the mismatches, or is that row just the durations of requests not associated with an api tag?
Thanks in advance for any help!