Bug with k6 Threshold Validation

I noticed something today with k6 when running some tests and wanted to ask what is going on and whether there is any way for me to remedy it. When we run our performance tests, we set an upper and a lower bound as thresholds on the median and p(90) times. Today I was printing the raw JSON data from a test run to the console to look at it. Initially, all of the threshold statuses were accurate (i.e. if the median time was greater than our upper bound, the threshold for the upper bound on the median was reported as not ok). However, as I continued running a few more tests later on, I noticed that k6 seemed to be marking the wrong thresholds as failed.

In the screenshot pasted below, the median time for the test was 457.84. Our lower bound on the median was 39 and our upper bound was 95. Since 457.84 is greater than our lower bound of 39, that threshold should be marked as "ok": true, and because 457.84 is not less than our upper bound of 95, that threshold should be marked as "ok": false. However, as you can see, the opposite happened: "med>39" is marked as "ok": false and "med<95" is marked as "ok": true. Let me know if I can provide any other information. Thank you for your help in advance!

[Screenshot: mismarked_thresholds]
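For anyone who can't view the screenshot, the relevant part of the summary JSON looked roughly like the sketch below. The values are the ones described above; the exact shape of k6's summary output may differ slightly from this.

"example_trend_response_time": {
    "values": {
        "med": 457.84
    },
    "thresholds": {
        "med>39": { "ok": false },
        "med<95": { "ok": true }
    }
}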

Hi @jill.lombardi :wave:

At first glance it does indeed look like a bug, but let's try to confirm that. It would be really helpful if you could indicate the k6 version you're using, and also provide us with an anonymized, rough example of the script you're running. That way we can cross-check and try to reproduce the bug as quickly as possible.
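If it's easier, running the Docker image with the version subcommand should print the version directly, since the image's entrypoint is the k6 binary; something along these lines:

docker run --rm loadimpact/k6 version

The output of that command is exactly the information we're after.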

If we’re able to reproduce it, we’ll likely open a GitHub issue to track it and its future resolution.

Thanks a lot for your help :bowing_man:

Hello! For the version of k6, we are starting a Docker container after running docker pull loadimpact/k6. Hopefully this info is what you need, but let me know if there's another place I could look for the version. Additionally, here is a rough example of what we are running:

import http from 'k6/http';
import { Trend, Rate } from 'k6/metrics';
// textSummary is used by handleSummary() below
import { textSummary } from 'https://jslib.k6.io/k6-summary/0.0.1/index.js';

export let options = {
    thresholds: {
        'example_trend_response_time': ['med>39', 'med<95'],  
        'example_rate_successful_requests': ['rate==1'],
    },
    scenarios: {
        exampleScenario: {
          executor: 'constant-arrival-rate',
          exec: 'exampleFunction',
          rate: 1,
          timeUnit: '30s',
          duration: '30s',
          preAllocatedVUs: 2,
        },
    },
};

const putExampleTrend = new Trend('example_trend_response_time');
const putExampleSuccessRate = new Rate('example_rate_successful_requests');

export function exampleFunction() {
    // placeholder CSRF token for this anonymized example
    const fakeToken = 'fake-csrf-token';

    let examplePUT = http.put(`https://fake/endpoint`, {}, { headers: { 'x-csrf-token': fakeToken } });

    // exampleValidationFunction() (defined elsewhere in our project) outputs logging
    // messages to the console and gets the response time and status of the request
    let examplePUTInfo = exampleValidationFunction(examplePUT, 200, exampleFunction);
    putExampleTrend.add(examplePUTInfo.responseTime);
    putExampleSuccessRate.add(examplePUTInfo.successRate == 200);
}

export function handleSummary(data) {
    // print the raw end-of-test summary data as JSON
    console.log(JSON.stringify(data, null, 2));
    return {
        'stdout': textSummary(data, { indent: ' ', enableColors: true }),
    };
}

Let me know if I can provide you with any additional information! Thanks so much for the response!!

That’s super helpful, @jill.lombardi, thanks a lot for that :bowing_man:

I’m currently working on trying to reproduce the issue, I’ll let you know once I know more :slight_smile:

Hi @jill.lombardi

After some debugging, I can confirm I’m able to reproduce the behavior you observed, and this is very likely to be a bug in the current version of k6. I have documented it in a GitHub issue.

The team hasn’t prioritized it yet, but since it is a bug, I expect we will work on it as soon as possible. I’ll let you know as soon as I have more visibility into when we expect this to be fixed :handshake:

Hi again @oleiade! Thanks so much for the update, I appreciate it! Looking forward to hearing from you again! :wave:

Hi @oleiade! I just took a look at the GitHub issue you created and wanted to add that I have also noticed this same issue with some of the p(90) thresholds that we have.

Hi @jill.lombardi

Thanks for the heads-up :bowing_man: As you might have read in the GitHub issue, we have traced the cause of the issue. We have prioritized its resolution for version 0.42, coming at the beginning of 2023.

In the meantime, we believe you should be able to work around your issue by using the 50th percentile, p(50), instead of med in your scripts :+1:

Hi @oleiade

Ah, gotcha… What would be the best way to implement this? I just tried switching out med for p(50) in these lines:

export let options = {
    thresholds: {
        'example_trend_response_time': ['p(50)>39', 'p(50)<95'],  
        'example_rate_successful_requests': ['rate==1'],
    },

And when I ran this, the thresholds returned as undefined.

Hi @jill.lombardi

We have opened a Pull Request implementing a fix for your issue. This was indeed a bug in our thresholds evaluation engine. We are still deciding if we will produce a v0.41.1 version for it, or if it will land in v0.42.0. I’ll let you know as soon as I have more information on that front.

If you have the time and feel comfortable enough in Go to do that, I’d really appreciate it if you could try the Pull Request branch and tell us if it works as expected from your perspective :bowing_man:
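For reference, trying out the branch would roughly involve the steps below; the branch name is just a placeholder for the one referenced in the Pull Request:

git clone https://github.com/grafana/k6.git
cd k6
git fetch origin <pr-branch>
git checkout <pr-branch>
go build .      # produces a local ./k6 binary
./k6 run your-script.js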

Hi @oleiade

Unfortunately I am using this for work purposes and my team does not feel comfortable using unreleased software :pensive: Thank you for all of your work on this nonetheless!

Hi @oleiade !

I just noticed a pattern with our results today that I thought I would share here. I am not sure if this additional info will be useful to you since you seem to have pinpointed the bug already, but I thought I would mention it anyway.

We have a few thresholds where we are checking both the med and the p(90). For example, such a threshold looks like: 'example_trend_response_time': ['med>39', 'med<95', 'p(90)>30', 'p(90)<100']. A threshold like this one appears to properly validate both the med and the p(90) expressions. We are only seeing the incorrect validation when the threshold includes only med: 'example_trend_response_time': ['med>39', 'med<95'] (see the comparison below).
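To make the comparison easier to read, the two configurations roughly look like this (metric names anonymized as before):

thresholds: {
    // both the med and p(90) expressions validate as expected
    'example_trend_response_time': ['med>39', 'med<95', 'p(90)>30', 'p(90)<100'],
},

thresholds: {
    // med-only expressions: this is where we see the swapped "ok" statuses
    'example_trend_response_time': ['med>39', 'med<95'],
},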

In addition, I wanted to mention that I was not able to get the workaround of substituting p(50) for med to work.

Interesting! Thanks for collecting this information and getting it back to us :pray:

Considering the bug we have spotted, the additional behavior regarding med that you describe is expected. The handling of the median and of percentiles is tightly coupled in k6, which is what led to the specific issue you ran into, and I would attribute the scenario you pointed out to that same bug as well.

Good news though: the fix for this bug has been merged into k6 master, and it has been decided that it will land in v0.42.0, due around mid-December :slight_smile:

Also, regarding p(50), we would expect it to work as intended, but I will need to run some tests and will get back to you :+1:

Hey @jill.lombardi

Just a heads-up that I’ve had a pretty heavy workload these last few days and didn’t get to experiment further, but I’m not forgetting this, and I shall come back to it in the next couple of days :bowing_man:

Hi @jill.lombardi

Just a heads-up that k6 version 0.42 is out, and it contains the fix for this specific issue :tada:
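If you’re pulling the image from Docker Hub, grabbing the new release should look something like the commands below (note that the image now lives under the grafana organization; the exact tag depends on which 0.42 release you pick up):

docker pull grafana/k6:0.42.0
docker run --rm grafana/k6:0.42.0 version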

Amazing! Thank you so much for your work on this :slight_smile:
