K6 tests interrupted on Blazemeter run

Hi,

I wanted to post about an issue that I’m seeing when I run k6 test scripts on Blazemeter.

Whenever I upload my test files to Blazemeter, the k6 test script gets interrupted at 100+ users, causing the test to stop executing.

My test case is an average load test, meaning that it has a short ramp-up period (10-30s) and a short duration (1-5 mins). Locally, the test script runs fine without any issues. Has anyone encountered this problem?

We’ve been debugging infrastructure and the test script but haven’t found anything conclusive so far.

Test script:

/* eslint-disable @typescript-eslint/ban-ts-comment */
import { sleep } from "k6";
import http from "k6/http";
import { SharedArray } from "k6/data";
/* @ts-ignore */
import papaparse from "https://jslib.k6.io/papaparse/5.1.1/index.js";
/* @ts-ignore */
import { randomItem } from "https://jslib.k6.io/k6-utils/1.1.0/index.js";

const csvData = new SharedArray("Tokens data array", function () {
  return papaparse.parse(open("hhh-users-dev.csv"), { header: true }).data;
});

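// Note: this runs in the init context, so each VU picks one random token at startup
// and reuses it for every iteration.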
const userToken = randomItem(csvData);
console.log("User Token", userToken.token);

const baseUrl = "https://test.vmo2digital.co.uk/";
const route = "mobile";

const serviceStatusUrl = `${baseUrl}${route}/servicestatus`;
const homeUrl = `${baseUrl}${route}/home`;
const customerUrl = `${baseUrl}${route}/customer`;
const ordersUrl = `${baseUrl}${route}/orders`;

export default () => {
  const date = new Date().toISOString();

  const params = {
    headers: {
      "Accept-Version": "1.x",
      Authorization: userToken.token,
      "DAPI-ChannelID": "HHH",
      "DAPI-CorrelationID": "307909bf-b0d5-47aa-8ab6-c859246c66b3",
      "DAPI-RequestID": "307909bf-b0d5-47aa-8ab6-c859246c66b4",
      "DAPI-RequestTimestamp": date,
      "vmmd-senderuri": "MVMAANDROID",
      "x-downstream-switch": "use-legacy",
    },
  };

  http.get(homeUrl, params);
  http.get(serviceStatusUrl, params);
  http.get(customerUrl, params);
  http.get(ordersUrl, params);

  sleep(2);
};

Local test run:


running (1m53.0s), 100/100 VUs, 2601 complete and 0 interrupted iterations
default   [  94% ] 100 VUs  1m53.0s/2m0s

running (1m54.0s), 100/100 VUs, 2642 complete and 0 interrupted iterations
default   [  95% ] 100 VUs  1m54.0s/2m0s

running (1m55.0s), 100/100 VUs, 2667 complete and 0 interrupted iterations
default   [  96% ] 100 VUs  1m55.0s/2m0s

running (1m56.0s), 100/100 VUs, 2702 complete and 0 interrupted iterations
default   [  97% ] 100 VUs  1m56.0s/2m0s

running (1m57.0s), 100/100 VUs, 2711 complete and 0 interrupted iterations
default   [  97% ] 100 VUs  1m57.0s/2m0s

running (1m58.0s), 100/100 VUs, 2724 complete and 0 interrupted iterations
default   [  98% ] 100 VUs  1m58.0s/2m0s

running (1m59.0s), 100/100 VUs, 2736 complete and 0 interrupted iterations
default   [  99% ] 100 VUs  1m59.0s/2m0s

running (2m00.0s), 100/100 VUs, 2772 complete and 0 interrupted iterations
default   [ 100% ] 100 VUs  2m00.0s/2m0s

running (2m01.0s), 089/100 VUs, 2784 complete and 0 interrupted iterations
default ↓ [ 100% ] 100 VUs  2m0s

running (2m02.0s), 046/100 VUs, 2827 complete and 0 interrupted iterations
default ↓ [ 100% ] 100 VUs  2m0s

running (2m03.0s), 008/100 VUs, 2865 complete and 0 interrupted iterations
default ↓ [ 100% ] 100 VUs  2m0s

     data_received..................: 19 MB  152 kB/s
     data_sent......................: 1.3 MB 10 kB/s
     http_req_blocked...............: avg=11.97ms min=0s      med=0s    max=3.14s    p(90)=1µs    p(95)=1µs    
     http_req_connecting............: avg=1.86ms  min=0s      med=0s    max=371.48ms p(90)=0s     p(95)=0s     
     http_req_duration..............: avg=1.96s   min=31.61ms med=1.32s max=17.81s   p(90)=4.03s  p(95)=5.5s   
       { expected_response:true }...: avg=1.96s   min=31.61ms med=1.32s max=17.81s   p(90)=4.03s  p(95)=5.5s   
     http_req_failed................: 0.00%  ✓ 0         ✗ 11492
     http_req_receiving.............: avg=2.71ms  min=5µs     med=29µs  max=253.89ms p(90)=9.84ms p(95)=13.98ms
     http_req_sending...............: avg=32.6µs  min=7µs     med=28µs  max=209µs    p(90)=56µs   p(95)=67µs   
     http_req_tls_handshaking.......: avg=9.11ms  min=0s      med=0s    max=2.67s    p(90)=0s     p(95)=0s     
     http_req_waiting...............: avg=1.96s   min=30.84ms med=1.31s max=17.81s   p(90)=4.03s  p(95)=5.49s  
     http_reqs......................: 11492  92.887212/s
     iteration_duration.............: avg=4.24s   min=1.04s   med=3.88s max=19.56s   p(90)=6.25s  p(95)=7.49s  
     iterations.....................: 2873   23.221803/s
     vus............................: 8      min=8       max=100
     vus_max........................: 100    min=100     max=100

running (2m03.7s), 000/100 VUs, 2873 complete and 0 interrupted iterations
default ✓ [ 100% ] 100 VUs  2m0s

Test run on Blazemeter:
Partial log:

running (0m13.9s), 100/100 VUs, 14 complete and 0 interrupted iterations
default   [  12% ] 100 VUs  0m13.9s/2m0s

running (0m14.9s), 100/100 VUs, 48 complete and 0 interrupted iterations
default   [  12% ] 100 VUs  0m14.9s/2m0s

running (0m15.9s), 100/100 VUs, 97 complete and 0 interrupted iterations
default   [  13% ] 100 VUs  0m15.9s/2m0s

running (0m16.9s), 100/100 VUs, 105 complete and 0 interrupted iterations
default   [  14% ] 100 VUs  0m16.9s/2m0s

running (0m17.9s), 100/100 VUs, 105 complete and 0 interrupted iterations
default   [  15% ] 100 VUs  0m17.9s/2m0s

     data_received..................: 2.6 MB 143 kB/s
     data_sent......................: 270 kB 15 kB/s
     http_req_blocked...............: avg=99.43ms  min=127ns    med=307ns   max=1.31s    p(90)=426.96ms p(95)=713.62ms
     http_req_connecting............: avg=14.04ms  min=0s       med=0s      max=795.66ms p(90)=48.57ms  p(95)=96.8ms  
     http_req_duration..............: avg=9.75s    min=394.17ms med=10.91s  max=14.76s   p(90)=13.55s   p(95)=14.06s  
       { expected_response:true }...: avg=9.75s    min=394.17ms med=10.91s  max=14.76s   p(90)=13.55s   p(95)=14.06s  
     http_req_failed................: 0.00%  ✓ 0         ✗ 491  
     http_req_receiving.............: avg=81.18µs  min=17.05µs  med=33.77µs max=1.27ms   p(90)=226.55µs p(95)=268.27µs
     http_req_sending...............: avg=119.85µs min=17.04µs  med=37.14µs max=16.9ms   p(90)=83.49µs  p(95)=105.71µs
     http_req_tls_handshaking.......: avg=73.63ms  min=0s       med=0s      max=1.12s    p(90)=274ms    p(95)=559.44ms
     http_req_waiting...............: avg=9.75s    min=394.04ms med=10.91s  max=14.76s   p(90)=13.55s   p(95)=14.06s  
     http_reqs......................: 491    26.797839/s
     iteration_duration.............: avg=13.57s   min=1.88s    med=14.78s  max=16.63s   p(90)=15.74s   p(95)=15.9s   
     iterations.....................: 114    6.221902/s
     vus............................: 100    min=100     max=100
     vus_max........................: 100    min=100     max=100


running (0m18.3s), 000/100 VUs, 114 complete and 900 interrupted iterations
default ✗ [  15% ] 100 VUs  0m18.3s/2m0s

Hi @sherrylenegauci

Thanks for the additional context.

For future reference, I’ll add more details from what we discussed in the community Slack (the story gets lost there, and it’s not easily searchable).

I have a k6 test script that I’m running on Blazemeter; however, I’m seeing iterations being interrupted when running high loads (for example, 200 concurrent users) on Blazemeter. I’m trying to pinpoint whether the issue is coming from the script or not, particularly from the configuration I’m testing with.

Following: https://www.blazemeter.com/blog/k6-load-testing

I tried to replicate this from my laptop a few times (I installed the Taurus CLI as well) and I don’t see the test runs being interrupted.

I’ve been in conversations with Blazemeter support and we’ve hit a dead end at the moment. They said that everything looks fine from their end and asked me to check the test script. I’m just wondering whether there could be a (code) reason why tests get interrupted.

At first glance from the results, the latency (http_req_duration) you get when executing via Blazemeter is much higher than when running locally.

A couple of additional questions to see if we can help you pinpoint this issue:

  • Can you share the Options? How do you define the scenarios, and do you have any thresholds defined? (A sketch of the kind of options block I mean follows after this list.)
  • What version of k6 are you running locally and on Blazemeter? Is it the latest and the same?
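
For reference, here is the kind of options block I mean. This is only a minimal sketch, assuming a ramping-VUs scenario; the scenario name, stages, threshold limits, and URL are placeholders, not your real values:

import http from "k6/http";

// Minimal sketch of explicit k6 options with a scenario and thresholds.
// All names and numbers below are illustrative only.
export const options = {
  scenarios: {
    average_load: {
      executor: "ramping-vus",
      startVUs: 0,
      stages: [
        { duration: "30s", target: 100 }, // ramp up
        { duration: "2m", target: 100 },  // hold
        { duration: "30s", target: 0 },   // ramp down
      ],
    },
  },
  thresholds: {
    http_req_duration: ["p(95)<5000"], // 95th percentile below 5s
    http_req_failed: ["rate<0.07"],    // fewer than 7% failed requests
  },
};

export default function () {
  http.get("https://test.k6.io/");
}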

Cheers!

Hi @sherrylenegauci

Since you mentioned on Slack that you have a different k6 version on your local Mac than where you run Blazemeter, if you want to make sure this is not related, you could try two things:

  • The easiest, probably, if you have Docker installed on your Mac, is to run the Docker image with the test: docker run --rm -i grafana/k6:0.45.0 run - <test.js. If the test needs local files (such as the CSV data), you might need to mount the volume and run the test from that volume, so those are also included in the run. E.g. docker run --rm -i -v "${PWD}:/tests" grafana/k6:0.45.0 run /tests/test.js.
  • You can use the trick described in How To Install Older Versions of Homebrew Packages - Nelson Figueroa, as I did, to download an older version of k6 (I was able to find 0.45.1 in https://github.com/chenrui333/homebrew-core/blob/bump-k6-0.45.1/Formula/k6.rb). That’s probably a bit more cumbersome; if you have Docker, it’s easier to test with a concrete version that way.

That said, based on the output from both executions, where you have plenty of dropped iterations when running with Blazemeter, I would look in that direction. What executors are you running? What are the Options you define in the test? Why is the latency of the endpoint under test higher when run from Blazemeter compared to locally? Is it caused by the network, or do you also see higher response times in the endpoint itself depending on where the test runs from?

Cheers!

Hi @eyeveebee

Thanks so much for the information.

Yes, I have noticed that the latency from the Blazemeter test run is significantly higher. I think this may be because the tests are run on an AWS server and the requests are routed a little differently; that has been the feedback from the Blazemeter team.

I do have some configurations set as part of a .yml file. Unfortunately, I cannot set options within the script itself as Blazemeter doesn’t seem to be able to interpret them and throws some warnings.

execution:
  - executor: k6
    concurrency: 200
    ramp-up: 30s #5-15% ramp up of hold-for value
    hold-for: 3m #duration of test
    throughput: 2 #rps
    steps: 1
    scenario: hhh-api-dev
    locations:
      eu-west-2: 1

The thresholds defined at the moment are the following:

reporting:
  - module: passfail
    criteria:
      - fail>7%
      - subject: p95
        condition: '>' 
        threshold: 5s
        stop: false

I confirmed that the k6 package version is the same on my machine as well as on the server executing the tests.

Hi @sherrylenegauci

Thanks for sharing the configuration. Blazemeter chooses the scenario to execute, so we can’t be sure which one it uses; my guess is that it’s using a Ramping VUs executor under the hood, though we don’t know how they map the Blazemeter configuration to k6.

Interestingly, you have defined thresholds, which is probably why the test stops on the server side but not locally. Based on the Dropped iterations documentation, I would look at this:

With constant-arrival-rate and ramping-arrival-rate, iterations drop if there are no free VUs. If it happens at the beginning of the test, you likely just need to allocate more VUs. If this happens later in the test, the dropped iterations might happen because SUT performance is degrading and iterations are taking longer to finish.
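
To illustrate the mechanism (this is not necessarily how Blazemeter maps its passfail criteria, which we don’t know): in plain k6, a threshold with abortOnFail stops the whole test as soon as it is crossed. A minimal sketch, assuming limits similar to your criteria:

import http from "k6/http";

// Sketch only: thresholds with abortOnFail abort the run once they are crossed.
// The limits mirror the passfail criteria (p95 under 5s, fewer than 7% failures);
// whether Blazemeter translates its criteria into something like this is an assumption.
export const options = {
  vus: 100,
  duration: "2m",
  thresholds: {
    http_req_duration: [
      { threshold: "p(95)<5000", abortOnFail: true, delayAbortEval: "10s" },
    ],
    http_req_failed: [
      { threshold: "rate<0.07", abortOnFail: true },
    ],
  },
};

export default function () {
  http.get("https://test.k6.io/");
}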

A few things you can do to test what the root cause is:

  • Remove the passfail criteria, leaving the rest of the configuration, and see if the test finishes. If it finishes, it’s clear the endpoint is taking longer than expected and that, after ~20s, k6 already knows the test is a fail.
    • Try again with the threshold set above the latency Blazemeter sees from your system under test (p95 of 14.06s). Set a threshold of 20s or 25s and see if the test completes. If it does, it’s k6 deciding that the test will fail because the passfail criteria are already exceeded for the executed requests.
  • If the test still fails at the start, without passfail or with a higher threshold, you might need more VUs (see the sketch after this list).
    • I can’t say how Blazemeter configures the number of VUs; it might be related to the concurrency setting. We don’t know how this configuration translates in their case.
    • Hopefully, the Blazemeter team will be able to help if you share the Dropped iterations and Running large tests documentation with them.
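
To make the VU point above concrete, here is a minimal sketch of an arrival-rate scenario in plain k6, where preAllocatedVUs and maxVUs determine whether iterations get dropped. The executor choice and all numbers are assumptions; we don’t know what Blazemeter actually derives from the concurrency setting:

import http from "k6/http";

// Sketch of a constant-arrival-rate scenario: k6 starts `rate` iterations per
// timeUnit, and if every allocated VU is still busy (because responses are slow),
// new iterations are dropped. Raising preAllocatedVUs/maxVUs gives headroom.
export const options = {
  scenarios: {
    hhh_api_dev: {
      executor: "constant-arrival-rate",
      rate: 2,              // iterations started per timeUnit (cf. "throughput: 2" in the YAML)
      timeUnit: "1s",
      duration: "3m",
      preAllocatedVUs: 200, // VUs created before the test starts
      maxVUs: 400,          // upper bound k6 may scale to before dropping iterations
    },
  },
};

export default function () {
  http.get("https://test.k6.io/");
}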

That said, the difference in latency is what is concerning, and what should be addressed. Testing the above will only allow you to understand why it fails on Blazemeter servers and not locally.

If you are expecting a 5s (p95) latency, and you clearly get that when executing Blazemeter locally, the Blazemeter team should be able to check whether they have a network bottleneck, or something similar, when running the test in their cloud. Maybe they are running on containers with a bandwidth restriction, the routes to the endpoint you are testing are slower than from your local network, etc.

You could also try to reduce resource usage on the load generator, as described in Running large tests, by setting discardResponseBodies: true. Though, since the Blazemeter run works well locally and shows a higher latency on the server side, I would ask the Blazemeter team to help locate the root cause instead of trying to bypass it. If they find where the bottleneck is when the requests come from their servers, the root cause can be addressed.
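
For reference, a minimal sketch of that option in a plain k6 script; the per-request override in the comment is only needed if some response body is actually used:

import http from "k6/http";

// Sketch: discard response bodies globally to lower memory/CPU use on the load
// generator; override per request only where a body is actually needed.
export const options = {
  discardResponseBodies: true,
};

export default function () {
  http.get("https://test.vmo2digital.co.uk/mobile/servicestatus");
  // Per-request override if a response body is needed (assumption: none is here):
  // http.get(url, { responseType: "text" });
}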

I hope this helps :bowing_woman: