Massive spike in reqs and VUs; 10% failed requests and up to 30 second duration

Good afternoon, k6 community! I wanted to share an issue I am currently experiencing with a test script we are developing. I will explain it as best I can, but if you have any questions, or I am clearly missing details, please ask or let me know.

We are running the script from GitHub Actions. The workflow runs on Ubuntu 20.04 with a 2-core CPU, 7 GB of RAM, and 14 GB of SSD space. I will break this down between an older version of the script, which works without any issues, and a more recent version that contains xk6, a SQL DB, and some abstraction by way of helper files. I would also add that I am systematically adding things to the functioning script to see what might be causing it to go so wild.

Here is a graph for both the run with the older script and the newer script:

It should be noted that all things regarding length and load are equal between the tests: same environment, same ramp-up, same sustained load, same ramp-down, etc.

As is probably already clear, the graph on the far left is from our older script. Quite stable, with zero failed requests out of about 75,000 total over the course of an hour. The total VUs by the end of the test remained stable at about 56 (split between different requests).

The graph on the right is from a run of our latest script. The numbers were strange and alarming:

There haven’t been any changes to the services we are testing, and we managed a successful test with zero failed requests and good request durations with our original script, after running three tests with our latest script just before it.

As a contrast, here is a run using the original script (this run is cut in half, so don’t be alarmed by the number of requests; I just didn’t save a screen grab of the original, full tests):

We aren’t quite sure why this would be happening, but we suspect it is memory related. Is it possible this is a memory issue? Are any of these signs an indication of it? The original test had most of the code within the script itself (this includes code that randomly generates request bodies). Over time, as we have expanded to test more services, I have moved a lot of this code over to helper files that I import into the script. Side question, but does anyone know of a way to monitor memory usage inside of a GitHub workflow?

The difference between the tests indeed seems alarming, but without information about exactly what was changed, I can’t help you diagnose that problem, sorry. Though I am pretty sure that memory issues alone probably won’t account for the huge increase in request duration and errors :thinking:

Regarding memory usage in GitHub Actions, you can probably run some bash script in the background that measures k6’s memory usage and pipes it to a file while k6 runs. See these for example:
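As a rough illustration, a minimal sketch of that approach (assuming a Linux runner with a standard `ps`; the script name, log file, and interval are placeholders):

```shell
#!/usr/bin/env bash
# Sketch: sample a process's resident set size (RSS, in KB) on an
# interval and append it to a log file until the process exits.
monitor_rss() {
  local pid="$1" log="$2" interval="${3:-5}"
  while kill -0 "$pid" 2>/dev/null; do
    ps -o rss= -p "$pid" >> "$log"
    sleep "$interval"
  done
}

# Hypothetical usage in a workflow step:
#   k6 run script.js &
#   monitor_rss $! k6-mem.log 5
#   wait
```

Afterwards you can upload the log as a workflow artifact and eyeball whether RSS climbs steadily toward the runner's 7 GB limit.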

Thank you for your reply. I will look into monitoring memory usage and circle back with another post that has more details regarding differences between the scripts.

Good morning. After some brute-force (unfortunately) testing, we think we have isolated the issue down to a helper function we’ve imported into the script. The function helps to generate randomized parts of a request body and is used in two different scenarios, for the same service, at the same time (at different iteration rates).

The function is imported at the top:

(screenshot of the import statement)

Like I mentioned earlier, it is used in two different export functions, for the two scenarios, at a rate of about 75k requests an hour (3/4 generated by one and the rest by the other). One of the values passed in as an argument comes from a SharedArray params file.
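For context, a rough sketch of what such a two-scenario setup might look like. The executor choice, scenario and export names, and exact rates are all my assumptions (the real values weren’t shared), chosen only so the totals match the ~75k/hour figure:

```javascript
// Hypothetical sketch only: two scenarios hitting the same service at
// different iteration rates, together totalling ~75k requests/hour.
// (In the real k6 script this object would be `export`ed as `options`.)
const options = {
  scenarios: {
    primary: {
      executor: 'constant-arrival-rate',
      exec: 'primaryFlow',          // assumed export-function name
      rate: 940, timeUnit: '1m',    // ~3/4 of the load
      duration: '1h',
      preAllocatedVUs: 40, maxVUs: 100,
    },
    secondary: {
      executor: 'constant-arrival-rate',
      exec: 'secondaryFlow',        // assumed export-function name
      rate: 310, timeUnit: '1m',    // remaining ~1/4
      duration: '1h',
      preAllocatedVUs: 20, maxVUs: 50,
    },
  },
};
```

With an arrival-rate executor like this, k6 automatically spins up extra VUs (up to `maxVUs`) when responses slow down, which is consistent with the VU spike in the graphs.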

Here is the code:

(I made the names generic to avoid sharing unnecessary and sensitive details)
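Since the snippet above was shared as an image, here is a purely hypothetical reconstruction of what a helper along these lines might look like. The field names, the randomization logic, and the JSON shape are all invented for illustration; only the function name `generateRequest` and the use of a template literal come from the thread:

```javascript
// Hypothetical reconstruction only: a helper that assembles a randomized
// request body, joining its inputs with a template literal.
// In the real project this would live in a helper file and be `export`ed.
function generateRequest(accountId, productCode) {
  const orderId = Math.floor(Math.random() * 1e6); // invented randomization
  // The template literal below is the construct later suspected:
  return `{"accountId":"${accountId}","productCode":"${productCode}","orderId":${orderId}}`;
}
```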
We’ve reduced the function down to returning a hardcoded string without any logic inside the body, and that seemed to work. However, whenever we use this function, at around the 14-minute-30-second mark, like clockwork, we start to see a MASSIVE spike in VUs being created, as well as requests and, of course, many 503s. The service has not gone down. Whenever the tests are run without this function, they pump out the requests and not a single one fails. I am new to a lot of this, so please excuse me if there are details I should have included but have not. Do you have any thoughts on this?

Well, for anyone interested… it looks like the issue was the template literal. In all of the many ways we went about isolating the potential issues, it has come down to that. Is it a memory issue having to do with ES6 being transpiled to ES5 before it is transpiled to Go? Has anyone ever seen anything like this?


Looks like the issue was the template literal

does removing it and using concatenation help?

From my testing, there is no difference in performance between using concatenation and template literals, especially for an example such as yours.
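For reference, the mechanical swap in question looks like this (illustrative values only); the two forms build identical strings, so trying it is a cheap way to rule the template literal in or out:

```javascript
// Template literal vs. plain concatenation: same resulting string.
const name = "world";
const withTemplate = `hello, ${name}!`;      // template literal
const withConcat = "hello, " + name + "!";   // plain concatenation
```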

It would really be helpful if you could give us a script that is unreasonably slow, so we can reproduce it and see what is being slow.

Is it a memory issue having to do with ES6 being transpiled to ES5 before it is transpiled to Go?

No idea, but ES6→ES5 transpilation is used by less and less stuff these days (and hopefully soon not at all), and it has rarely given anyone trouble in the last year or more.

And there is no ES5-to-Go transpilation; we are running/interpreting the JavaScript with goja.

But ultimately we will need a contained example of the problem that we can run in order to investigate it.

Yeah, if you have a self-contained example that you can share, that would be awesome. Though I also have a few other comments:

This part is very confusing to me. Even if there were some sort of memory leak or other bug in k6, I don’t see how that would cause you to get 503 errors :confused: That is a server error; k6 can’t conjure it out of thin air on its own…

How are you sure of this? How are you monitoring this service? Also, do you maybe have a load balancer or some kind of proxy in front of your service that the requests pass through? Does the semi-randomized request you are generating maybe trigger different parts of your code you weren’t testing before?

Maybe responses were cached before, and now suddenly they aren’t and are hitting the actual backend and its slow paths?

While a k6 script could be written poorly and have performance problems, the generateRequest() function you showed looks fairly innocuous, and I don’t think it can be the cause of them. But even if it or something else was causing k6 to perform poorly, I don’t see how that could cause 503 errors or even such a huge increase in the reported response times.

Instead, from your screenshots and explanations, everything points to the fact that your latest code uncovered some problem with the system you are testing. k6 is spinning up more VUs because request responses suddenly start taking longer, the server is completely failing for some requests and returning 503, etc.

How sure are you that the issue is in k6, and why are you sure about that?

Good morning, @mstoykov. Thank you for your response.

Regarding your first question, yes. We ran the test in about 20 or so different configurations. Since we weren’t quite sure how we’d monitor the GitHub-hosted runner for resource utilization, we took a more brute-force approach (clearly not ideal) and ran the script with small changes until we landed on our solution. As it stands right now, we haven’t experienced any issues running our script since discovering the issue and making the changes.

As far as the second half of your response goes: I will take your suggestion and share a script, though I will need to scrub it for sensitive info, and that may not be until much later today. Thanks for the clarification on my understanding of what happens just before runtime.

@ned, thank you for your reply and good morning.

I am equally confused. I wish I could provide a more sophisticated answer to your questions. All things being equal within the script and with the load we are generating, when I use a template literal to concatenate several values, the test just goes haywire at the same exact point.

I can’t say with 100% certainty that the services did not crash or experience any blips. I suppose it is a fair question, as I did not personally check, but instead asked the personnel responsible for our AWS services to investigate. As it stands, eliminating the template literal from my code has the test running as it should.

How sure are you that the issue is in k6, and why are you sure about that?

I suppose you are correct. I can’t be certain it is k6. However, given the fact that I have had three consecutive successful tests run under a load that is 3x our historical peak, without issues, after removing the template literal, I suppose all I can wonder is why.

Thank you both for your responses. @mstoykov, I’ll get a script to you at some point tonight or tomorrow.


I have to admit, this is one super weird issue :sweat_smile: