Hello Team,
I am using a simple script to post a transaction with 1000 VUs, and a data store with 250k records (a single column and 250k rows).
With 100 VUs it worked, but with 1000 VUs I am getting an out-of-memory error.
What do I do?
Please let me know if you need any additional info.
We were running a test with a data store that had 1.7 million records in Load Impact (version 3.0) without any issues.
What type of data store are you using? I’m asking because we’ve noticed that CSV parsing with some popular JS libraries like papaparse takes up a surprisingly large amount of RAM, so if that’s the case for you, directly loading JSON files or plain text files might be a partial short-term workaround.
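For illustration, a minimal sketch of what loading a JSON data file in the init context could look like (the file name and its structure are assumptions, not taken from your script):

```javascript
// Sketch: load a JSON data file once per VU in the init context.
// open() is only available in the init context; JSON.parse avoids the extra
// overhead of a CSV-parsing library such as papaparse.
const users = JSON.parse(open('./users.json')); // e.g. ["user1", "user2", ...]

export default function () {
  // Pick a record for this iteration.
  const user = users[Math.floor(Math.random() * users.length)];
  // ... use `user` when building the request ...
}
```

Each VU still keeps its own copy of the parsed array, so this only reduces the parsing overhead, not the per-VU duplication.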
There are other tricks you can use to reduce k6 memory usage (like discardResponseBodies and the upcoming --compatibility-mode=base option), but these won’t fully make up for a huge static file loaded in each VU. Unfortunately, until we fix the underlying issues, we’re unlikely to support millions of data store records with lots of VUs on the same machine. So until then, you’d need a bigger machine, smaller data store files, and/or fewer VUs per machine…
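As a hedged example of the discardResponseBodies trick (the VU count, duration, and URL below are placeholders), it is just a top-level option in the script; --compatibility-mode=base is a separate CLI flag passed to k6 run rather than something set in the script:

```javascript
import http from 'k6/http';

export const options = {
  vus: 1000,
  duration: '10m',
  // Response bodies are not kept in memory, which noticeably reduces RAM
  // usage when the responses are large and you don't need to inspect them.
  discardResponseBodies: true,
};

export default function () {
  http.get('https://test.k6.io/'); // res.body will be null with the option above
}
```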
Thanks Nedyalko. I am using a CSV file; I will try using a JSON file. Even with a JSON file, each VU will still have its own copy of that file in memory, right? Correct me if I am wrong. Could you also tell me if you have an estimate of when this will be fixed?
Meanwhile, I will try your tricks.
As of now I was able to run a test with 250 VUs using a JSON file, which takes 60 GB of memory. Eventually we will need to run a test with 7500 VUs.
In the next few months. The current priority is finally getting k6 v0.26.0 released (next Monday) and then finishing #1007 (hopefully early January). One of us will probably start working on the shared and streaming read-only memory (i.e. data stores) immediately after that. It’s probably going to take at least a few weeks, since, as I pointed out in the CSV issue, there are some complexities involved and we need to design the APIs to be composable.
Hello @ned, sorry to bring up an oldie, but I didn’t want to create a duplicate topic. Can you or someone in the community shed light on any enhancements in this area (CSV API · Issue #1021 · grafana/k6 · GitHub), namely seed data partitioning? Maybe defining a method to use only enough unique data to complete a test of a given duration. Is there now a way to use only partial segments of a data file, rather than reading an entire copy into memory?
Hello @amruth.chintha,
Saw your post and had to contribute a bit.
As was mentioned, each VU will get a copy of your file, and as you multiply the users, the memory requirements grow accordingly.
A solution I have used in the past was to segment my data file.
It may seem rudimentary, but I would split it per VU, ending up with dataFile001.csv to dataFile999.csv, and load the right one using the VU ID. With that you would have only 250 rows per file and most probably no memory problems.
I know splitting the file may seem tedious, but a script could easily help with that.
Then I may not even use SharedArray… just return papaparse.parse(open('./dataFile' + users[vu.idInTest] + '.csv'), { header: true }).data; (see the sketch below).
Just some ideas, I hope that helps; I am working on some posts that explain more ways to tackle this issue.
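A sketch of this split-file idea, assuming files named dataFile001.csv through dataFile999.csv sit next to the script, and using the global __VU variable instead of a users lookup. Note that __VU is 0 during k6’s initial parse of the init code, hence the guard, and that this pattern assumes a local run where all the files are on disk:

```javascript
import papaparse from 'https://jslib.k6.io/papaparse/5.1.1/index.js';

let data = [];
if (__VU > 0) {
  // Pad the VU number to match dataFile001.csv, dataFile002.csv, ...
  const fileName = './dataFile' + ('000' + __VU).slice(-3) + '.csv';
  data = papaparse.parse(open(fileName), { header: true }).data;
}

export default function () {
  const row = data[Math.floor(Math.random() * data.length)];
  // ... use `row` in the request ...
}
```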
That issue hasn’t had any updates, and due to the existence of SharedArray it likely will not get implemented, and very likely not in the way it was discussed 2+ years ago.
@mstoykov would you ‘not’ recommend partitioning a JSON-formatted data file, given that SharedArray does most of this heavy lifting?
I am using SharedArray, but I have had problems with EC2 instance sizing when data files exceed 60M. After a short period of time with smallish EC2 instances, my Jenkins controller would lose connectivity to the AWS instance; after investigation, the EC2 instance had simply crashed due to memory exhaustion. I’m trying to find the sweet spot between data file size, executor options, EC2 load generator instance size, and k6 script utilization across hundreds of service API profiles. In short, I’m trying to get as close to one-size-fits-all as possible.
For example, using an m5.8xl might make sense for a test suite with “heavy” POST calls and high-rate GET calls approaching 12K TPS (sourced from a 3-million-record data file), but using the same instance with the same data file for a 200 TPS test suite is overkill. I want to limit how many EC2 load generator sizes I need to cover the different load/memory situations.
I was hoping a nice, tidy solution was in the k6 SDK :).
would you ‘not’ recommend partitioning a JSON-formatted data file, given that SharedArray does most of this heavy lifting?
I would argue it isn’t needed. SharedArray makes it so there is only one copy of the whole data, so it should be the same as splitting it into one piece per VU. Splitting the data into 2 parts to be shared by 50% of the VUs each, with SharedArray on top of that, should have the same memory characteristics. If it makes the logic easier (you need to split the data into N pieces and N scenarios to work on it), go for it, but otherwise it shouldn’t matter.
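For reference, a minimal sketch of the SharedArray usage being discussed (the file name and parsing are assumptions): the function passed to the constructor runs only once, and every VU reads from the same underlying copy.

```javascript
import { SharedArray } from 'k6/data';
import papaparse from 'https://jslib.k6.io/papaparse/5.1.1/index.js';

// The callback executes a single time; the returned array is shared
// read-only across all VUs instead of being duplicated per VU.
const data = new SharedArray('users', function () {
  return papaparse.parse(open('./users.csv'), { header: true }).data;
});

export default function () {
  const row = data[Math.floor(Math.random() * data.length)];
  // ... use `row` ...
}
```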
exceed 60M.
What is M here? A million data points? How big is that?
after investigation the ec2 instance simply crashed due to memory exhaustion
Are you certain that all the opening and processing of the data happens inside the SharedArray?
POST calls
Uploading data with k6 is currently not very optimized, as has been discussed in this issue. You might need to try different ways of building the body, and maybe cache it, which might make the uploads … more performant.
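As a hedged illustration of the “building the body and maybe caching it” point (the URL and payload below are made up): serialize the payload once in the init context and reuse the resulting string, rather than rebuilding it on every iteration.

```javascript
import http from 'k6/http';

// Built once per VU in the init context and reused by every iteration.
const payload = JSON.stringify({ name: 'order', items: new Array(1000).fill('x') });
const params = { headers: { 'Content-Type': 'application/json' } };

export default function () {
  http.post('https://example.com/api/orders', payload, params);
}
```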
I was hoping a nice tidy solution was in the k6 sdk :).
Given that this depends on how the script behaves during execution, there are no magic bullets, sorry.
I would also recommend opening new topics when you have questions instead of “resurrecting” 2-year-old ones.
I would suggest creating a web server using https://gin-gonic.com/ or any other framework and exposing an endpoint such as http://localhost:8899/getdata.
You can call this endpoint in your script to get the data into a virtual user.
The program should read the CSV into an array and keep serving new data whenever a request is made to the endpoint.
Gin can easily handle 5K requests/sec, and the web server takes around 20 lines of Go code.
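A sketch of the k6 side of this setup (the port, path, and response shape are assumptions about the server described above): each iteration fetches the next record from the local data server instead of every VU loading the whole file.

```javascript
import http from 'k6/http';

export default function () {
  // The local Go/Gin service hands out one record per call.
  const res = http.get('http://localhost:8899/getdata');
  const record = JSON.parse(res.body); // e.g. { "username": "...", "id": 123 }
  // ... use `record` in the actual request under test ...
}
```

Keep in mind that these calls to the data server are also counted in k6’s HTTP metrics unless you tag and filter them out.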
I’m testing a k6 script with 2600 CCU running for 3600s on a machine with 15 GB of RAM, but after about 20 minutes it runs out of RAM.
I used discardResponseBodies; my flow has about 30 APIs, and I use the custom metrics Gauge, Counter, Trend, and Rate for reporting.
Is there any way to reduce the memory/CPU usage?