k6 runners fail at high RPS

Hi all,
I am able to run load tests via the k6 operator for smaller loads (lower RPS), but when I try to run with a larger RPS and a high degree of parallelism, the load test jobs fail with the error:

`Job has reached the specified backoff limit`
Here are the resource specs I am setting for my k6 operator runners:

```yaml
resources:
  limits:
    cpu: 1000m
    memory: 4096Mi
  requests:
    cpu: 500m
    memory: 1024Mi
```
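
For reference, here is a minimal sketch of how a setup like mine maps onto a k6-operator resource (field names follow the k6-operator docs; the apiVersion/kind and exact layout may differ depending on the operator version, so treat it as illustrative rather than my literal manifest):

```yaml
# Illustrative sketch only: a k6-operator TestRun with 10 runners,
# the script served from the "scripts" ConfigMap, and the resource
# requests/limits shown above. Adjust apiVersion/kind to your operator version.
apiVersion: k6.io/v1alpha1
kind: TestRun
metadata:
  name: load-test
  namespace: argo
spec:
  parallelism: 10                      # number of runner pods
  script:
    configMap:
      name: scripts                    # ConfigMap holding the test script
      file: customer-exact-search.js
  arguments: --tag testid=customer-exact-search-2024-10-23
  runner:
    resources:
      limits:
        cpu: 1000m
        memory: 4096Mi
      requests:
        cpu: 500m
        memory: 1024Mi
```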

and here is my load test scenario:

```js
export const options = {
  scenarios: {
    ramping_arrival_rate: {
      executor: 'ramping-arrival-rate',
      startRate: 20,          // Start at 20 iterations/second
      timeUnit: '1s',         // Arrival rate is expressed per second
      preAllocatedVUs: 10000, // Preallocate 10000 VUs
      stages: [
        { target: 100, duration: '2m' },   // Ramp up to 100 iterations/s over 2 minutes
        { target: 1000, duration: '4m' },  // Ramp up to 1000 iterations/s over the next 4 minutes
        { target: 3000, duration: '3m' },  // Ramp up to 3000 iterations/s over the next 3 minutes
        { target: 0, duration: '1m' },     // Ramp down to 0 over 1 minute
      ],
    },
  },
};
```
I try to run these tests on 10 runners.
This is the output I get when I run this command:
**kubectl describe job load-test-3 -n argo**

```
Image: 578061096415.dkr.ecr.us-east-1.amazonaws.com/load-test:1.0.0
Port: 6565/TCP
Host Port: 0/TCP
Command:
k6
run
--quiet
--execution-segment=2/10:3/10
--execution-segment-sequence=0,1/10,2/10,3/10,4/10,5/10,6/10,7/10,8/10,9/10,1
--tag
testid=customer-exact-search-2024-10-23
/test/customer-exact-search.js
--address=0.0.0.0:6565
--paused
--tag
instance_id=3
--tag
job_name=load-test-3
Limits:
cpu: 1
memory: 4Gi
Requests:
cpu: 500m
memory: 1Gi
Liveness: http-get http://:6565/v1/status delay=0s timeout=1s period=10s #success=1 #failure=3
Readiness: http-get http://:6565/v1/status delay=0s timeout=1s period=10s #success=1 #failure=3
Environment:
DB_USERNAME: qqeq
DB_PASSWORD: qeqeq
DB_DATABASE: qqeqeq
DB_HOST: rrr
API_BASE_URL: api.mykaarma.com
TEST_DEALER_UUID: a1dde11c73eca65b529b6125b98c0088dba81e7967f261e5601b0dc64e498ea4
TEST_DEPARTMENT_UUID: ttt
SERVICE_SUSBCRIBER_USERNAME: vhqU7KRiuEy1qfU4TEiYHP-E2zSJgz2CPKEedUPSCwo
SERVICE_SUSBCRIBER_PASSWORD: k1_TunYv4vrgtGCGtc-CFqF99ER2Me8FbVOMwZlIY_M
TEST_DEALER_ID: 1
Mounts:
/test from k6-test-volume (rw)
/tmp from logfolder (rw)
Volumes:
k6-test-volume:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: scripts
Optional: false
logfolder:
Type: HostPath (bare host directory volume)
Path: /var/log/kaarya
HostPathType:
Events:
Type Reason Age From Message


Normal SuccessfulCreate 3m11s job-controller Created pod: load-test-3-5q9h9
Warning BackoffLimitExceeded 3m6s job-controller Job has reached the specified backoff limit
```


Can anyone tell me what the possible reason could be? Also, one thing I noticed is that I see an extra pod, load-test-starter, when everything works fine, but not when the load-test-* pods fail.
![Screenshot 2024-10-23 at 5.22.47 PM|690x374](upload://wQyX6sn4qCZWVHvKQbvtpi9fJzn.png)

Hi @vipulkhullar,
I suggest checking the description of the runner pods as well. The job has failed because a pod has failed; the question is why. With the info you provided, my guess is that the pods couldn't be scheduled because of insufficient resources: that can happen when one tries to run a larger test and the cluster is not "ready" for it.

Hope that helps!

Thanks for the reply!
I ran kubectl describe on the failed runner pod in the argo namespace and got the following:

```
Name:             load-test-10-m5c27
Namespace:        argo
Priority:         0
Service Account:  default
Node:             ip-192-168-229-59.ec2.internal/192.168.229.59
Start Time:       Sat, 26 Oct 2024 03:46:10 +0000
Labels:           app=k6
                  batch.kubernetes.io/controller-uid=50ff2fdd-b113-4b78-80c1-1a6ac66c7855
                  batch.kubernetes.io/job-name=load-test-10
                  controller-uid=50ff2fdd-b113-4b78-80c1-1a6ac66c7855
                  job-name=load-test-10
                  k6_cr=load-test
                  runner=true
Annotations:      <none>
Status:           Failed
IP:               192.168.193.177
IPs:
  IP:           192.168.193.177
Controlled By:  Job/load-test-10
Containers:
  k6:
    Container ID:  containerd://2c356d35bc35da13e9b13c955d4a58f76ef216364022df172a7f0d299a92a170
    Image:         578061096415.dkr.ecr.us-east-1.amazonaws.com/load-test:1.0.0
    Image ID:      578061096415.dkr.ecr.us-east-1.amazonaws.com/load-test@sha256:98ed72d8224a8c0b9529cb33816d260e57cc1fbdb37ca286db4dc561076d0e11
    Port:          6565/TCP
    Host Port:     0/TCP
    Command:
      k6
      run
      --quiet
      --execution-segment=9/10:1
      --execution-segment-sequence=0,1/10,2/10,3/10,4/10,5/10,6/10,7/10,8/10,9/10,1
      --tag
      testid=customer-exact-search-2024-10-23
      --log-output=stdout
      /test/customer-exact-search.js
      --address=0.0.0.0:6565
      --paused
      --tag
      instance_id=10
      --tag
      job_name=load-test-10
    State:          Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Sat, 26 Oct 2024 03:46:12 +0000
      Finished:     Sat, 26 Oct 2024 03:46:12 +0000
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     200m
      memory:  2Gi
    Requests:
      cpu:      100m
      memory:   1Gi
    Liveness:   http-get http://:6565/v1/status delay=0s timeout=1s period=10s #success=1 #failure=3
    Readiness:  http-get http://:6565/v1/status delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:
      DB_USERNAME:                  qqeq
      DB_PASSWORD:                  qeqeq
      DB_DATABASE:                  qqeqeq
      DB_HOST:                      rrr
      API_BASE_URL:                 qa-api.mykaarma.com
      TEST_DEALER_UUID:             a1dde11c73eca65b529b6125b98c0088dba81e7967f261e5601b0dc64e498ea4
      TEST_DEPARTMENT_UUID:         ttt
      SERVICE_SUSBCRIBER_USERNAME:  vhqU7KRiuEy1qfU4TEiYHP-E2zSJgz2CPKEedUPSCwo
      SERVICE_SUSBCRIBER_PASSWORD:  k1_TunYv4vrgtGCGtc-CFqF99ER2Me8FbVOMwZlIY_M
      TEST_DEALER_ID:               1
    Mounts:
      /test from k6-test-volume (rw)
      /tmp from logfolder (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-xrbj6 (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  k6-test-volume:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      scripts
    Optional:  false
  logfolder:
    Type:          HostPath (bare host directory volume)
    Path:          /var/log/kaarya
    HostPathType:  
  kube-api-access-xrbj6:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason       Age   From               Message
  ----     ------       ----  ----               -------
  Normal   Scheduled    70s   default-scheduler  Successfully assigned argo/load-test-10-m5c27 to ip-192-168-229-59.ec2.internal
  Warning  FailedMount  69s   kubelet            MountVolume.SetUp failed for volume "k6-test-volume" : failed to sync configmap cache: timed out waiting for the condition
  Warning  FailedMount  69s   kubelet            MountVolume.SetUp failed for volume "kube-api-access-xrbj6" : failed to sync configmap cache: timed out waiting for the condition
  Normal   Pulled       68s   kubelet            Container image "578061096415.dkr.ecr.us-east-1.amazonaws.com/load-test:1.0.0" already present on machine
  Normal   Created      68s   kubelet            Created container k6
  Normal   Started      68s   kubelet            Started container k6
```

Also, one more thing: if any one of the 10 pods fails, the other pods do not run the JS script at all. Is that intended?

For the pods which are in a running state but not executing the script:

> kubectl describe pod load-test-6-rwgtx  -n argo
```
Name:             load-test-6-rwgtx
Namespace:        argo
Priority:         0
Service Account:  default
Node:             ip-192-168-229-239.ec2.internal/192.168.229.239
Start Time:       Sat, 26 Oct 2024 03:46:09 +0000
Labels:           app=k6
                  batch.kubernetes.io/controller-uid=5e90baea-0749-4543-a94c-d7ede832aee0
                  batch.kubernetes.io/job-name=load-test-6
                  controller-uid=5e90baea-0749-4543-a94c-d7ede832aee0
                  job-name=load-test-6
                  k6_cr=load-test
                  runner=true
Annotations:      <none>
Status:           Failed
IP:               192.168.206.80
IPs:
  IP:           192.168.206.80
Controlled By:  Job/load-test-6
Containers:
  k6:
    Container ID:  containerd://937452e438937294cee59e725e7acfba12e7ca9e94b40b0f87bf482afd2fb5ed
    Image:         578061096415.dkr.ecr.us-east-1.amazonaws.com/load-test:1.0.0
    Image ID:      578061096415.dkr.ecr.us-east-1.amazonaws.com/load-test@sha256:98ed72d8224a8c0b9529cb33816d260e57cc1fbdb37ca286db4dc561076d0e11
    Port:          6565/TCP
    Host Port:     0/TCP
    Command:
      k6
      run
      --quiet
      --execution-segment=5/10:6/10
      --execution-segment-sequence=0,1/10,2/10,3/10,4/10,5/10,6/10,7/10,8/10,9/10,1
      --tag
      testid=customer-exact-search-2024-10-23
      --log-output=stdout
      /test/customer-exact-search.js
      --address=0.0.0.0:6565
      --paused
      --tag
      instance_id=6
      --tag
      job_name=load-test-6
    State:          Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Sat, 26 Oct 2024 03:46:11 +0000
      Finished:     Sat, 26 Oct 2024 03:46:11 +0000
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     200m
      memory:  2Gi
    Requests:
      cpu:      100m
      memory:   1Gi
    Liveness:   http-get http://:6565/v1/status delay=0s timeout=1s period=10s #success=1 #failure=3
    Readiness:  http-get http://:6565/v1/status delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:
      DB_USERNAME:                  qqeq
      DB_PASSWORD:                  qeqeq
      DB_DATABASE:                  qqeqeq
      DB_HOST:                      rrr
      API_BASE_URL:                 qa-api.mykaarma.com
      TEST_DEALER_UUID:             a1dde11c73eca65b529b6125b98c0088dba81e7967f261e5601b0dc64e498ea4
      TEST_DEPARTMENT_UUID:         ttt
      SERVICE_SUSBCRIBER_USERNAME:  vhqU7KRiuEy1qfU4TEiYHP-E2zSJgz2CPKEedUPSCwo
      SERVICE_SUSBCRIBER_PASSWORD:  k1_TunYv4vrgtGCGtc-CFqF99ER2Me8FbVOMwZlIY_M
      TEST_DEALER_ID:               1
    Mounts:
      /test from k6-test-volume (rw)
      /tmp from logfolder (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-tmjnd (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  k6-test-volume:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      scripts
    Optional:  false
  logfolder:
    Type:          HostPath (bare host directory volume)
    Path:          /var/log/kaarya
    HostPathType:  
  kube-api-access-tmjnd:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason                           Age    From               Message
  ----     ------                           ----   ----               -------
  Normal   Scheduled                        7m11s  default-scheduler  Successfully assigned argo/load-test-6-rwgtx to ip-192-168-229-239.ec2.internal
  Warning  FailedToRetrieveImagePullSecret  7m9s   kubelet            Unable to retrieve some image pull secrets (dockercred); attempting to pull the image may not succeed.
  Normal   Pulled                           7m9s   kubelet            Container image "578061096415.dkr.ecr.us-east-1.amazonaws.com/load-test:1.0.0" already present on machine
  Normal   Created                          7m9s   kubelet            Created container k6
  Normal   Started                          7m9s   kubelet            Started container k6
```

Here is the output of kubectl describe node for the node where the failed pod was created:

```
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests         Limits
  --------           --------         ------
  cpu                2230m (28%)      4100m (51%)
  memory             2378304Ki (15%)  7083136Ki (47%)
  ephemeral-storage  0 (0%)           0 (0%)
  hugepages-1Gi      0 (0%)           0 (0%)
  hugepages-2Mi      0 (0%)           0 (0%)
  hugepages-32Mi     0 (0%)           0 (0%)
  hugepages-64Ki     0 (0%)           0 (0%)
```

Hi @vipulkhullar,

> if any one of the 10 pods fails, the other pods do not run the JS script at all. Is that intended?

Yes, it is expected. The test is not started until all runner pods are scheduled and ready.

From the above, it appears that your setup is hitting two errors at once: one with the failure to mount the ConfigMap (the script) and one with retrieving the image pull secrets. IMHO, both of these errors indicate some kind of issue with the cluster itself. Perhaps it is a connectivity issue, given that the image had been pulled prior to the error. It makes sense to focus on resolving those cluster issues before trying to run k6-operator tests.
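
For reference, the two objects those errors point at are ordinary Kubernetes resources that must exist and be readable in the same namespace as the runner pods. A minimal sketch, using the names visible in your output (`scripts` ConfigMap, `dockercred` pull secret) with placeholder contents:

```yaml
# Sketch only: the objects the runner pods reference in the "argo" namespace.
# Names are taken from the describe output; the data values are placeholders.
apiVersion: v1
kind: ConfigMap
metadata:
  name: scripts
  namespace: argo
data:
  customer-exact-search.js: |
    // k6 test script contents go here
---
apiVersion: v1
kind: Secret
metadata:
  name: dockercred
  namespace: argo
type: kubernetes.io/dockerconfigjson
data:
  .dockerconfigjson: BASE64_ENCODED_DOCKER_CONFIG  # placeholder value
```

If either of them is missing or unreadable, or the kubelet cannot sync them from the API server in time, warnings like the FailedMount and FailedToRetrieveImagePullSecret events above can appear.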

But the same tests run well when parallelism is defined as 2. I am not sure why the ConfigMap and image pull errors appear; I even get readiness probe errors in the running pods. Also, from the output of the node description command, it seems the node has enough CPU and memory to accommodate one k6 runner pod.