Explaining client errors:
- 1050 - request timeout
- 1211 - dial i/o timeout
- 1220 - read: connection reset by peer
As part of our service health evaluation tool, we look at the HTTP responses of requests, but we don’t know how to handle a response with status 0 (indicating a client error).
The most common ones we encounter are the ones listed above. How can we further understand whether it’s a service-side issue, a proxy/gateway issue, or a client issue? Our load agents aren’t fully optimized (e.g. TCP re-use, increasing HTTP ports, etc.), but we don’t know whether that is the limitation or something else.
I’m looking for some info on how to interpret these errors better.
Hi @jhwj9617 !
We have a documentation page where we explain these codes: Error Codes.
Regarding the response 0, is there a case where the Response.error_code comes with 0?
Cheers!
I’m familiar with that page. The ‘0’ I’m referring to is the HTTP status, which comes with a separate client error code (1050/1211/1220).
My question is whether the errors are the result of a service/network bottleneck or of a lack of optimization on the client agent (e.g. TCP re-use, starved ports). Are there separate errors for those? Or is it ambiguous whether it’s a client issue or not?
Unfortunately, it could be both.
I mean, the status can only be determined once we get something from the server; before that, it defaults to 0. But the error could be on the client side or still be some network error. I’d say that in that case the Response.error_code should be more explicit.
Hope that answers your question.
Cheers!
@olegbespalov We have a Health evaluator tool which reads these metrics.
Right now we can’t tell with confidence whether it’s a genuine service-side issue or a client-side one. We don’t want to flag false positives.
If we don’t have optimized clients, I’m afraid these client errors are more pronounced. For now I’m thinking we filter out client errors to avoid false-positives. And then when we have confidence that client side is rarely the issue, if at all, then we can remove the filter and count ‘0’ status as failures as part of health evaluation.
The main point is that, when evaluating service-health, we are mostly only interested in service-side health, and not client issues.
What do you think? What is the standard practice within the community?
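To make the trade-off concrete, here is a minimal sketch of the two counting policies. The sample shape (`{ status }`) and the function name `failureRates` are made up for illustration and are not part of k6; it also assumes that only 5xx responses count as service-side failures.

```javascript
// Each sample is a hypothetical record of one request: { status: number }.
// status 0 means no HTTP status was ever received from the server.
function failureRates(samples) {
  let serverFailures = 0; // a status was received, and it was 5xx
  let noStatus = 0;       // status 0: timeout, dial failure, reset, ...
  for (const s of samples) {
    if (s.status === 0) noStatus += 1;
    else if (s.status >= 500) serverFailures += 1;
  }
  const total = samples.length;
  const answered = total - noStatus;
  return {
    // Strict view: every status-0 response counts as a failure.
    strict: total === 0 ? 0 : (serverFailures + noStatus) / total,
    // Filtered view: status-0 responses are excluded from both
    // the numerator and the denominator.
    filtered: answered === 0 ? 0 : serverFailures / answered,
    // Share of requests that never got a status, worth tracking on its own.
    noStatusShare: total === 0 ? 0 : noStatus / total,
  };
}

const samples = [{ status: 200 }, { status: 200 }, { status: 503 }, { status: 0 }];
const rates = failureRates(samples);
console.log(rates);
```

Reporting `noStatusShare` next to the filtered rate keeps the excluded responses visible, so the filter can be dropped later without losing history.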
Hey @jhwj9617
As I said, the status 0 could mean different things. It could be a misconfiguration of resources in the infrastructure or a load generator that isn’t optimized (see e.g. Running large tests).
I’d still recommend checking what is inside the Response.error_code and taking some actions based on that knowledge. Maybe you could even emit a custom metric based on the Response.error_code and monitor it.
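A rough sketch of what such a metric could look like. The helper names `errorTags` and `record` and the `no_status` tag are hypothetical, and the `Map` only stands in for a real k6 metric so the snippet runs outside k6. It assumes `error_code` is 0 when nothing went wrong.

```javascript
// Build metric tags from the two Response fields discussed in this thread.
// Tag values are converted to strings.
function errorTags(res) {
  return {
    status: String(res.status),
    error_code: String(res.error_code),
    // Hypothetical extra tag: "true" when no HTTP status was received.
    no_status: String(res.status === 0),
  };
}

// Stand-in for a custom counter: counts occurrences per tag set.
const tally = new Map();
function record(res) {
  if (res.error_code === 0) return; // assumed to mean "no error"
  const key = JSON.stringify(errorTags(res));
  tally.set(key, (tally.get(key) || 0) + 1);
}

// Hand-written sample responses, not real test output.
[
  { status: 200, error_code: 0 },
  { status: 0, error_code: 1050 },
  { status: 0, error_code: 1220 },
  { status: 0, error_code: 1050 },
].forEach(record);

console.log(Object.fromEntries(tally));
```

Inside a k6 script, the same tags could be passed to a `Counter` from `k6/metrics`, e.g. `counter.add(1, errorTags(res))`, so that each error code shows up as its own series.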
Also, to understand your case better: are these status 0 responses happening when trying to reach a certain RPS? And what percentage of requests do they represent?
Cheers!
Hi @olegbespalov
We do track the Response.error_code, and the most common are the ones I listed:
- 1050 - request timeout
- 1211 - dial i/o timeout
- 1220 - read: connection reset by peer
That’s what I’m trying to get at. For these cases, do I need more information, or can I make judgements about our system (client/network/service) based on them?
They come up intermittently: less than 1% of the time, but often enough that it affects our target SLO (99.95%). They show up at both low (~10 RPS) and high (~1000 RPS) request rates.
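For context on why a sub-1% rate still matters: a 99.95% SLO leaves an error budget of only 0.05% of requests. A rough sketch of that arithmetic follows; the function name `withinSlo` and the request counts are made up for illustration.

```javascript
// Does an observed failure count fit inside the error budget of an SLO?
function withinSlo(failed, total, sloPercent) {
  const allowedRatio = (100 - sloPercent) / 100; // 99.95% leaves roughly 0.0005
  return failed / total <= allowedRatio;
}

const totalRequests = 100000; // illustrative volume, not a measured figure
const atOnePercent = withinSlo(1000, totalRequests, 99.95); // 1% failures
const atTwoHundredths = withinSlo(20, totalRequests, 99.95); // 0.02% failures
console.log(atOnePercent, atTwoHundredths);
```

With these numbers the 1% case is outside the budget and the 0.02% case is inside it, so even a small share of status 0 responses can decide whether the SLO is met.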
@olegbespalov
Any info on how to interpret these error codes better?
Hi @jhwj9617
To be honest, neither 10 RPS nor 1000 RPS sounds high, so it’s highly likely the issue is with the system under test (or something in between), and it deserves a closer investigation.
- 1050 is the request timeout. Basically, k6 made a request, but there was no response within the defined timeout (by default, 60s for HTTP).
- 1211: k6 wasn’t even able to make an HTTP request, because the TCP connection to the target system couldn’t be established in time.
- 1220: This is caused by the target system resetting the TCP connection. It can happen when the load balancer or the server itself isn’t able to handle the traffic.
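For a health evaluation tool, the three explanations above could be kept as a small lookup so reports carry the interpretation alongside the raw code. This is only a sketch: `K6_ERROR_NOTES` and `describeError` are hypothetical names, and only the three codes from this thread are covered.

```javascript
// Notes taken from the explanations in this thread; not an exhaustive list.
const K6_ERROR_NOTES = {
  1050: {
    summary: 'request timeout',
    note: 'no response within the timeout (60s by default for HTTP)',
  },
  1211: {
    summary: 'dial i/o timeout',
    note: 'TCP connection was not established, so no HTTP request was sent',
  },
  1220: {
    summary: 'read: connection reset by peer',
    note: 'the remote end (server or load balancer) reset the TCP connection',
  },
};

// Return a one-line description for a report or log entry.
function describeError(code) {
  const entry = K6_ERROR_NOTES[code];
  if (!entry) return `${code}: not covered here, check the k6 error code docs`;
  return `${code} ${entry.summary}: ${entry.note}`;
}

console.log(describeError(1211));
```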
So in theory, you could try to reduce the number of 1050 errors by increasing the timeout setting. However, it’s also worth looking into the metrics or logs of the LB to see if anything looks suspicious.
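As a minimal sketch, the timeout can be raised per request through the request params. The 120s value and the URL in the comment are placeholders, not recommendations.

```javascript
// Request params for k6's http module; `timeout` accepts a duration string.
const params = {
  timeout: '120s', // assumption: double the 60s default, purely illustrative
};

// In a k6 script this object would be passed as the last argument, e.g.:
//   import http from 'k6/http';
//   const res = http.get('https://example.com/', params); // placeholder URL
console.log(params.timeout);
```

A longer timeout only changes how long k6 waits. If the 1050 errors go away, the responses were slow rather than lost, which is still something to check on the LB or service side.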
Cheers!