I’m hosting a set of services in Kubernetes on GKE. These services are all behind Cloudflare (and are otherwise not available publicly). We’re using preemptible nodes, but have taken quite a lot of steps to minimise downtime caused by preemptions.
WorldPing is one of 3 services we’re using to ping our services; Cloudflare and Google are the other 2. For the time being, WorldPing pings the most frequently of the 3 (every 30s vs every 60s).
Our current uptime tends to sit at roughly 99.95%, ±0.05% or so. While not bad, it’s basically guaranteed to be around this level at the moment because of the nature of our infrastructure. That’s okay, but I do want to try to figure out whether there is more I could be doing in this Kubernetes cluster to improve the uptime.
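To put that figure in perspective, here is a rough back-of-the-envelope sketch of how many failed checks it represents. It assumes (and I haven’t confirmed this) that the reported uptime is simply the fraction of successful checks over the window; `failed_checks` and its parameters are names I made up for illustration.

```python
def failed_checks(uptime_pct: float, interval_s: float,
                  window_s: float = 7 * 24 * 3600) -> float:
    """Estimate how many checks failed in a window.

    Assumption: uptime is just successful_checks / total_checks,
    which may not be how any particular monitor calculates it.
    """
    total_checks = window_s / interval_s
    return total_checks * (1 - uptime_pct / 100)


# One week of 30s checks at 99.95% uptime
print(round(failed_checks(99.95, 30), 2))
```

Under those assumptions, 99.95% at a 30s interval over a week works out to roughly 10 failed checks out of 20,160, so even a handful of transient errors would be enough to account for the whole gap.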
To do that, however, I need to understand more about what’s going on. We get messages like these in the WorldPing events that I’m currently confused about (along with some others that I’m not confused about):
- error resolving hostname. failed to resolve hostname to IP.
  - Our services are fronted by Cloudflare, so it’s Cloudflare’s DNS and hostname that should be in use. I’m confused about how this can fail so often. (We’re a paying Cloudflare customer too.)
- error resolving hostname. failed to resolve hostname to valid IP.
  - I’m not sure how this differs from the message above. Does it mean it got some kind of DNS response, but one that seemed invalid? If so, I’m not sure how, since it should be Cloudflare’s DNS being hit. This error and the one above are both quite common.
- error connecting. dial tcp 188.8.131.52:443: connect: no route to host
  - That’s a Cloudflare IP it’s complaining about. Again, quite a common error. I find it hard to believe it’s an issue with Cloudflare though, given how many websites they power, etc.
- tls handshake error. read tcp 184.108.40.206:50010->220.127.116.11:443: i/o timeout
  - Another one I’m just not sure about. Cloudflare should be available 99.99% of the time if not more, but this is another fairly common error we see. Surely for Cloudflare to know where to route the request it has to start this handshake process off, so I don’t understand how it can be failing.
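The messages above seem to fall into three separate phases of an HTTPS check: DNS resolution, TCP connect, and TLS handshake. One thing I could do to narrow this down is run my own probe from another location and record which phase fails. Below is a minimal sketch of that idea using only the Python standard library. The names `probe` and `classify` and the labels they return are my own, not anything WorldPing exposes, and I’m only guessing that its checks follow the same sequence.

```python
import errno
import socket
import ssl
import time


def classify(phase: str, exc: BaseException) -> str:
    """Map an exception raised during a probe phase to a short label."""
    if isinstance(exc, socket.gaierror):
        return "dns: could not resolve hostname"
    if isinstance(exc, (socket.timeout, TimeoutError)):
        return f"{phase}: timeout"
    if isinstance(exc, ssl.SSLError):
        return "tls: handshake failed"
    if isinstance(exc, OSError) and exc.errno == errno.EHOSTUNREACH:
        return "connect: no route to host"
    return f"{phase}: {exc!r}"


def probe(host: str, port: int = 443, timeout: float = 5.0) -> dict:
    """Run DNS, TCP connect and TLS handshake separately, timing each."""
    result = {"host": host, "ok": False, "error": None, "timings": {}}
    phase = "dns"
    try:
        start = time.monotonic()
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
        result["timings"]["dns"] = time.monotonic() - start
        family, socktype, proto, _, addr = infos[0]  # first address only

        phase = "connect"
        start = time.monotonic()
        sock = socket.socket(family, socktype, proto)
        sock.settimeout(timeout)
        try:
            sock.connect(addr)
            result["timings"]["connect"] = time.monotonic() - start

            phase = "tls"
            start = time.monotonic()
            ctx = ssl.create_default_context()
            # wrap_socket performs the handshake before returning
            with ctx.wrap_socket(sock, server_hostname=host):
                result["timings"]["tls"] = time.monotonic() - start
                result["ok"] = True
        finally:
            sock.close()
    except OSError as exc:
        # gaierror, timeouts and SSLError are all OSError subclasses
        result["error"] = classify(phase, exc)
    return result
```

Calling `probe("my-service.example.com")` (a placeholder hostname) on a schedule from a couple of different networks and logging the results should show whether the failures are visible from anywhere else, or only from the WorldPing probes.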
Basically all of these issues are confusing me because we’re not seeing any Cloudflare downtime being reported, yet they all seem like issues that should only occur if either Cloudflare is having issues, or WorldPing is having issues.
On top of the above, Cloudflare seems to think (to my own surprise) that we have had 100% uptime for the week or so since I turned its checks on. Google seems to think we’re generally around 99.97% or higher. I suppose this could be explained by neither of them encountering the issues above?
Are these messages symptoms of a different issue? Is my understanding of the error messages wrong? Does WorldPing have extremely common downtime itself?
Any assistance would be greatly appreciated!