I am attempting to install Grafana dashboards via configuration files when starting up a cluster. A few months ago, I encountered persistent failures to download and install the dashboards via their ID. I was also learning to provision Linkerd, all of LGTM, and some other tools so I assumed I was doing something wrong. Starting yesterday, though, the errors returned.
logger=provisioning.dashboard type=file name=default t=2024-08-27T04:31:44.123942723Z level=error msg="failed to load dashboard from " file=/var/lib/grafana/dashboards/default/top-line.json error=EOF
The EOF error is caused by the download-dashboard container failing to request dashboard data and thus producing empty files, which Grafana then fails to parse. I thought that perhaps I was being rate limited because I’ve been doing so much experimenting, but switching to a VPN seemed to have no effect.
A couple months ago when I was first encountering this issue, putting the ID manually into the Grafana UI would actually cause the dashboard to hard crash. Now, it does not crash, but it takes double-digit seconds for the request to succeed and the dashboard to load.
So yeah. That’s where I am. All of the dashboards inside of my Helm values file are failing to download. Nothing else appears to have changed.
I am running Windows 11 with Microk8s and Multipass as my VM manager. Everything is up-to-date. I experience the same behavior on two different systems, one AMD and the other Intel.
I init Microk8s, install the LGTM stack, include dashboards in my Helm values file, and they all fail to install.
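For reference, here is a trimmed sketch of the dashboard section of my values file. This assumes the standard Grafana Helm chart keys (`dashboardProviders`, `dashboards`, `gnetId`); the dashboard name and IDs here are illustrative, and your chart nesting may differ:

```yaml
grafana:
  dashboardProviders:
    dashboardproviders.yaml:
      apiVersion: 1
      providers:
        - name: default
          orgId: 1
          type: file
          disableDeletion: false
          options:
            path: /var/lib/grafana/dashboards/default
  dashboards:
    default:
      top-line:
        gnetId: 15487      # dashboard ID on grafana.com
        revision: 1
        datasource: Prometheus
```

With this layout, the chart spins up a short-lived curl container that downloads each `gnetId` from grafana.com into the provider path above, which is why the failures show up as empty JSON files rather than Helm errors.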
I’ll give that a shot. I’m not sure how to mimic it since this process uses a curl container that starts up specifically to request external data then shuts itself down.
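One way to mimic it by hand is a throwaway curl pod inside the cluster. This is a sketch, assuming the chart pulls from grafana.com’s dashboard download endpoint (the exact URL and revision the chart requests may differ):

```shell
# One-off debug pod that makes the same kind of request the
# download-dashboard container makes, then removes itself.
# 15487 is one of the dashboards that fails for me.
kubectl run curl-debug --rm -it --restart=Never --image=curlimages/curl -- \
  sh -c 'curl -sv "https://grafana.com/api/dashboards/15487/revisions/1/download" | head -c 200'
```

If this prints nothing (an empty body), it reproduces the empty files that cause the EOF errors.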
So I have dived into Wireshark for the first time in my career, and it appears that the DNS lookup succeeds and the request is made, but the response is being blocked as Destination unreachable (Host administratively prohibited). This indicates that it’s being blocked by my firewall. This is at least an insight, but I still do not understand why the problem is intermittent, why it affects both of my systems at the same time, or why it is happening at all, since Hyper-V is explicitly allowed in my firewall settings.
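For anyone retracing this, those ICMP messages can be isolated with a display filter; a sketch using tshark, assuming the Hyper-V switch interface name (yours will differ):

```shell
# ICMP type 3 / code 10 is "destination unreachable
# (host administratively prohibited)"
tshark -i "vEthernet (Default Switch)" -Y "icmp.type == 3 && icmp.code == 10"
```

The same filter expression works in the Wireshark GUI’s filter bar.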
I should have tested before responding. I completely disabled the firewall and it still doesn’t work. I think I have it narrowed down to weirdness in Windows’ Hyper-V. Multipass even has a section of their docs dedicated to problems like these.
So I discovered and solved some DNS issues related to Multipass and Hyper-V, but this did not solve the dashboard loading failure. I can directly hit the Grafana API from within the VM and immediately receive the payload. But of interest, after following Multipass’ instructions for putting the Hyper-V into a good state, I was able to reproduce the dashboard crash failures. I tried to load up 15487 and, after a wait, the dashboard hard crashes with this error:
Error: Minified React error #31; visit https://reactjs.org/docs/error-decoder.html?invariant=31&args[]=object%20with%20keys%20%7Bstatus%2C%20statusText%2C%20data%2C%20config%2C%20traceId%7D for the full message or use the non-minified dev environment for full errors and additional helpful warnings.
span
94753/l<@https://linkerd-dashboard.local/grafana/public/build/4570.87a4acc6d9144f9ee50a.js:286:66336
div
90613/o<@https://linkerd-dashboard.local/grafana/public/build/4570.87a4acc6d9144f9ee50a.js:213:45471
div
90613/o<@https://linkerd-dashboard.local/grafana/public/build/4570.87a4acc6d9144f9ee50a.js:213:45471
div...
So I started up Ubuntu in my cluster and I have figured out that the problem is likely with CoreDNS. CoreDNS no longer supports DNS compression, so if a UDP response comes back larger than 512 bytes, it barfs. Sigh. At least I’m on the trail.
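A quick way to see the size problem from inside that Ubuntu pod, assuming dnsutils is installed and the default Microk8s kube-dns service IP (10.152.183.10; check yours with `kubectl -n kube-system get svc`):

```shell
# Without EDNS, UDP answers are capped at 512 bytes; an oversized
# response should come back with the "tc" (truncated) flag set.
dig +noedns grafana.com @10.152.183.10

# Advertise a larger EDNS buffer and the same answer can
# arrive intact in a single UDP packet.
dig +bufsize=1232 grafana.com @10.152.183.10
```

Comparing the two responses shows whether the resolver chain is mishandling answers that exceed the classic 512-byte UDP limit.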
So I have “solved” my problem. I have downgraded my version of Microk8s to one that uses CoreDNS 1.10.0. I am now able to pull the dashboards. My Microk8s version had not changed when the failures started, so I assume that whatever DNS responses I am getting have changed. This could be coming from my FiOS router. Regardless, CoreDNS should not be this fragile, and they are working on a fix as found here: Pod not able to connect to url login.microsoftonline.com:443 · Issue #3604 · kubernetes-sigs/kind · GitHub
So I think I can close this thread. Nothing else to investigate here.
Yeah. I like Microk8s well enough, but I have found myself having to manually install some of the addons to fix incompatibilities or other problems. Basically this exact problem. Hyper-V is the best solution for Windows work, but I’m really thinking about moving to Linux for my dev work. I also use graphics apps that do not have Linux versions, which is why I went all-in with Windows after Apple pissed me off too much. But I think I’ll just accept dual-boot going forward.
As for downloading the JSON, I want to keep my entire architecture in config YAMLs, and my configuration repo is already a mess of files. I’d need to add a file for each dashboard, which is undesirable. Also, this was only the most visible external DNS failure. There were others hidden in my cluster’s behavior that I also need to prevent.
Ah ok. Yeah, I looked into that, but I kept crashing into problems with accessing the Linux system from Windows. I assume there are solutions, but Multipass and Hyper-V made running VMs very easy.
Just adding a coda to this. I was encountering log spamming of file watchers as was reported in this thread. The solution to this was to upgrade Microk8s… which now causes dashboards to start failing. As such, I am going to have to install my own DNS solution, thus reducing the usefulness of Microk8s yet further. Kubernetes is so much fun. I’m so glad I chose this as a career.
SUCCESS! Sort of. The invalid UDP responses were coming from the forward to mshome.net (Microsoft’s connection-sharing system that is part of WSL and Hyper-V), which was the nameserver for external requests. CoreDNS is still absolutely broken, but the problem also appears to be that the connection-sharing system in Windows 10/11 will sometimes return oversized UDP responses. Other times, it works fine! So to solve this, in my CoreDNS configmap, I added the IPs for Google’s public DNS (8.8.8.8 and 8.8.4.4) to the request forwarding. Now it will look to mshome.net and a known-good DNS. With this, my requests succeed!
forward . /etc/resolv.conf 8.8.8.8 8.8.4.4
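In context, the relevant chunk of the coredns ConfigMap (edited via `kubectl -n kube-system edit configmap coredns`) looks roughly like this; I’ve trimmed it to the common plugins, and your Corefile will have more:

```
.:53 {
    errors
    health
    kubernetes cluster.local in-addr.arpa ip6.arpa
    prometheus :9153
    # Fall back to Google's public DNS alongside the host resolver,
    # which on Hyper-V points at mshome.net
    forward . /etc/resolv.conf 8.8.8.8 8.8.4.4
    cache 30
    loop
    reload
}
```

The forward plugin tries the listed upstreams, so adding known-good resolvers after `/etc/resolv.conf` gives it somewhere to go when the mshome.net responses are malformed.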
Ideally, the CoreDNS problem is solved in the future, but this is fine.