Intermittent Bad Gateway error since upgrading Grafana and InfluxDB

I’m struggling a little with this error, because it happens on and off rather than consistently - even in different panels of the same dashboard. This evening I made the rather foolish mistake of upgrading a few things at once (all of this was triggered by an attempt to fix American date formats in the graphs… ironically, that hasn’t been fixed at all despite the upgrade and editing the date formats in the config file). Anyway, the upshot was that I upgraded Grafana to v8.3.4 and InfluxDB to 1.9.6 (at first I went to InfluxDB v2, but that was Very Bad - absolutely everything broke - so I went back to the latest version of InfluxDB v1).

Everything else running with InfluxDB seems happy (although that’s really just my openHAB installation… but it seems to be working with no errors at all). But now Grafana is intermittently flagging Bad Gateway errors. Sometimes they appear on different panels of the dashboard. Sometimes when an error appears the data is missing, but sometimes the data is on the graph with the error flag still displayed. Sometimes when I “save & test” the data source I get the error, but sometimes I don’t. Honestly, it’s just flaky as hell, and I’m not changing anything except hitting the refresh button.

Does anyone have suggestions as to how to trace the problem further? Both InfluxDB and Grafana were installed using Homebrew on Mac OS X. Log file locations, etc. are the defaults.

Here are a few lines from the Grafana log:

logger=context userId=1 orgId=1 uname=admin t=2022-02-07T01:06:57.24+1100 lvl=info msg="Request Completed" method=GET path=/api/live/ws status=0 remote_addr=192.168.11.26 time_ms=1 size=0 referer=
logger=context userId=1 orgId=1 uname=admin t=2022-02-07T01:06:57.39+1100 lvl=eror msg="Request Completed" method=POST path=/api/datasources/proxy/2/query status=502 remote_addr=192.168.11.26 time_ms=1 size=0 referer="http://192.168.11.5:3000/d/x5V4JhViz/weather-and-presence?orgId=1&refresh=5m&from=now-3d&to=now"
logger=context userId=1 orgId=1 uname=admin t=2022-02-07T01:06:57.53+1100 lvl=eror msg="Request Completed" method=GET path=/api/datasources/proxy/1/query status=502 remote_addr=192.168.11.26 time_ms=1 size=0 referer="http://192.168.11.5:3000/d/x5V4JhViz/weather-and-presence?orgId=1&refresh=5m&from=now-3d&to=now"
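
For anyone wanting to see which requests are failing and how often, a small shell helper along these lines could summarise the 502s in a log like the one above. This is only a sketch: count_502 is a made-up name, and it assumes each log line carries status=... and path=... fields in the format shown.

```shell
# Hypothetical helper: count 502 responses per request path in a Grafana log.
# Assumes log lines contain "status=502" and "path=<request path>" as in the excerpt above.
# Usage: count_502 <path-to-grafana-log>
count_502() {
    grep 'status=502' "$1" \
        | grep -o 'path=[^ ]*' \
        | sort \
        | uniq -c \
        | sort -rn
}
```

Calling it with the path to your own Grafana log file prints one line per request path, most frequent first. If every datasource proxy path shows up, that points at something shared (such as the InfluxDB server itself) rather than one particular data source.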

And here’s a screenshot of the corresponding output (a good mix of errors/missing data/working data) - all data is drawn from just two databases within influx, and they seem equally prone to this problem.

OK - I think I might have found the source of the problem, and will post the (possible) solution here in case it hits anyone else:

Tailing the influx log file for a while, I noticed some error lines appearing complaining about “too many open files”. This led me down a rabbit hole of people talking about how to increase the available open files on a Mac system, but eventually I seem to have fixed the problem by doing the following:

sudo launchctl limit maxfiles 10000 20000
brew services restart influxdb@1
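
To check whether the new limit actually took effect, and how close the server is to it, something like the following may help. It is a rough sketch rather than a definitive check: show_fd_usage is a made-up helper name, and it assumes the InfluxDB 1.x server process is called influxd and that lsof and pgrep are available.

```shell
# Show the launchd maxfiles limits (soft, then hard). macOS only, so skip elsewhere.
if command -v launchctl >/dev/null 2>&1; then
    launchctl limit maxfiles
fi

# Soft limit on open file descriptors for the current shell.
ulimit -n

# Hypothetical helper: count the entries lsof reports for a named process.
# Usage: show_fd_usage influxd
show_fd_usage() {
    pid=$(pgrep -x "$1" | head -n 1)
    if [ -n "$pid" ]; then
        # lsof prints a header line, so the count is one higher than the number of entries.
        lsof -p "$pid" | wc -l
    else
        echo "$1 is not running"
    fi
}
```

Running show_fd_usage influxd a few times while refreshing the dashboard gives a rough idea of whether the count is creeping up towards the limit.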

I’ve also created a plist to persist these settings across reboots (using the instructions here), so hopefully that’s the end of this particular heisenbug!