I have been having a random error which so far I have been unable to track down the cause of. Initially I thought it may of been our InfluxDB datasource which was unable to handle the amount of connections. However I have been having this error on different datasources.
I checked out the thread:
With the same issue however the solution within there is not applicable to myself as the MySQL wait_timeout is set to default (28800) as well the Grafana is also set to default (14400). The logs are not showing much other than the change state for the alerting rule.
The setup is two webservers running Grafana and 1 MySQL DB server. Server is healthy with nothing out of the ordinary to report.
I believe this turned out to be spikes in MySQL connections to the database causing the max_connections to be breached suddenly but immediately dropping back below the maximum which was causing behavior of only a couple of checks generating alerts. This spike was undetected by previous monitoring that was setup only when increasing the amount of connections allowed, flushing the stats and then observing where we able to see the quick spike pushing the connections over the previously set limit.
We are experiencing the exact same thing. I have been doing a lot of research and so far have come up with two options to try:
Increase the connection limit on the MySQL side (max_connections parameter) by either changing the value directly, or by switching to a larger instance (we are using an AWS RDS instance) that has more memory which will inherently increase the value of max_connections which, by default, is derived from: DBInstanceClassMemory/12582880.
Honestly, I don’t fully understand the second option because it seems counterintuitive (ie. I’m already hitting a “Too many connections” limit, so why would I change from ‘unlimited’ to some smaller number?). So I need to test it. But the idea is that maybe the 0/unlimited value is causing it to open too many connections to the MySQL server instead of pooling them.
Please let us know if you’ve tried anything, and if it worked for you. I will do the same. Thanks!
Also, if anyone can comment on the max_open_conn setting in Grafana with details on how/if that affects connection pooling, or an explanation regarding when you might want to change the value and why, please do. Thanks.
In this instance we increase the max_connections to largely above what would be required but still safe within the resources we have already. We have not encountered the issue again and do not expect to as our connections are now way below the new limit.
In regards to the ‘max_open_conn’ I would prefer managing the maximum amount of connections in the database rather than the application but I’m sure this option has some cases where it would be useful.
I would like to think that the amount of connections spiking is in relation to the amount of alerts that we have setup but I have no evidence of this. Generally over time we see that as we gain more alerts the maximum connections breached in the DB is pushed higher and higher.