[database] locking_attempt_timeout_sec purpose

Hi Grafana Community,

I’m seeking clarification on the purpose and expected behavior of locking_attempt_timeout_sec as described in the official documentation:

What I understand and expect:

I’m configuring a Grafana High Availability (HA) setup with two Grafana instances and a shared Postgres database, using Ansible for automation.

  • During simultaneous startup, I expect that one Grafana instance acquires the lock on the database for schema migrations.
  • The second Grafana instance, if unable to acquire the lock immediately, should wait for the duration specified by locking_attempt_timeout_sec before retrying or failing.

The relevant part of my grafana.ini configuration is:

[database]  
type = postgres  
host = {{ grafana_db_host }}:5432  
name = {{ grafana_db_name }}  
user = {{ grafana_username }}  
password = {{ grafana_database_password }}  
ssl_mode = disable  
locking_attempt_timeout_sec = 30  

What I’m observing (without locking_attempt_timeout_sec):

I’m using Ansible to configure and restart both instances simultaneously.

  • The first instance acquires the database lock and starts correctly.
  • The second instance fails to acquire the lock and logs errors like:
# First attempt
logger=migrator t=2025-01-30T17:19:38.428802237Z level=error msg="Failed to lock database" error="failed to obtain lock"

#... similar attempts 

# Last attempt
logger=migrator t=2025-01-30T17:19:39.48318924Z level=error msg="Failed to lock database" error="failed to obtain lock"  

  • The second instance makes 5 attempts within 1-2 seconds, and then its systemd service fails to start.
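
As a quick check of the timing, the gap between the first and last log lines quoted above can be computed from their timestamps. This is only a small Python sketch; the helper name parse_ts is hypothetical and not part of Grafana or any library:

```python
from datetime import datetime, timezone


def parse_ts(ts: str) -> float:
    """Convert a log timestamp such as 2025-01-30T17:19:38.428802237Z to epoch seconds.

    The fractional part is handled separately because it can have up to
    nine digits, which strptime's %f directive does not accept.
    """
    base, _, frac = ts.rstrip("Z").partition(".")
    whole = datetime.strptime(base, "%Y-%m-%dT%H:%M:%S").replace(tzinfo=timezone.utc)
    return whole.timestamp() + (float("0." + frac) if frac else 0.0)


# Timestamps copied from the first and last "Failed to lock database" lines.
first = parse_ts("2025-01-30T17:19:38.428802237Z")
last = parse_ts("2025-01-30T17:19:39.48318924Z")

elapsed = last - first
print(f"{elapsed:.3f} s between first and last attempt")
# prints: 1.054 s between first and last attempt
```

So all attempts shown in the log happen within roughly one second, far below the 30 or 60 seconds configured.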

Even with locking_attempt_timeout_sec set to 30 or even 60 seconds, I observed the same behavior. The retries happen very quickly and do not respect the specified timeout.

After the failure, I can manually restart the second Grafana instance successfully once the lock is released, but this requires an additional manual or automated restart, which I was hoping to avoid with the proper use of locking_attempt_timeout_sec.


Expected Behavior:

I expected the second Grafana instance to keep waiting for the lock for the specified timeout (30 or 60 seconds) before its systemd service startup fails, giving the first instance time to finish migrations and release the lock.
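
To make the expectation concrete, here is a minimal Python sketch of the "keep trying until the timeout elapses" behavior described above. It is not Grafana's migrator code and makes no claim about how locking_attempt_timeout_sec is actually implemented; acquire_with_timeout, make_fake_lock, and poll_interval_sec are hypothetical names used only for illustration:

```python
import time
from typing import Callable


def acquire_with_timeout(
    try_lock: Callable[[], bool],
    timeout_sec: float,
    poll_interval_sec: float = 1.0,
) -> bool:
    """Retry try_lock() until it succeeds or timeout_sec has elapsed.

    Returns True if the lock was obtained, False if the deadline passed.
    """
    deadline = time.monotonic() + timeout_sec
    while True:
        if try_lock():
            return True
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        # Sleep between attempts, but never past the deadline.
        time.sleep(min(poll_interval_sec, remaining))


def make_fake_lock(free_after_attempts: int) -> Callable[[], bool]:
    """Stand-in for a database lock that becomes free on the Nth attempt."""
    state = {"attempts": 0}

    def try_lock() -> bool:
        state["attempts"] += 1
        return state["attempts"] >= free_after_attempts

    return try_lock


# Simulated second instance: the lock is released before the timeout expires,
# so startup proceeds instead of failing after a burst of quick retries.
acquired = acquire_with_timeout(make_fake_lock(3), timeout_sec=1.0, poll_interval_sec=0.01)
print("lock acquired:", acquired)
```

With behavior like this, the second instance would only fail if the first one held the lock longer than the configured timeout.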


System Details:

  • Grafana version: 11.4.0
  • OS: Debian 12