[database] locking_attempt_timeout_sec purpose

oleksiiyuvchenko · January 31, 2025, 8:14am

Hi Grafana Community,

I’m seeking clarification on the purpose and expected behavior of locking_attempt_timeout_sec as described in the official documentation:

What I understand and expect:

I’m configuring a Grafana High Availability (HA) setup with two Grafana instances and a shared Postgres database using Ansible for automation.

During simultaneous startup, I expect that one Grafana instance acquires the lock on the database for schema migrations.
The second Grafana instance, if unable to acquire the lock immediately, should wait for the duration specified by locking_attempt_timeout_sec before retrying or failing.
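
To make this expectation concrete, here is a minimal Python sketch of the semantics I assumed. It is not Grafana's actual migrator code: `try_lock`, `poll_interval_sec`, `clock` and `sleep` are hypothetical names used only for illustration, and the one-second polling interval is an assumption.

```python
import time
from typing import Callable


def acquire_with_timeout(
    try_lock: Callable[[], bool],
    timeout_sec: float,
    poll_interval_sec: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Keep trying to take a lock until it succeeds or timeout_sec elapses.

    try_lock is a hypothetical non-blocking callable that returns True
    when the lock was obtained. This models the behavior I expected,
    not what Grafana actually does.
    """
    deadline = clock() + timeout_sec
    while True:
        if try_lock():
            return True
        remaining = deadline - clock()
        if remaining <= 0:
            # Timeout exhausted: only now give up and report failure.
            return False
        # Wait before the next attempt, but never past the deadline.
        sleep(min(poll_interval_sec, remaining))
```

Under this model, with a timeout of 30 the second instance would keep retrying for up to 30 seconds before giving up, rather than failing after a second or two.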

The relevant part of my grafana.ini configuration is:

[database]  
type = postgres  
host = {{ grafana_db_host }}:5432  
name = {{ grafana_db_name }}  
user = {{ grafana_username }}  
password = {{ grafana_database_password }}  
ssl_mode = disable  
locking_attempt_timeout_sec = 30
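
To rule out a templating mistake, the rendered file can be checked to confirm that the key really ends up in the [database] section. The following is a rough Python sketch using the standard `configparser` module; `read_lock_timeout` is a hypothetical helper name, and it assumes the rendered grafana.ini is plain INI that `configparser` can parse.

```python
import configparser
from typing import Optional


def read_lock_timeout(ini_text: str) -> Optional[int]:
    """Return [database] locking_attempt_timeout_sec, or None if it is not set."""
    # interpolation=None so that a '%' in a password is not treated specially.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    if not parser.has_option("database", "locking_attempt_timeout_sec"):
        return None
    return parser.getint("database", "locking_attempt_timeout_sec")


# Example with an inline fragment; in practice, pass the contents of the
# rendered grafana.ini instead.
rendered = "[database]\ntype = postgres\nlocking_attempt_timeout_sec = 30\n"
print(read_lock_timeout(rendered))
```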

What I’m observing (without locking_attempt_timeout_sec):

I’m using Ansible to configure and restart both instances simultaneously.

The first instance acquires the database lock and starts correctly.
The second instance fails to acquire the lock and logs errors like:

# First attempt
logger=migrator t=2025-01-30T17:19:38.428802237Z level=error msg="Failed to lock database" error="failed to obtain lock"

#... similar attempts 

# Last attempt
logger=migrator t=2025-01-30T17:19:39.48318924Z level=error msg="Failed to lock database" error="failed to obtain lock"

The second instance makes 5 attempts within 1-2 seconds, and then its systemd service fails to start.
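
For reference, the time between the first and last attempts can be computed from the two log lines above. This is a small Python sketch; it assumes the `t=` field is a UTC timestamp with a variable-length fractional part, and `elapsed_seconds` is a hypothetical helper name.

```python
import re
from datetime import datetime

# Matches the t=... field; the fractional part may have any number of digits.
LOG_TS = re.compile(r"t=(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(?:\.(\d+))?Z")


def parse_ts(line: str):
    """Return (whole-second datetime, fractional seconds) for a log line."""
    match = LOG_TS.search(line)
    if match is None:
        raise ValueError("no t= timestamp found in log line")
    base = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S")
    frac = float("0." + match.group(2)) if match.group(2) else 0.0
    return base, frac


def elapsed_seconds(first_line: str, last_line: str) -> float:
    """Seconds between the timestamps of two log lines."""
    first_base, first_frac = parse_ts(first_line)
    last_base, last_frac = parse_ts(last_line)
    return (last_base - first_base).total_seconds() + (last_frac - first_frac)


first = 'logger=migrator t=2025-01-30T17:19:38.428802237Z level=error msg="Failed to lock database" error="failed to obtain lock"'
last = 'logger=migrator t=2025-01-30T17:19:39.48318924Z level=error msg="Failed to lock database" error="failed to obtain lock"'
print(f"{elapsed_seconds(first, last):.3f} s")  # prints "1.054 s"
```

So all attempts happen within roughly one second, far short of a 30 or 60 second timeout.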

Even with locking_attempt_timeout_sec set to 30 or even 60 seconds, I observed the same behavior. The retries happen very quickly and do not respect the specified timeout.

After the failure, I can manually restart the second Grafana instance successfully once the lock is released, but this requires an additional manual or automated restart, which I was hoping to avoid with the proper use of locking_attempt_timeout_sec.

Expected Behavior:

I expected the second Grafana instance to keep waiting for up to the specified timeout (30 or 60 seconds) before its systemd service startup fails, giving the first instance time to finish migrations and release the lock.

System Details:

Grafana version: 11.4.0
OS: Debian 12
