Keep-Alive for long-running panel queries when fronted by a reverse proxy

  • What Grafana version and what operating system are you using?
    Grafana v9.3.2 Alpine image, running on Azure App Service

  • What are you trying to achieve?
    I’ve got Grafana (9.3.2 Alpine container) deployed to an Azure App Service, connected to a Loki datasource. Everything has been working OK, except that long-running queries to Loki fail.

For longer Loki queries, I’m receiving a timeout from the Azure App Service side:

1 queries with total query time of 4.00 min
status: 504
statusText: "Gateway Timeout"

This always times out after 4 minutes, and unfortunately I’ve found that it’s an Azure limit that cannot be increased.

Is it possible to handle a query timeout from an external load balancer without increasing the timeout on the load balancer, but instead by sending a TCP Keep-Alive for the connection?

I was looking into the dataproxy settings and found the keep_alive_seconds setting, which I thought might work for this use case (see “Configure Grafana” in the Grafana documentation).

Unfortunately, experimenting with this value hasn’t helped, so I was wondering if anyone else has experienced something similar and knows of a way around it before I switch my infrastructure setup.

Here are my relevant proxy config settings:

[server]
# The full public facing url you use in browser, used for redirects and emails
# If you use reverse proxy and sub path specify full url (with sub path)
root_url = <The fqdn for my loadbalancer>

[dataproxy]
# This enables data proxy logging, default is false
logging = true

# How long the data proxy waits to read the headers of the response before timing out
timeout = 30
# How long the data proxy waits to establish a TCP connection before timing out
dialTimeout = 10
# How many seconds the data proxy waits before sending a keepalive probe request.
keep_alive_seconds = 15

Thanks in advance!

I have not experienced this, but why do you have long-running queries in the first place? I think fixing that would be my priority. Won’t any other config changes (timeout increases, keep-alive settings, etc.) just keep pushing the root issue down the road?