Intermittent alert image rendering timeout (408 / rendering.serverTimeout) with remote grafana-image-renderer (Docker, 12GB RAM allocated)

Hello,

I am experiencing intermittent image rendering failures for alert notifications when using Grafana with a remote grafana-image-renderer service in Docker.

Manual rendering works, but alert-triggered screenshots frequently fail with timeout errors.


Environment

  • Grafana OSS (Docker)

  • grafana-image-renderer (Docker, remote service)

  • Prometheus datasource

  • Unified Alerting enabled

  • SMTP configured and working

  • Host resources:

    • 15 GB RAM total

    • ~14 GB available

    • No system memory pressure


Current Configuration

Grafana container (Docker)

Environment variables:

GF_RENDERING_SERVER_URL=http://grafana-renderer:8081/render
GF_RENDERING_CALLBACK_URL=http://grafana:3000/
GF_RENDERING_RENDERER_TOKEN=S3cureRend3rT0ken_ChangeMe_1234567890

GF_UNIFIED_ALERTING_ENABLED=true
GF_UNIFIED_ALERTING_SCREENSHOTS_CAPTURE=true
GF_UNIFIED_ALERTING_SCREENSHOTS_CAPTURE_TIMEOUT=30s
GF_UNIFIED_ALERTING_EVALUATION_TIMEOUT=90s

GF_LOG_FILTERS=rendering:debug,ngalert.image:debug


Renderer container (Docker)

Renderer is running with increased resources:

--cpus="4.0"
--memory="12g"
--shm-size="4g"

AUTH_TOKEN=S3cureRend3rT0ken_ChangeMe_1234567890
RENDERING_MODE=clustered
RENDERING_CLUSTERING_MODE=context
RENDERING_CLUSTERING_MAX_CONCURRENCY=1
RENDERING_RENDER_TIMEOUT=60
RENDERING_TIMEOUT=60


What works

  • Share → Direct link rendered image works correctly.

  • /render/version endpoint works.

  • SMTP works and emails are sent.

  • Token authentication between Grafana and renderer is correct (no 401 errors anymore).

  • Renderer container has sufficient RAM (12 GB allocated).


What fails

When the alert fires, Grafana calls the renderer with:

timeout=30

After 30 seconds, the render request fails with a timeout.


Grafana Logs

calling remote rendering service
url="http://grafana-renderer:8081/render?...timeout=30..."

Failed to render image
error="[rendering.serverTimeout]"

Failed to take an image
reason="transition to alerting"


Renderer Logs

uri="/render?...timeout=30..."
status=408
status_text="Request Timeout"
duration=29.979s


Important Observation

  • Manual rendering works.

  • Alert rendering fails.

  • The renderer responds exactly at 30 seconds with status=408.

  • The timeout parameter passed from Grafana is still timeout=30, even though the renderer itself is configured for 60 seconds.

This indicates that the limiting factor is Grafana’s alert screenshot timeout (30s), not renderer capacity.
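One way I could check this is to request the same panel manually with the 30-second budget that the alert path uses, and compare it with a 60-second budget. Below is a minimal shell sketch, not something I have confirmed on every Grafana version. It assumes curl is available on the Docker network, that GRAFANA_TOKEN holds a service account token, and that Grafana's /render endpoint honours a timeout query parameter. The dashboard UID, slug and panel ID in the commented example are hypothetical placeholders.

```shell
#!/bin/sh
# Build a Grafana /render URL for a single panel (d-solo view).
# Arguments: base URL, dashboard UID, slug, panel ID, timeout in seconds.
build_render_url() {
  printf '%s/render/d-solo/%s/%s?panelId=%s&width=1000&height=500&timeout=%s' \
    "${1%/}" "$2" "$3" "$4" "$5"
}

# Request the image, discard the body, and print "<http_status> <seconds_taken>".
# GRAFANA_TOKEN is assumed to hold a service account token.
time_render() {
  curl -s -o /dev/null -w '%{http_code} %{time_total}\n' \
    -H "Authorization: Bearer ${GRAFANA_TOKEN}" "$1"
}

# Example with placeholder values (replace with a real UID, slug and panel ID):
# time_render "$(build_render_url http://grafana:3000/ abc123 my-dashboard 2 30)"
# time_render "$(build_render_url http://grafana:3000/ abc123 my-dashboard 2 60)"
```

If the 30-second call errors out while the 60-second call succeeds, the panel simply needs more than 30 seconds to become ready, independent of alerting.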


Alert Details

The alert is for a single panel:

  • One query

  • One panel (d-solo view)

  • Mikrotik total connection count metric

However, the panel belongs to a dashboard that contains template variables (including the All value, $__all, and query-based variables).

It seems that during alert rendering, the full dashboard context (template variables, Scenes-based rendering) delays page readiness, so the render sometimes exceeds the 30-second screenshot timeout.


Question

Is this expected behavior in Grafana 11/12 with:

  • Unified Alerting

  • Remote renderer

  • Scene-based dashboards

  • Query-type template variables

Is the only reliable solution to create a separate minimal dashboard without variables for alert screenshots?

Or is there a recommended configuration change to prevent rendering.serverTimeout for alert screenshots?

Concurrency can be a problem here. Try adjusting concurrent_render_request_limit (in the [rendering] section) and max_concurrent_screenshots (in [unified_alerting.screenshots]).
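As a rough sketch, those two settings could be passed to the Grafana container using the same GF_<SECTION>_<KEY> environment convention as the variables listed in the question. The values are illustrative starting points only, chosen on the assumption that the Grafana side should not queue more work than a renderer running with RENDERING_CLUSTERING_MAX_CONCURRENCY=1 can handle:

```shell
# [unified_alerting.screenshots] max_concurrent_screenshots (illustrative value)
GF_UNIFIED_ALERTING_SCREENSHOTS_MAX_CONCURRENT_SCREENSHOTS=1
# [rendering] concurrent_render_request_limit (illustrative value)
GF_RENDERING_CONCURRENT_RENDER_REQUEST_LIMIT=5
```

If several alert rules fire in the same evaluation cycle, a lower screenshot concurrency should reduce the number of renders competing for the single renderer slot, at the cost of screenshots being taken one after another.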