Hello,
I am experiencing intermittent image rendering failures for alert notifications when using Grafana with a remote grafana-image-renderer service in Docker.
Manual rendering works, but alert-triggered screenshots frequently fail with timeout errors.
Environment
- Grafana OSS (Docker)
- grafana-image-renderer (Docker, remote service)
- Prometheus datasource
- Unified Alerting enabled
- SMTP configured and working
- Host resources:
  - 15 GB RAM total
  - ~14 GB available
  - No system memory pressure
Current Configuration
Grafana container (Docker)
Environment variables:
GF_RENDERING_SERVER_URL=http://grafana-renderer:8081/render
GF_RENDERING_CALLBACK_URL=http://grafana:3000/
GF_RENDERING_RENDERER_TOKEN=S3cureRend3rT0ken_ChangeMe_1234567890
GF_UNIFIED_ALERTING_ENABLED=true
GF_UNIFIED_ALERTING_SCREENSHOTS_CAPTURE=true
GF_UNIFIED_ALERTING_SCREENSHOTS_CAPTURE_TIMEOUT=30s
GF_UNIFIED_ALERTING_EVALUATION_TIMEOUT=90s
GF_LOG_FILTERS=rendering:debug,ngalert.image:debug
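To cross-check these values against the numbers that later appear in the logs, here is a minimal Python sketch that parses the KEY=value block above and converts the Grafana-style durations to seconds. The helpers `parse_env` and `duration_seconds` are hypothetical. They handle only plain `ms`/`s`/`m`/`h` values (not compound durations such as `1m30s`) and are not part of Grafana.

```python
import re

# Subset of the Grafana environment block above (renderer token omitted).
GRAFANA_ENV = """
GF_RENDERING_SERVER_URL=http://grafana-renderer:8081/render
GF_RENDERING_CALLBACK_URL=http://grafana:3000/
GF_UNIFIED_ALERTING_SCREENSHOTS_CAPTURE_TIMEOUT=30s
GF_UNIFIED_ALERTING_EVALUATION_TIMEOUT=90s
"""

# Seconds per unit for the simple durations this sketch supports.
_UNITS = {"ms": 0.001, "s": 1.0, "m": 60.0, "h": 3600.0}


def parse_env(text: str) -> dict[str, str]:
    """Split KEY=value lines into a dict, skipping lines without '='."""
    env = {}
    for line in text.strip().splitlines():
        key, sep, value = line.partition("=")  # split at the first '=' only
        if sep:
            env[key.strip()] = value.strip()
    return env


def duration_seconds(value: str) -> float:
    """Convert a simple duration such as '30s' or '2m' to seconds (hypothetical helper)."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)(ms|s|m|h)", value)
    if match is None:
        raise ValueError(f"unsupported duration: {value!r}")
    number, unit = match.groups()
    return float(number) * _UNITS[unit]


env = parse_env(GRAFANA_ENV)
capture = duration_seconds(env["GF_UNIFIED_ALERTING_SCREENSHOTS_CAPTURE_TIMEOUT"])
evaluation = duration_seconds(env["GF_UNIFIED_ALERTING_EVALUATION_TIMEOUT"])
print(capture, evaluation)  # 30.0 90.0
```

The 30.0 printed here is the same value that shows up as timeout=30 in the render URLs in the logs.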
Renderer container (Docker)
The renderer runs with increased resources:
--cpus="4.0"
--memory="12g"
--shm-size="4g"
Environment variables:
AUTH_TOKEN=S3cureRend3rT0ken_ChangeMe_1234567890
RENDERING_MODE=clustered
RENDERING_CLUSTERING_MODE=context
RENDERING_CLUSTERING_MAX_CONCURRENCY=1
RENDERING_RENDER_TIMEOUT=60
RENDERING_TIMEOUT=60
What works
- Share → Direct link rendered image works correctly.
- The /render/version endpoint works.
- SMTP works and emails are sent.
- Token authentication between Grafana and the renderer is correct (no more 401 errors).
- The renderer container has sufficient RAM (12 GB allocated).
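To put a number on how long a manual render of the same panel actually takes, and how close it comes to the 30-second alert limit, here is a minimal timing sketch in Python. It assumes the direct render link can be fetched with a Grafana service account token sent as a Bearer header; `time_render` and its injectable `opener` parameter are hypothetical names used only for illustration.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, Optional


def time_render(url: str, token: Optional[str] = None, timeout_s: float = 90.0,
                opener: Callable = urllib.request.urlopen) -> tuple[int, float]:
    """Fetch a rendered-image URL and return (HTTP status, elapsed seconds).

    Assumption: `token` is a Grafana service account token accepted as a
    Bearer header. `opener` is injectable so the function can be tested offline.
    """
    headers = {"Authorization": f"Bearer {token}"} if token else {}
    request = urllib.request.Request(url, headers=headers)
    start = time.perf_counter()
    try:
        with opener(request, timeout=timeout_s) as response:
            response.read()  # wait for the full PNG body, not just the headers
            status = response.status
    except urllib.error.HTTPError as err:
        status = err.code  # e.g. 408 if the render timed out
    return status, time.perf_counter() - start


# Usage (hypothetical: paste the Share → Direct link rendered image URL and a real token):
# status, elapsed = time_render("http://grafana:3000/render/d-solo/...", token="...")
# print(status, f"{elapsed:.1f}s")
```

Running this a few times against the panel used by the alert shows whether manual renders routinely take most of the 30 seconds.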
What fails
When the alert fires, Grafana calls the renderer with:
timeout=30
After 30 seconds, it fails.
Grafana Logs
calling remote rendering service
url="http://grafana-renderer:8081/render?...timeout=30..."
Failed to render image
error="[rendering.serverTimeout]"
Failed to take an image
reason="transition to alerting"
Renderer Logs
uri="/render?...timeout=30..."
status=408
status_text="Request Timeout"
duration=29.979s
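The renderer log fields above can be checked mechanically. This sketch parses the logfmt-style fields with `shlex` and tests whether the 408 arrived within half a second of the timeout Grafana requested. The query string in the real log was elided, so the example URI keeps only `timeout=30`; `parse_logfmt`, `hit_request_timeout`, and the 0.5 s slack are assumptions for illustration.

```python
import shlex
from urllib.parse import parse_qs, urlsplit

# Renderer log fields from this post on one line; the URI is simplified
# because the rest of its query string was elided in the original log.
RENDERER_LOG = 'uri="/render?timeout=30" status=408 status_text="Request Timeout" duration=29.979s'


def parse_logfmt(line: str) -> dict[str, str]:
    """Parse key=value pairs; shlex strips the double quotes around values."""
    fields = {}
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields


def hit_request_timeout(fields: dict[str, str], slack_s: float = 0.5) -> bool:
    """True if the render ended with 408 within slack_s of the requested timeout."""
    query = parse_qs(urlsplit(fields["uri"]).query)
    requested = float(query["timeout"][0])
    # Assumes the duration is logged in seconds with a trailing 's'.
    duration = float(fields["duration"].rstrip("s"))
    return fields["status"] == "408" and abs(requested - duration) <= slack_s


fields = parse_logfmt(RENDERER_LOG)
print(fields["status"], hit_request_timeout(fields))  # 408 True
```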
Important Observation
- Manual rendering works.
- Alert rendering fails.
- The renderer responds after 29.979s, i.e. right at the 30-second mark, with status=408.
- The timeout parameter passed from Grafana is still timeout=30, even though the renderer itself is configured for 60 seconds.
This indicates that the limiting factor is Grafana’s alert screenshot timeout (30s), not renderer capacity.
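The reasoning above amounts to taking the smaller of the two configured limits. A small sketch of that comparison follows, assuming (as the observation does) that a render is cut off by whichever limit expires first. `effective_limit` is a hypothetical helper, and the 60 s value stands for the renderer's RENDERING_TIMEOUT / RENDERING_RENDER_TIMEOUT settings from the configuration above.

```python
# Limits from the configuration in this post, in seconds.
CAPTURE_TIMEOUT_S = 30.0   # GF_UNIFIED_ALERTING_SCREENSHOTS_CAPTURE_TIMEOUT
RENDERER_TIMEOUT_S = 60.0  # RENDERING_TIMEOUT / RENDERING_RENDER_TIMEOUT


def effective_limit(limits: dict[str, float]) -> tuple[str, float]:
    """Return the name and value of the smallest limit.

    Assumption: a render is cut off by whichever configured limit expires first.
    """
    name = min(limits, key=limits.__getitem__)
    return name, limits[name]


name, seconds = effective_limit({
    "alert screenshot capture timeout": CAPTURE_TIMEOUT_S,
    "renderer timeout": RENDERER_TIMEOUT_S,
})
print(name, seconds)  # alert screenshot capture timeout 30.0
```

Under that assumption, raising only the renderer-side timeout cannot help while the request still carries timeout=30.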
Alert Details
The alert is for a single panel:
- One query
- One panel (d-solo view)
- MikroTik total connection count metric
However, the panel belongs to a dashboard that contains template variables (including $__all and query-based variables).
It seems that during alert rendering, loading the full dashboard context (variables / Scenes renderer) delays page readiness, so the render sometimes exceeds the 30-second screenshot timeout.
Question
Is this expected behavior in Grafana 11/12 with:
- Unified Alerting
- Remote renderer
- Scene-based dashboards
- Query-type template variables
Is the only reliable solution to create a separate minimal dashboard without variables for alert screenshots?
Or is there a recommended configuration change to prevent rendering.serverTimeout for alert screenshots?