Hello guys.
I know most of you are busy, so I’ll try to include as much detail as possible to explain the problem.
Grafana version: v9.4.3
Plugin: grafana-image-renderer 3.6.4
OS: Ubuntu Server 22.04 on an AWS EC2 instance.
I have worked with Grafana before, but not with the latest versions. I already run 3 different setups with Grafana and grafana-image-renderer, but they use the legacy alerting, not the current so-called Unified Alerting.
In the current setup I’m trying to achieve the same result as before:
Grafana + alerts + images (screenshots) in the alert notifications.
I’ve created the following setup:
A dedicated server for the Grafana service, its plugins and the memcached service (for better performance and for caching some data source configurations)
An RDS instance with PostgreSQL for storing configurations/dashboards/users/alerts/etc.
A dedicated server running the image rendering service.
How the services are configured:
Grafana listens on port 3000 and sits behind an AWS Application Load Balancer (ALB) that serves the web UI on port 443 (HTTPS).
The Grafana deployment is pretty much standard. The only additional plugin is grafana-image-renderer, installed via the Grafana command-line interface.
The rendering server runs as a standalone Node.js application; I don’t use Docker on any of the servers at the moment. It is installed as described in the README of the grafana/grafana-image-renderer GitHub repository (a Grafana backend plugin that renders panels and dashboards to PNGs using a headless Chromium/Chrome browser).
It runs as a systemd service, listening on 0.0.0.0 on its default port 8081.
The server’s IP address is published as an A record for a dedicated hostname, and that hostname is what the Grafana configuration uses.
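One quick way to confirm, from the Grafana host, that the renderer hostname resolves and that port 8081 accepts TCP connections is a small probe like the one below. It is only a sketch: `tcp_reachable` is a hypothetical troubleshooting helper (not part of Grafana or the renderer), and it tests the TCP handshake only, not the rendering API itself.

```python
import socket


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    Hypothetical troubleshooting helper; not part of Grafana or the renderer.
    """
    try:
        # create_connection resolves the hostname (e.g. via its A record)
        # and tries each returned address in turn.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures (socket.gaierror), timeouts and "connection refused".
        return False
```

Run on the Grafana server, `tcp_reachable("renderer-dev.mydomain.net", 8081)` (hostname and port taken from the `server_url` in this post) should return `True` when DNS and any security groups in between allow the connection.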
The installed Chrome version:
/opt/google/chrome/chrome --version
Google Chrome 111.0.5563.64
I don’t think the problem is the renderer service: its log file shows no errors, and I can clearly see Grafana checking the renderer version.
Example:
{"level":"debug","message":"172.17.40.158 - - [10/Mar/2023:13:07:56 +0000] \"GET /render/version HTTP/1.1\" 200 19 \"-\" \"Grafana/9.4.3\"\n"}
I don’t see any other output in the renderer log, which means that for some reason Grafana doesn’t even try to use it.
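To double-check that claim over a longer time window, the renderer’s JSON log can be filtered for anything other than Grafana’s version probe. This is a minimal sketch assuming one JSON object per line with a `message` field, as in the example line above; `non_probe_messages` is a hypothetical helper name.

```python
import json
from typing import Iterable, Iterator

# Substring of the access-log message produced by Grafana's version check.
VERSION_PROBE = '"GET /render/version '


def non_probe_messages(lines: Iterable[str]) -> Iterator[str]:
    """Yield renderer log messages that are not Grafana's version probe."""
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue
        try:
            entry = json.loads(raw)
        except json.JSONDecodeError:
            continue  # ignore lines that are not JSON
        if not isinstance(entry, dict):
            continue  # ignore JSON values that are not objects
        message = str(entry.get("message", "")).rstrip("\n")
        if VERSION_PROBE not in message:
            yield message
```

Usage, with a placeholder path for wherever the renderer writes its log: `with open("/path/to/renderer.log") as f: print(list(non_probe_messages(f)))`. An empty list means the renderer really saw nothing but version checks.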
Grafana has the following settings regarding images/screenshots.
I’ve disabled legacy alerting and enabled Unified Alerting:
[alerting]
enabled = false
[unified_alerting]
enabled = true
Screenshots:
[unified_alerting.screenshots]
capture = true
capture_timeout = 10s
max_concurrent_screenshots = 5
upload_external_image_storage = true
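The alerting and screenshot flags above can be sanity-checked mechanically. The sketch below uses Python’s `configparser` as a stand-in for Grafana’s own INI loader (an assumption that only holds for simple `key = value` lines), treats a missing key as "off" even though Grafana’s real defaults may differ, and does not see `GF_<SECTION>_<KEY>` environment-variable overrides. `alerting_screenshot_problems` is a hypothetical helper.

```python
import configparser


def alerting_screenshot_problems(ini_text: str) -> list[str]:
    """Return problems in the alerting/screenshot sections; an empty list means none found."""
    # interpolation=None so values containing "%" are read literally.
    cfg = configparser.ConfigParser(interpolation=None)
    cfg.read_string(ini_text)
    problems = []
    # Missing keys/sections fall back to False here; Grafana's defaults may differ.
    if cfg.getboolean("alerting", "enabled", fallback=False):
        problems.append("legacy [alerting] is still enabled")
    if not cfg.getboolean("unified_alerting", "enabled", fallback=False):
        problems.append("[unified_alerting] enabled is not true")
    if not cfg.getboolean("unified_alerting.screenshots", "capture", fallback=False):
        problems.append("[unified_alerting.screenshots] capture is not true")
    upload = cfg.getboolean(
        "unified_alerting.screenshots", "upload_external_image_storage", fallback=False
    )
    provider = cfg.get("external_image_storage", "provider", fallback="")
    if upload and not provider:
        problems.append("upload_external_image_storage is on but no provider is set")
    return problems
```

Feeding it the relevant parts of grafana.ini should return an empty list for the settings shown in this post.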
The image store is configured to use Amazon S3:
#################################### External image storage ##########################
[external_image_storage]
# Used for uploading images to public servers so they can be included in slack/email messages.
# you can choose between (s3, webdav, gcs, azure_blob, local)
provider = s3
[external_image_storage.s3]
endpoint = https://s3.us-east-1.amazonaws.com/
bucket = mybucket-images
region = us-east-1
Rendering settings (the token is the same on the renderer service and in Grafana):
[rendering]
server_url = http://renderer-dev.mydomain.net:8081/render
callback_url = https://grafana-dev.mydomain.net/
renderer_token = MyToken
[plugin.grafana-image-renderer]
rendering_timezone = UTC
rendering_ignore_https_errors = true
rendering_mode = clustered
rendering_clustering_mode = context
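A similar check can be run on the `[rendering]` block: every line should be a `key = value` pair, `server_url`, `callback_url` and `renderer_token` should be non-empty, and `server_url` should contain a host. As above, this is a sketch that uses `configparser` as an approximation of Grafana’s INI handling; `rendering_problems` is a hypothetical helper, and `GF_RENDERING_*` environment overrides are not taken into account.

```python
import configparser
from urllib.parse import urlsplit


def rendering_problems(ini_text: str) -> list[str]:
    """Return problems found in the [rendering] section; an empty list means none found."""
    # allow_no_value=True: a line without "=" becomes a key whose value is None
    # instead of aborting the whole parse, so it can be reported below.
    cfg = configparser.ConfigParser(interpolation=None, allow_no_value=True)
    cfg.read_string(ini_text)
    if not cfg.has_section("rendering"):
        return ["no [rendering] section"]
    section = cfg["rendering"]
    problems = []
    for key, value in section.items():
        if value is None:
            # configparser lower-cases keys, so the reported name is lower-case.
            problems.append(f"line without '=': {key!r}")
    for key in ("server_url", "callback_url", "renderer_token"):
        if not section.get(key):
            problems.append(f"{key} is missing or empty")
    server = urlsplit(section.get("server_url") or "")
    if not server.hostname:
        problems.append("server_url has no host")
    return problems
```

For the `[rendering]` values shown in this post (with `renderer_token = MyToken`) it should return an empty list; a line accidentally written without `=` shows up both as a value-less key and as a missing `renderer_token`.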
Grafana’s root_url points to the Application LB, and it is reachable from the renderer server (required for the callback).
Everything looks normal and Grafana is accessible. I created several dashboards with CloudWatch as the data source.
I created a single alert rule to start testing the new deployment, set up simple e-mail and PagerDuty notifications (so-called contact points) and used the default notification policy.
The alerts work, but no screenshot is attached.
The only error I found in the logs is this:
logger=ngalert.state.manager rule_uid=JhUNV4-4k org_id=1 instance= t=2023-03-10T14:12:00.156815262Z level=debug msg="Keeping state" state=Alerting
logger=ngalert.image rule_uid=JhUNV4-4k org_id=1 dashboard=iudpAn-4z panel=2 t=2023-03-10T14:12:00.156853933Z level=debug msg="Requesting screenshot"
logger=rendering renderer=http t=2023-03-10T14:12:00.168735585Z level=info msg=Rendering path="d-solo/iudpAn-4z/peycho-test-dash?from=now-1h&orgId=1&panelId=2&to=now"
logger=ngalert.state.manager rule_uid=JhUNV4-4k org_id=1 instance= t=2023-03-10T14:12:00.169257399Z level=warn msg="Failed to take an image" dashboard=iudpAn-4z panel=2 error="failed to take screenshot: dial tcp :0: connect: connection refused"
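Grafana is written in Go, and Go’s network errors print the dial target as `dial tcp HOST:PORT: ...`. Read that way, `:0` is an address with an empty host and port 0, and the Go `net.Dial` documentation says an empty host is taken to mean the local system. That suggests Grafana dialed its own host on port 0 rather than `renderer-dev.mydomain.net:8081`. The helper below is a hypothetical sketch for pulling that target out of such log lines.

```python
import re
from typing import Optional, Tuple

# Matches Go-style dial errors such as "dial tcp 10.0.0.5:8081: connect: ...".
_DIAL_RE = re.compile(r"dial (?:tcp|udp)[46]? (?P<addr>\S*):\s")


def dial_target(log_line: str) -> Optional[Tuple[str, int]]:
    """Extract (host, port) from a Go 'dial tcp HOST:PORT: ...' error, or None."""
    match = _DIAL_RE.search(log_line)
    if match is None:
        return None
    host, sep, port = match.group("addr").rpartition(":")
    if not sep or not port.isdigit():
        return None
    # Strip the brackets Go puts around IPv6 literals, e.g. "[::1]:8081".
    return host.strip("[]"), int(port)
```

On the warning line above it returns `("", 0)`: an empty host and port 0.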
On the renderer side I see nothing but the latest version check made by Grafana itself:
{"level":"debug","message":"172.17.40.158 - - [10/Mar/2023:14:11:19 +0000] \"GET /render/version HTTP/1.1\" 200 19 \"-\" \"Grafana/9.4.3\"\n"}
I’m kind of stuck with this error, since it tells me nothing useful.
It doesn’t even show the IP/port it is trying to connect to.
Note that Grafana and the renderer service both have debug logging enabled for all logs (console and file).
I can provide any additional information if necessary.
Thanks in advance.