Grafana Image Renderer Performance Tips

Hello,

Does anyone have tips for improving the performance of the Grafana Image Renderer?

I'd like to find a way to send at least 100-150 alerts with images "at the same time" (not necessarily simultaneously, but triggered at the same moment).

Right now I have Grafana running on an 8-core, 16 GB RAM instance, but when I simulate a general downtime for 100 servers, Grafana only manages to send 8-10 emails with the screenshot; the rest arrive without a screenshot, and grafana.log shows a lot of "Failed to delete render key" and "context deadline exceeded" errors.

For this situation (100-150 alerts at the same instant), with each of them running a query against InfluxDB, is there a recommended server setup or any Grafana settings worth trying?

Grafana version: 10
Server setup: 8 cores, 16 GB RAM, CentOS 7

Hi! :wave:

I see you’re talking about taking images for 100-150 alerts firing at the same time. How many panels do you have? The reason I ask is that Grafana caches images for 1 minute for the same panel to reduce the amount of work that needs to be done by the image renderer. That means if you have 150 alerts, but they are associated with just 10 panels, then Grafana will only take 10 screenshots, not 150.

Hello! I'm considering 150 different panels, with 1 alert for each one.

OK! In that case I suspect you'll want to run the Grafana image renderer in cluster mode. You can either run the Docker image or build and run it from source (GitHub - grafana/grafana-image-renderer: A Grafana backend plugin that handles rendering of panels & dashboards to PNGs using a headless browser (Chromium/Chrome)).
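For reference, a minimal sketch of the Docker route and the matching Grafana settings (the image name and port 8081 are the defaults as far as I know, and the URLs assume both services run on the same host, so adjust for your environment):

docker run -d --name grafana-image-renderer -p 8081:8081 grafana/grafana-image-renderer

[rendering]
server_url = http://localhost:8081/render
callback_url = http://localhost:3000/

The [rendering] block goes in your grafana.ini, so Grafana knows where to reach the renderer and where the renderer should call back.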

The maximum timeout you can configure for screenshots at present is 30 seconds. 150 alerts / 30 seconds is 5 images per second. You’ll probably want to change your screenshot configuration in the Grafana ini file to something like the following:

[unified_alerting.screenshots]
capture_timeout = 30s
max_concurrent_screenshots = 6

Then do the same for the Grafana image renderer and run at a minimum concurrency of 6. You can find more information about this here.
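If you run the renderer as a standalone service, the renderer side of that would look roughly like this, using its environment-variable configuration (the values are only a starting point to tune from):

RENDERING_MODE=clustered
RENDERING_CLUSTERING_MAX_CONCURRENCY=6
RENDERING_CLUSTERING_TIMEOUT=30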

Do you recommend
RENDERING_CLUSTERING_MODE=browser
or
RENDERING_CLUSTERING_MODE=context ?

There are pros and cons to each of them:

Using a cluster of incognito pages is more performant and consumes less CPU and memory than a cluster of browsers. However, if one page crashes it can bring down the entire browser with it (making all rendering requests happening at the same time fail). Also, each page isn't guaranteed to be totally clean (cookies and storage might bleed through, as seen here).

If the disadvantages of context are tolerable, then context should be faster than browser.


So, I tried clustering mode, and even with 20 alerts all CPU cores go to 100%. I tried 16 cores last time, and I get this error in the logs:

logger=rendering t=2023-09-01T09:10:39.516601815-03:00 level=error msg="Failed to delete render key" error="context deadline exceeded"
logger=ngalert.state.manager rule_uid=e4575af2-27d9-4b79-9d19-5142de787596 org_id=1 t=2023-09-01T09:10:39.516701518-03:00 level=warn msg="Failed to take an image" dashboard=cd3f4355-c257-4a5a-b250-e2327ad3b59b panel=20 error="failed to take screenshot: [rendering.serverTimeout] "

Settings:

grafana.ini:

concurrent_render_limit = 10

[unified_alerting.screenshots]
capture_timeout = 30s
max_concurrent_screenshots = 10

server_url = http://localhost:8081/render
callback_url = http://localhost:3000/

Image renderer environment:

RENDERING_MODE=clustered
RENDERING_CLUSTERING_MODE=browser
RENDERING_CLUSTERING_MAX_CONCURRENCY=10
RENDERING_CLUSTERING_TIMEOUT=30

Same happens with RENDERING_CLUSTERING_MODE=context

logger=rendering t=2023-09-01T09:21:38.365757021-03:00 level=error msg="Failed to delete render key" error="context deadline exceeded"
logger=ngalert.state.manager rule_uid=d7b51a7a-8f6a-44c7-bdee-e664b28f0166 org_id=1 instance="datasource_uid=000000009, ref_id=A" t=2023-09-01T09:21:38.365867324-03:00 level=warn msg="Failed to take an image" dashboard=cd3f4355-c257-4a5a-b250-e2327ad3b59b panel=49 error="failed to take screenshot: [rendering.serverTimeout] "
logger=ngalert.sender.router rule_uid=d7b51a7a-8f6a-44c7-bdee-e664b28f0166 org_id=1 t=2023-09-01T09:21:38.375523935-03:00 level=info msg="Sending alerts to local notifier" count=1
logger=rendering renderer=http t=2023-09-01T09:21:38.482955788-03:00 level=error msg="Failed to send request to remote rendering service" error="Get \"http://localhost:8081/render?deviceScaleFactor=1.000000&domain=localhost&encoding=&height=500&renderKey=WAtvYp861gS6O0xzliOno23E15xDcYu1&timeout=30&timezone=&url=http%3A%2F%2Flocalhost%3A3000%2Fd-solo%2Ff564eb2d-88b8-4cc4-9507-c8b55303f485%2Ftrans2001sp-alertas-novo%3Ffrom%3Dnow-1h%26orgId%3D1%26panelId%3D2%26to%3Dnow%26render%3D1&width=1000\": context deadline exceeded"

Logs on grafana image renderer (running as standalone service)

Sep 01 09:15:45 monitor.galafassi.com.br node[15737]: {"level":"info","message":"HTTP Server started, listening at http://localhost:8081"}
Sep 01 09:21:35 monitor.galafassi.com.br node[15737]: {"level":"error","message":"Request failed","stack":"Error: Request aborted\n at onaborted (/etc/grafana/grafana-image-renderer/node_modules/express/lib/response.js:1052:15)\n at Immediate._onImmediate (/etc/grafana/grafana-image-renderer/node_modules/express/lib/response.js:1094:9)\n at processImmediate (node:internal/timers:466:21)","url":"/render?deviceScaleFactor=1.000000&domain=localhost&encoding=&height=500&renderKey=f9KIAp38j2dxodwcXCdpnrpEwjEa5CJB&timeout=30&timezone=&url=http%3A%2F%2Flocalhost%3A3000%2Fd-solo%2Fcd3f4355-c257-4a5a-b250-e2327ad3b59b%2Ftrans2001-alertas-novo%3Ffrom%3Dnow-1h%26orgId%3D1%26panelId%3D82%26to%3Dnow%26render%3D1&width=1000"}
Sep 01 09:21:37 monitor.galafassi.com.br node[15737]: {"level":"error","message":"Request failed","stack":"Error: Request aborted\n at onaborted (/etc/grafana/grafana-image-renderer/node_modules/express/lib/response.js:1052:15)\n at Immediate._onImmediate (/etc/grafana/grafana-image-renderer/node_modules/express/lib/response.js:1094:9)\n at processImmediate (node:internal/timers:466:21)","url":"/render?deviceScaleFactor=1.000000&domain=localhost&encoding=&height=500&renderKey=n2tkARJ3IjEQgYOPmqzbgsvm6XaTUnNR&timeout=30&timezone=&url=http%3A%2F%2Flocalhost%3A3000%2Fd-solo%2Fcd3f4355-c257-4a5a-b250-e2327ad3b59b%2Ftrans2001-alertas-novo%3Ffrom%3Dnow-1h%26orgId%3D1%26panelId%3D35%26to%3Dnow%26render%3D1&width=1000"}
Sep 01 09:21:37 monitor.galafassi.com.br node[15737]: {"level":"error","message":"Request failed","stack":"Error: Request aborted\n at onaborted (/etc/grafana/grafana-image-renderer/node_modules/express/lib/response.js:1052:15)\n at Immediate._onImmediate (/etc/grafana/grafana-image-renderer/node_modules/express/lib/response.js:1094:9)\n at processImmediate (node:internal/timers:466:21)","url":"/render?deviceScaleFactor=1.000000&domain=localhost&encoding=&height=500&renderKey=dO3J8FO5Rusjv6ho7Dme7R8zP7RjWnNZ&timeout=30&timezone=&url=http%3A%2F%2Flocalhost%3A3000%2Fd-solo%2Fcd3f4355-c257-4a5a-b250-e2327ad3b59b%2Ftrans2001-alertas-novo%3Ffrom%3Dnow-1h%26orgId%3D1%26panelId%3D31%26to%3Dnow%26render%3D1&width=1000"}
Sep 01 09:21:43 monitor.galafassi.com.br node[15737]: {"level":"error","message":"Request failed","stack":"Error: Request aborted\n at onaborted (/etc/grafana/grafana-image-renderer/node_modules/express/lib/response.js:1052:15)\n at Immediate._onImmediate (/etc/grafana/grafana-image-renderer/node_modules/express/lib/response.js:1094:9)\n at processImmediate (node:internal/timers:466:21)","url":"/render?deviceScaleFactor=1.000000&domain=localhost&encoding=&height=500&renderKey=WAtvYp861gS6O0xzliOno23E15xDcYu1&timeout=30&timezone=&url=http%3A%2F%2Flocalhost%3A3000%2Fd-solo%2Ff564eb2d-88b8-4cc4-9507-c8b55303f485%2Ftrans2001sp-alertas-novo%3Ffrom%3Dnow-1h%26orgId%3D1%26panelId%3D2%26to%3Dnow%26render%3D1&width=1000"}
Sep 01 09:21:43 monitor.galafassi.com.br node[15737]: {"level":"error","message":"Request failed","stack":"Error: Request aborted\n at onaborted (/etc/grafana/grafana-image-renderer/node_modules/express/lib/response.js:1052:15)\n at Immediate._onImmediate (/etc/grafana/grafana-image-renderer/node_modules/express/lib/response.js:1094:9)\n at processImmediate (node:internal/timers:466:21)","url":"/render?deviceScaleFactor=1.000000&domain=localhost&encoding=&height=500&renderKey=I296mreCBehwmhfKwOkMMfYYWpTagLDx&timeout=30&timezone=&url=http%3A%2F%2Flocalhost%3A3000%2Fd-solo%2Fcd3f4355-c257-4a5a-b250-e2327ad3b59b%2Ftrans2001-alertas-novo%3Ffrom%3Dnow-1h%26orgId%3D1%26panelId%3D49%26to%3Dnow%26render%3D1&width=1000"}
Sep 01 09:21:44 monitor.galafassi.com.br node[15737]: {"level":"error","message":"Request failed","stack":"Error: Request aborted\n at onaborted (/etc/grafana/grafana-image-renderer/node_modules/express/lib/response.js:1052:15)\n at Immediate._onImmediate (/etc/grafana/grafana-image-renderer/node_modules/express/lib/response.js:1094:9)\n at processImmediate (node:internal/timers:466:21)","url":"/render?deviceScaleFactor=1.000000&domain=localhost&encoding=&height=500&renderKey=5rUQYsHm7PL9bkX02zTn2v9HVhXcVRKu&timeout=30&timezone=&url=http%3A%2F%2Flocalhost%3A3000%2Fd-solo%2Fcd3f4355-c257-4a5a-b250-e2327ad3b59b%2Ftrans2001-alertas-novo%3Ffrom%3Dnow-1h%26orgId%3D1%26panelId%3D53%26to%3Dnow%26render%3D1&width=1000"}
Sep 01 09:21:45 monitor.galafassi.com.br node[15737]: {"level":"error","message":"Request failed","stack":"Error: Request aborted\n at onaborted (/etc/grafana/grafana-image-renderer/node_modules/express/lib/response.js:1052:15)\n at Immediate._onImmediate (/etc/grafana/grafana-image-renderer/node_modules/express/lib/response.js:1094:9)\n at processImmediate (node:internal/timers:466:21)","url":"/render?deviceScaleFactor=1.000000&domain=localhost&encoding=&height=500&renderKey=bh2RY9nDRuDDlwBKhkkoT3lLmvnXPYn9&timeout=30&timezone=&url=http%3A%2F%2Flocalhost%3A3000%2Fd-solo%2Fcd3f4355-c257-4a5a-b250-e2327ad3b59b%2Ftrans2001-alertas-novo%3Ffrom%3Dnow-1h%26orgId%3D1%26panelId%3D2%26to%3Dnow%26render%3D1&width=1000"}
Sep 01 09:21:46 monitor.galafassi.com.br node[15737]: {"level":"error","message":"Request failed","stack":"Error: Request aborted\n at onaborted (/etc/grafana/grafana-image-renderer/node_modules/express/lib/response.js:1052:15)\n at Immediate._onImmediate (/etc/grafana/grafana-image-renderer/node_modules/express/lib/response.js:1094:9)\n at processImmediate (node:internal/timers:466:21)","url":"/render?deviceScaleFactor=1.000000&domain=localhost&encoding=&height=500&renderKey=B7NKftxWZGTxyhaprGQqQ07gtCD3Ijeo&timeout=30&timezone=&url=http%3A%2F%2Flocalhost%3A3000%2Fd-solo%2Fcd3f4355-c257-4a5a-b250-e2327ad3b59b%2Ftrans2001-alertas-novo%3Ffrom%3Dnow-1h%26orgId%3D1%26panelId%3D20%26to%3Dnow%26render%3D1&width=1000"}

So, I tried clustering mode, and even with 20 alerts, all CPU cores go to 100%.

I'm not that surprised, to be honest with you: image rendering is very resource intensive. You can try to reduce concurrent_render_limit and max_concurrent_screenshots to lower the peak load on your server, as this may give the image renderer more CPU time before the timeout. If you have panels that query a lot of data and are slow to load, that will also contribute to the timeout.
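As a rough sketch of what I mean, the same settings you already have but with lower limits (the exact numbers here are only a guess to start from, not a recommendation):

grafana.ini:

concurrent_render_limit = 5

[unified_alerting.screenshots]
capture_timeout = 30s
max_concurrent_screenshots = 5

Image renderer environment:

RENDERING_CLUSTERING_MAX_CONCURRENCY=5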

However, the fact that all 16 cores are at 100% does suggest that you're going to need more compute. You might be better off consolidating your alerts onto fewer panels, because then the screenshots can be cached by Grafana and shared between alerts.

Hello, can you explain how you were able to simulate concurrent requests? TIA

Hello, in my case I have more than 100 servers monitored by Grafana, each with a NoData alert enabled, so I can just block the connection to all of these servers with a firewall rule and Grafana then fires alerts for all of them at once.
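Something along these lines, for example (iptables on CentOS 7; the 10.0.0.0/24 subnet is just a placeholder for wherever your monitored servers live, so adjust the rule to however they are reached):

# drop traffic towards the monitored servers so every NoData alert fires at once
iptables -I OUTPUT -d 10.0.0.0/24 -j DROP

# remove the rule again once the test is done
iptables -D OUTPUT -d 10.0.0.0/24 -j DROP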

But in this case we ended up disabling image rendering due to the high resource consumption, since at the end of the day the screenshots are not that critical to us.