Server unresponsive after upgrade to Grafana OSS 8.5.2 from 8.2.0

After upgrading our self-hosted (AWS ECS) Grafana from 8.2.0 to 8.5.2, I started noticing that Grafana was very slow. Even loading the homepage ('/') would take more than 1 minute. Eventually the server would stop sending responses at all and would remain in that state. I have rolled back to 8.2.0 and it seems to work better, but it is not 100% back to normal. After rolling back I eventually get one instance of a very slow response, and then everything is back to normal.

I checked the ECS task metrics: CPU is mostly idle (always below 3%), memory usage is fine, and there are no network issues. RDS (Postgres) is also mostly idle, with no issues there. I also set the log level to debug and re-deployed 8.5.2, but I can't find any relevant logs that could explain the issue.

I need help debugging this, and I would also like to know if anyone else is experiencing this issue.

Have you tried the good old reboot server and restart service?

Yes, several times. Rather than rebooting, I stopped the ECS task and started a new one with the same config. I also redeployed several times, which has the same effect. Always the same issue with 8.5.2. Is there a way to get more detailed logs on what Grafana is doing in the background? Is it possible to have Grafana log all the arriving HTTP requests?

Checking the Grafana logs would be one way.

I had the same issue. I rebooted and it was all good, except I lost all of my HTTPS configuration. Check whether you lost any config settings, which you are supposed to save before upgrading.

No luck for me. I can't reboot the server as I'm running Grafana in ECS; what I'm doing is stopping the task and starting a new one.

I managed to get more details on the issue. It seems that the delay is in the initial connection, which is the time taken to perform the initial TCP handshake and negotiate SSL. Usually slowness here is due to congestion: the server has hit a limit and can't respond to new connections. I was wondering if it would be possible to see this in the logs.
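For anyone who wants to measure the initial connection phases outside the browser, here is a minimal sketch that times the DNS lookup, the TCP handshake, and the TLS negotiation separately using only the Python standard library. The function name `time_connection` and the host `grafana.example.com` are placeholders I made up for illustration; substitute your own endpoint. It only shows where the time goes on the client side, not why the server or load balancer is slow.

```python
import socket
import ssl
import time


def time_connection(host, port=443, use_tls=True, timeout=10.0):
    """Return seconds spent in DNS lookup, TCP connect, and TLS handshake."""
    timings = {}

    # Resolve the name first so DNS time is not counted as connect time.
    start = time.perf_counter()
    family, socktype, proto, _, addr = socket.getaddrinfo(
        host, port, type=socket.SOCK_STREAM
    )[0]
    timings["dns"] = time.perf_counter() - start

    sock = socket.socket(family, socktype, proto)
    sock.settimeout(timeout)
    try:
        # connect() returns once the TCP three-way handshake completes.
        start = time.perf_counter()
        sock.connect(addr)
        timings["tcp_connect"] = time.perf_counter() - start

        if use_tls:
            context = ssl.create_default_context()
            # wrap_socket() performs the TLS handshake before returning.
            start = time.perf_counter()
            sock = context.wrap_socket(sock, server_hostname=host)
            timings["tls_handshake"] = time.perf_counter() - start
    finally:
        sock.close()
    return timings


if __name__ == "__main__":
    # Hypothetical hostname; replace with the address of your own instance.
    for attempt in range(5):
        print(attempt, time_connection("grafana.example.com"))
```

Running it a few times in a row, once against the load balancer address and once from inside the VPC if possible, should help show whether the delay comes from the TCP connect or the TLS step, and whether it is consistent or intermittent.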

It works fine on my ECS. It doesn't look like a Grafana issue, but an issue with your infrastructure. Investigate it in your browser (network console: which requests are slow, which timings are slow), check your proxy/VPN, check your ALB CloudWatch metrics, and so on. There are many moving parts between the client and your ECS task, and the problem can be in any of them.

I found the issue. It was indeed a problem with my infrastructure, not related to Grafana. Thanks for your support.