PDC agent connections time out and fail randomly 2-3 times per week

I am using PDC (Private Data Source Connect) to connect my ClickHouse instance to Grafana Cloud. I followed the documentation and launched three agents for high availability. However, I still experience disconnections about 2–3 times per week. The errors usually occur when I query the data. Here are some sample errors:

  • Error = failed to execute query [A]: error querying the database: write: write tcp 10.18.118.94:52240->10.52.180.51:443: write: connection timed out
  • failed to execute query [A]: error querying the database: downstream error: socks connect tcp private-datasource-connect.hosted-grafana.svc.cluster.local:443->localhost:9000: EOF
  • Post "https://query-grafana-app-main.grafana-datasources.svc.cluster.local.:6443/apis/query.grafana.app/v0alpha1/namespaces/stacks-1093114/query": context deadline exceeded

When these errors occur, I try refreshing the dashboard. Some panels manage to update, but most of them continue to fail. Which panels succeed appears to be random on each refresh.

I checked the PDC agent logs on my machine and found nothing abnormal.