Since the 1.3.0 release, we’ve been unable to stand up Grafana Alloy agents in clustering mode with service discovery. Here is what we are doing:
- We are running Grafana Alloy in an AWS ECS cluster.
- The ECS service is configured with AWS ECS service discovery, which generates SRV records for each agent container.
- The Grafana Alloy start command looks like this:
[
  "run",
  "--disable-reporting=true",
  "--cluster.enabled=true",
  "--cluster.join-addresses=<SERVICE_DISCOVERY_RECORD>",
  "--cluster.max-join-peers=0",
  "--cluster.name=<CLUSTER_ID>",
  "--server.http.listen-addr=0.0.0.0:<PORT>",
  "--storage.path=/data/alloy",
  "/etc/config.alloy"
]
This used to work fine in v1.2.8, but with v1.3.0 it’s now failing with the following errors:
ts=2024-08-07T20:02:27.745041327Z level=error msg="fatal error: failed to get peers to join at startup - this is likely a configuration error" service=cluster err="static peer discovery: failed to find any valid join addresses: failed to extract host and port: address alloy-cluster.services.internal: missing port in address\nfailed to resolve SRV records: lookup alloy-cluster.services.internal on 10.104.96.2:53: no such host"
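To make the failure mode concrete, here is a rough sketch of how we read that error: each join address appears to be tried first as a literal host:port, then as a DNS name with SRV records, and startup fails if neither yields a peer. This is only our interpretation of the log message, not Alloy's actual code. The function names and the injected `lookup_srv` resolver are hypothetical, and IPv6 literals are not handled.

```python
from typing import Callable, List, Tuple

# Hypothetical resolver type: takes a DNS name and returns (host, port) pairs
# from its SRV records, or raises LookupError when the name does not exist.
SrvLookup = Callable[[str], List[Tuple[str, int]]]


def split_host_port(addr: str) -> Tuple[str, int]:
    """Split 'host:port'; raise ValueError if there is no usable port."""
    host, sep, port = addr.rpartition(":")
    if not sep or not host or not port.isdigit():
        raise ValueError(f"address {addr}: missing port in address")
    return host, int(port)


def resolve_join_addresses(addrs: List[str], lookup_srv: SrvLookup) -> List[Tuple[str, int]]:
    """Try each address as host:port first, then fall back to an SRV lookup."""
    peers: List[Tuple[str, int]] = []
    errors: List[str] = []
    for addr in addrs:
        try:
            peers.append(split_host_port(addr))
            continue  # literal host:port worked, no DNS needed
        except ValueError as exc:
            errors.append(f"failed to extract host and port: {exc}")
        try:
            peers.extend(lookup_srv(addr))
        except LookupError as exc:
            errors.append(f"failed to resolve SRV records: {exc}")
    if not peers:
        # Both strategies failed for every address: this is the fatal case.
        raise RuntimeError("failed to find any valid join addresses: " + "; ".join(errors))
    return peers
```

With a resolver that raises `LookupError("no such host")`, calling `resolve_join_addresses(["alloy-cluster.services.internal"], ...)` fails with both messages combined, which matches the shape of the error we see on the first container.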
Of course, the service discovery SRV record doesn’t exist until the container is up and running, but that’s expected. Without an SRV record, though, the container won’t start, which creates a chicken-and-egg problem. This used to work in v1.2, but not with v1.3.0. Does anyone know if there is any sort of cluster join delay configuration?