Grafana Alloy fails to join cluster with service discovery

Since the 1.3.0 release, we’ve been unable to stand up Grafana Alloy agents in clustering mode with service discovery. Here is what we are doing:

  1. We are running Grafana Alloy in an AWS ECS cluster.
  2. The ECS service is configured with AWS ECS service discovery, which generates SRV records for each agent container.
  3. The Grafana Alloy start command looks like this:
[
        "run",
        "--disable-reporting=true",
        "--cluster.enabled=true",
        "--cluster.join-addresses=<SERVICE_DISCOVERY_RECORD>",
        "--cluster.max-join-peers=0",
        "--cluster.name=<CLUSTER_ID>",
        "--server.http.listen-addr=0.0.0.0:<PORT>",
        "--storage.path=/data/alloy",
        "/etc/config.alloy"
]

This used to work fine in v1.2.8, but with v1.3.0 it’s now failing with the following errors:

ts=2024-08-07T20:02:27.745041327Z level=error msg="fatal error: failed to get peers to join at startup - this is likely a configuration error" service=cluster err="static peer discovery: failed to find any valid join addresses: failed to extract host and port: address alloy-cluster.services.internal: missing port in address\nfailed to resolve SRV records: lookup alloy-cluster.services.internal on 10.104.96.2:53: no such host"

Of course, the service discovery SRV record doesn’t exist until the container is up and running, which is expected. But without an SRV record the container won’t start, which creates a chicken-and-egg problem. This used to work in v1.2, but not with v1.3.0. Does anyone know if there is any sort of cluster join delay configuration?
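
For anyone hitting the same thing, one possible stopgap is a small entrypoint wrapper that retries the lookup a few times and only passes --cluster.join-addresses when peers were actually found. The following is a minimal sketch of that idea, not a tested fix and not something Alloy provides. The names wait_for_peers and build_alloy_args are hypothetical, and the resolve callable is an assumption: it stands in for whatever SRV lookup you use (for example a DNS library) and should return a list of "host:port" strings. It also assumes a node started with clustering enabled and no join addresses can start on its own so that later nodes can join it.

```python
import time
from typing import Callable, Iterable, List


def wait_for_peers(
    resolve: Callable[[], Iterable[str]],
    attempts: int = 5,
    delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Call resolve() until it returns at least one peer or attempts run out.

    resolve is a hypothetical callable returning "host:port" strings.
    Lookup failures (OSError, which includes socket.gaierror) count as
    "no peers yet" rather than as fatal errors.
    """
    for attempt in range(attempts):
        try:
            peers = list(resolve())
        except OSError:
            peers = []
        if peers:
            return peers
        # Do not sleep after the final attempt.
        if attempt < attempts - 1:
            sleep(delay)
    return []


def build_alloy_args(flags: List[str], peers: List[str], config_path: str) -> List[str]:
    """Build the Alloy argument list, adding the join flag only if peers exist."""
    args = ["run", *flags]
    if peers:
        # The flag takes a comma-separated list of addresses.
        args.append("--cluster.join-addresses=" + ",".join(peers))
    args.append(config_path)
    return args
```

The wrapper would then call wait_for_peers with a real resolver and hand the result of build_alloy_args to the alloy binary. Whether the first container ends up as a usable one-node cluster this way is something to verify against your own setup.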

Hi, thanks for reporting the problem. You can find more info about it here: alloy metrics pod crashes with alloy v1.2.1 when using prometheus.operator.servicemonitors config · Issue #1349 · grafana/alloy · GitHub.

I am going to mark this unsolved again. The GitHub issue was closed, and I think it’s a different issue from what I am observing.

It’s unclear why it works with Helm but not when I try to deploy it through other means. I’ll have to dig into it. But the issue remains: when I try to deploy a fresh Alloy container with cluster mode enabled, it fails if the service discovery record does not exist.

OK, it looks like it’s relying on Kubernetes’ publishNotReadyAddresses=true feature. In my opinion this renders cluster mode useless for every other container platform, which is rather unacceptable.

Opened an issue here: Alloy 1.3.0 cluster mode fails to start with new cluster on non-Kubernetes platform · Issue #1441 · grafana/alloy · GitHub

Hopefully this can be resolved.