Grafana Alloy fails to join cluster with service discovery

tonyswumac · August 7, 2024, 8:16pm

Since the 1.3.0 release, we’ve been unable to stand up Grafana Alloy agents in clustering mode with service discovery. Here is what we are doing:

We are running Grafana Alloy in an AWS ECS cluster.
The ECS service is configured with AWS ECS service discovery, which generates SRV records for each agent container.
The Grafana Alloy start command looks like this:

[
        "run",
        "--disable-reporting=true",
        "--cluster.enabled=true",
        "--cluster.join-addresses=<SERVICE_DISCOVERY_RECORD>",
        "--cluster.max-join-peers=0",
        "--cluster.name=<CLUSTER_ID>",
        "--server.http.listen-addr=0.0.0.0:<PORT>",
        "--storage.path=/data/alloy",
        "/etc/config.alloy"
]

This used to work fine in v1.2.8, but with v1.3.0 it’s now failing with the following errors:

ts=2024-08-07T20:02:27.745041327Z level=error msg="fatal error: failed to get peers to join at startup - this is likely a configuration error" service=cluster err="static peer discovery: failed to find any valid join addresses: failed to extract host and port: address alloy-cluster.services.internal: missing port in address\nfailed to resolve SRV records: lookup alloy-cluster.services.internal on 10.104.96.2:53: no such host"
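
For anyone reading the error: the first half ("missing port in address") appears to come from the join address being tried as a plain host:port pair, and the second half from the SRV lookup. A minimal Go sketch, using only the standard library's net.SplitHostPort, reproduces the first message for a bare DNS name. The helper name hasPort and the port 12345 are made up for illustration; this is not Alloy's actual code.

```go
package main

import (
	"fmt"
	"net"
)

// hasPort reports whether addr parses as host:port.
// When it does not, the parse error from the standard library is returned.
func hasPort(addr string) (bool, error) {
	_, _, err := net.SplitHostPort(addr)
	return err == nil, err
}

func main() {
	addrs := []string{
		"alloy-cluster.services.internal",       // bare name, as in the failing command
		"alloy-cluster.services.internal:12345", // hypothetical port, for contrast
	}
	for _, addr := range addrs {
		ok, err := hasPort(addr)
		fmt.Println(addr, ok, err)
	}
}
```

So the "missing port" part is expected whenever a bare service discovery name is passed; the part that actually matters here is the failed SRV lookup.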

Of course, the service discovery SRV record doesn’t exist until the container is up and running; that’s expected. But without an SRV record the container won’t start, which creates a chicken-and-egg problem. This worked in v1.2 but fails in v1.3.0. Does anyone know if there is any sort of cluster join delay configuration?

williamdumont · August 7, 2024, 10:32pm

Hi, thanks for reporting the problem. You can find more info about it here: alloy metrics pod crashes with alloy v1.2.1 when using prometheus.operator.servicemonitors config · Issue #1349 · grafana/alloy · GitHub.

tonyswumac · August 8, 2024, 9:02pm

I am going to mark this unsolved again. The GitHub issue was closed, and I think it’s a different issue from what I am observing.

It’s unclear why it works with Helm but not when I deploy it through other means. I’ll have to dig into it. But the issue remains: when I try to deploy a fresh Alloy container with cluster mode enabled, it fails if the service discovery record does not exist.

tonyswumac · August 8, 2024, 9:08pm

OK, it looks like it’s relying on Kubernetes’ publishNotReadyAddresses=true feature. In my opinion, this renders cluster mode useless on every other container platform, which is rather unacceptable.

tonyswumac · August 8, 2024, 9:40pm

Opened an issue here: Alloy 1.3.0 cluster mode fails to start with new cluster on non-Kubernetes platform · Issue #1441 · grafana/alloy · GitHub

Hopefully this can be resolved.
