Loki 2.4.1 empty ring Code(500) error for "GET /loki/api/v1/labels" API on AWS ECS

Hi,

I have searched the Grafana community forums:

  • 76336
  • 75544
  • 75568

and GitHub issues:

  • 7496

I was unable to find a resolution to the problem below. The GitHub issue I opened for it was closed, so I am asking on the forums instead.
I would be glad to get any relevant guidance.

Loki 2.4.1 empty ring Code(500) error for “GET /loki/api/v1/labels” API on AWS ECS #8986

The empty ring error is an indication that one or more components have failed to startup, and therefore failed to register with the consistent hash ring.

I’d suggest checking your Loki installation for errors. The easiest way to do this is to run Loki with the flag -log.level=error.

Can you please try that and see if you can find any relevant log lines?

Also, Loki v2.4.1 is very old at this point (v2.8.0 is the latest release), and we generally do not support much older versions.


Loki Upgrade

  • Upgraded to grafana/loki:2.7.5.
  • grafana/loki:2.8.0 threw the error below while loading the Docker image:
=> ERROR [internal] load metadata for docker.io/grafana/loki:2.8.0
failed to solve with frontend dockerfile.v0: failed to create LLB definition: docker.io/grafana/loki:2.8.0: not found

API error (/loki/api/v1/labels)

  • With -log.level=error, no error is logged until the /loki/api/v1/labels API is called:
level=error ts=2023-04-01T14:20:13.137593213Z caller=series_index_store.go:583 org_id=loki_docker msg="error querying storage" err="context canceled"
level=error ts=2023-04-01T14:20:13.138513539Z caller=retry.go:73 org_id=loki_docker msg="error processing request" try=0 err="rpc error: code = Code(500) desc = empty ring\n"

The request is retried internally 5 times, starting at try=0, so the code = Code(500) desc = empty ring\n error is printed 5 times. The log is shortened here to avoid duplication.

  • I think the issue may be caused by an incorrectly configured join_members value on ECS.

  • The local docker-compose file below creates a custom container network named loki, which the Loki components use to communicate in the ring. That is why it works fine locally.

    • docker-compose.yaml (local)
networks:
  loki:

services:
  read:
    image: grafana/loki:2.7.5
    command: "-config.file=/etc/loki/config.yaml -target=read -config.expand-env=true"
    ports:
      - 3100
      - 7946
      - 9095
    volumes:
      - ./loki-config.yaml:/etc/loki/config.yaml
    networks: &loki-dns
      loki:
        aliases:
          - loki
  • loki-config.yaml (local)
memberlist:
  join_members:
    - loki:7946
  • The ECS task definition doesn’t seem to have this feature (am I correct?).
  • Is there any specific configuration for join_members on ECS?
  • Currently it is set to dns+127.0.0.1:7946, for service discovery in case of scaling,
  • and instance_interface_names is set to lo.
  • I tried docker0 and eth0 as well.

Apologies for the v2.8.0 confusion; I think it may not be fully released yet (coming soon).

I would suggest that you make sure your Loki containers can communicate with each other over the host:port definition that you’ve set in your memberlist config. You can use netcat for this, which each Loki container has installed by default:

  • shell into a Loki container
  • run nc -vz <host> 7946 and see what the response is or exit code (echo $?)

If that doesn’t work, there’s a network issue and the ring members cannot join each other.
I’m not aware of any ECS-specific setup instructions.

I think you can also get rid of dns+ in your dns+127.0.0.1 definition because 127.0.0.1 is not a domain name.
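For anyone who wants to script the netcat check above, here is a minimal Python sketch of the same idea: try to open a TCP connection to host:port and report whether it succeeded. The function name can_connect and the 3-second timeout are assumptions made for illustration, not anything Loki provides. Like nc -vz, it only tests TCP reachability and says nothing about whether memberlist gossip itself is healthy.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures alike.
        return False


# Example (7946 is the memberlist port used in this thread):
#   can_connect("127.0.0.1", 7946)
```

It has to be run from inside the same network context as the Loki container (for example, from a shell in that container or task), otherwise the result says little about what Loki itself can reach.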

I deploy our Loki cluster using ECS as well, and I can tell you it’s less than obvious when dealing with ring membership because of the lack of built-in service discovery. That’s most likely where your errors come from. Make sure your writers have dedicated IPs (which means either host mode or AWSVPC mode), and you need to make sure all containers within Loki cluster have connectivity between each other.

I would need to see both your Loki configuration (particularly the ring part) and your ECS task definition / service definition.

  • I appreciate you pointing that out. I am learning networking from the ground up.

  • Here is my netcat analysis for local Docker and for ECS.

Local Docker (for reference, since Loki works fine there)

  • loki is custom docker-compose network as mentioned here
  • nc in read container.
/ # nc -vvz loki 7946
loki (172.20.0.2:7946) open
sent 0, rcvd 0
/ # nc -vvz loki 7946
loki (172.20.0.3:7946) open
sent 0, rcvd 0
/ # nc -vvz loki 7946
loki (172.20.0.3:7946) open
sent 0, rcvd 0
  • nc in write container
/ # nc -vvz loki 7946
loki (172.20.0.3:7946) open
sent 0, rcvd 0
/ # nc -vvz loki 7946
loki (172.20.0.2:7946) open
sent 0, rcvd 0
/ # nc -vvz loki 7946
loki (172.20.0.3:7946) open
sent 0, rcvd 0

The IP address that loki resolves to seems to alternate between commands. (172.20.0.2 and 172.20.0.3 are the write and read container IPs respectively.)

  • I also tried netcat in listen mode, with read and write acting as listener and client interchangeably, and achieved two-way communication on the loki network.

ECS docker container

  • nc in read container
/ # nc -vvz 127.0.0.1 7946
127.0.0.1 (127.0.0.1:7946) open
sent 0, rcvd 0
/ # nc -vvz 172.17.0.2 7946
172.17.0.2 (172.17.0.2:7946) open
sent 0, rcvd 0
  • nc in write container
/ # nc -vvz 127.0.0.1 7946
127.0.0.1 (127.0.0.1:7946) open
sent 0, rcvd 0
/ # nc -vvz 172.17.0.3 7946
172.17.0.3 (172.17.0.3:7946) open
sent 0, rcvd 0

127.0.0.1 is the address configured for memberlist join_members.
172.17.0.2 and 172.17.0.3 are the write and read container IPs respectively.

  • Hence I am assuming communication is working on 127.0.0.1.
  • Two-way communication worked here too, but I had to provide the container IP addresses explicitly, unlike the network name in the local Docker example above.
  • What exactly am I missing?
  • The GitHub issue (#8986) in the post quoted above has all the supporting data, including loki-config.yaml and nginx.conf.
  • The ECS task definition is below.
{
  ...
  "containerDefinitions": [
    {
      ...
      "entryPoint": [
        "/usr/bin/loki"
      ],
      "portMappings": [
        {
          "hostPort": 0,
          "protocol": "tcp",
          "containerPort": 3100
        },
        {
          "hostPort": 0,
          "protocol": "tcp",
          "containerPort": 7946
        },
        {
          "hostPort": 0,
          "protocol": "tcp",
          "containerPort": 9095
        }
      ],
      "command": [
        "-config.file=/etc/loki/config.yaml",
        "-target=read",
        "-config.expand-env=true"
      ],
      ...
      "environment": [
        {
          "name": "instance_interface_names1",
          "value": "lo"
        },
        {
          "name": "join_members",
          "value": "127.0.0.1:7946"
        }
      ],
      ...
      "image": "loki-2.7.5-image",
      "disableNetworking": false,
      "essential": true,
      "name": "read"
    },
    {
      "entryPoint": [
        "/usr/bin/loki"
      ],
      "portMappings": [
        {
          "hostPort": 0,
          "protocol": "tcp",
          "containerPort": 3100
        },
        {
          "hostPort": 0,
          "protocol": "tcp",
          "containerPort": 7946
        },
        {
          "hostPort": 0,
          "protocol": "tcp",
          "containerPort": 9095
        }
      ],
      "command": [
        "-config.file=/etc/loki/config.yaml",
        "-target=write",
        "-config.expand-env=true"
      ],
      ...
      "environment": [
        {
          "name": "instance_interface_names1",
          "value": "lo"
        },
        {
          "name": "join_members",
          "value": "127.0.0.1:7946"
        }
      ],
      ...
      "image": "loki-2.7.5-image",
      "disableNetworking": false,
      "essential": true,
      "name": "write"
    },
    {
      "entryPoint": [
        nginx-config
      ],
      "portMappings": [
        {
          "hostPort": 3100,
          "protocol": "tcp",
          "containerPort": 3100
        }
      ],
      ...
      "image": "nginx:latest",
      "dependsOn": [
        {
          "containerName": "read",
          "condition": "START"
        },
        {
          "containerName": "write",
          "condition": "START"
        }
      ],
      "disableNetworking": false,
      "essential": true,
      "links": [
        "read:read",
        "write:write"
      ],
      "name": "gateway"
    }
  ],
  ...
  "requiresAttributes": [
    {
      "name": "com.amazonaws.ecs.capability.docker-remote-api.1.17"
    },
    {
      "name": "ecs.capability.container-ordering"
    }
  ],
  "networkMode": "bridge",
}

Unset variables, default null values and sensitive information are excluded for brevity and security.

  • It currently runs in bridge mode for dynamic port mapping, with a view to scaling it further.
  • From what I understand, using host or awsvpc mode would consume the three container ports, which restricts scaling on the same EC2 instance.

That is not going to work. All containers within the Loki cluster have to be able to connect to the writer containers “directly”. To do so, your containers have to be able to discover the writers through service discovery. If you look at the configuration here (About Grafana Mimir DNS service discovery | Grafana Mimir documentation), you’ll notice that even with SRV discovery it doesn’t propagate the port, which means you have to use the native port (meaning no bridge) for your writers.

Second, if you are using simple scalable mode it does not make sense to have writer and reader in one ECS service. The advantage of simple scalable mode is that you can scale writers and readers separately; by putting them in the same ECS service you kinda eliminate that.

Short of giving you our code (work related therefore can’t really do that), here is what I would recommend you to do:

  1. Separate your writers and readers. The easiest way to do this would be to set up one ECS cluster but two autoscaling groups. You can give them different tags (google ECS_INSTANCE_ATTRIBUTES) like this:
echo 'ECS_INSTANCE_ATTRIBUTES={"loki-instance-catogery": "writer"}' >> /etc/ecs/ecs.config

and configure your ECS service to go to those instances with placement_constraints:

  placement_constraints {
    type       = "memberOf"
    expression = "attribute:loki-instance-catogery == writer"
  }
  2. Since writers need a dedicated persistent volume for the WAL, you might as well just give them a dedicated host. Therefore I recommend setting writers to DAEMON type. You’ll want a service discovery zone (say writers.services.discovery) for your writers (with A record, don’t do SRV), and you’ll want AWSVPC network mode because you need the native port.

  3. For readers I’d recommend creating another service discovery zone as well (with A record, say readers.services.discovery), with AWSVPC network mode, but you can set readers to REPLICA mode.

  4. In your Loki configuration, configure memberlist like so:

memberlist:
  bind_addr: ['0.0.0.0']
  bind_port: 7946
  dead_node_reclaim_time: 30s
  gossip_interval: 2s
  gossip_to_dead_nodes_time: 15s
  left_ingesters_timeout: 30s
  join_members:
  - dns+writers.services.discovery:7946
  - dns+readers.services.discovery:7946

Lastly, you can overcome the NIC limitation of AWSVPC mode. See Elastic network interface trunking - Amazon Elastic Container Service.
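To make the shape of the join_members entries in the memberlist config above more concrete, here is a small Python sketch that splits an entry into its discovery prefix, host and port. It is illustrative only and is not how Loki parses its config. The names parse_join_member and JoinMember are hypothetical, the prefix list (dns+, dnssrv+, dnssrvnoa+) is taken from my reading of the DNS service discovery docs linked earlier, and IPv6 literals are not handled.

```python
from typing import NamedTuple, Optional

# Prefixes described in the linked DNS service discovery documentation
# (an assumption of this sketch, checked longest first).
DNS_PREFIXES = ("dnssrvnoa+", "dnssrv+", "dns+")


class JoinMember(NamedTuple):
    mode: Optional[str]  # "dns", "dnssrv", "dnssrvnoa", or None for a plain address
    host: str
    port: Optional[int]


def parse_join_member(entry: str) -> JoinMember:
    """Split a join_members entry such as 'dns+host:7946' into its parts."""
    mode = None
    for prefix in DNS_PREFIXES:
        if entry.startswith(prefix):
            mode = prefix[:-1]            # drop the trailing '+'
            entry = entry[len(prefix):]
            break
    host, sep, port = entry.rpartition(":")
    if not sep:                           # no ':' at all, so no port was given
        return JoinMember(mode, entry, None)
    return JoinMember(mode, host, int(port))
```

With an A-record entry such as dns+writers.services.discovery:7946, the port comes from the entry itself rather than from DNS, which is why the writers need to listen on the native port 7946.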


This is a lot to digest and implement in one go, but I am determined to try and make it work, now that I have more visibility and confirmation that separate readers and writers do work.

I had initially tried to run this architecture (separate reader and writer services, on ECS as well as on Docker), but the respective containers failed to boot (my loki-config was plainly incorrect).

My loki-config will need more effort and experimenting.

I will update with my results. Much appreciated!

  • Separate read and write instances are up, but I am seeing the errors below, which seem related to memberlist and therefore to service discovery.
level=warn ts=2023-04-06T15:37:06.972279357Z caller=resolver.go:133 msg="IP address lookup yielded no results. No host found or no addresses found" host=readers.loki-services.discovery
level=warn ts=2023-04-06T15:37:06.972299779Z caller=memberlist_client.go:601 msg="joining memberlist cluster: found no nodes to join" retries=1
level=error ts=2023-04-06T15:37:10.937910542Z caller=resolver.go:86 msg="failed to lookup IP addresses" host=writers.loki-services.discovery err="lookup writers.loki-services.discovery on 10.0.0.2:53: no such host"
level=warn ts=2023-04-07T09:49:01.494027897Z caller=tcp_transport.go:428 component="memberlist TCPTransport" msg="WriteTo failed" addr=10.0.0.138:7946 err="dial tcp 10.0.0.138:7946: i/o timeout"
level=warn ts=2023-04-07T09:52:21.494670174Z caller=tcp_transport.go:428 component="memberlist TCPTransport" msg="WriteTo failed" addr=10.0.0.142:7946 err="dial tcp 10.0.0.142:7946: i/o timeout"
  • The Cloud Map namespace and the corresponding services with A records were configured in the same VPC as the cluster and the ECS service.
  • Is there any way to troubleshoot this?

Error is host not found for readers.loki-services.discovery. I’d start with a dig command from ecs host and see if it resolves, check your DNS configuration, etc.
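If dig is not available where you are testing, the same lookup can be sketched in Python with the standard library. This is a rough stand-in under a few assumptions: the helper name resolve_ipv4 is made up for illustration, it goes through the system resolver (so it must be run in the same container or host whose DNS settings you want to check), and it only looks at IPv4 addresses.

```python
import socket
from typing import List


def resolve_ipv4(host: str) -> List[str]:
    """Return the sorted, de-duplicated IPv4 addresses that host resolves to."""
    try:
        infos = socket.getaddrinfo(host, None, socket.AF_INET, socket.SOCK_STREAM)
    except socket.gaierror:
        # Same situation as the "no such host" log line: nothing to join.
        return []
    # Each entry is (family, type, proto, canonname, (address, port)).
    return sorted({info[4][0] for info in infos})


# Example, using the hostnames from the log lines above:
#   resolve_ipv4("readers.loki-services.discovery")
#   resolve_ipv4("writers.loki-services.discovery")
```

An empty list corresponds to the "found no nodes to join" warning. A non-empty list combined with i/o timeouts on port 7946 points at connectivity (for example security groups) rather than DNS.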


The architecture is implemented and working. The solution gave me considerable clarity on several ECS and Loki concepts and their wider applications. I appreciate the guidance once again!

PN:-

  1. The service discovery error was resolved once the subnet-level setting below was enabled (dig and dig -x then resolved from both the read and write containers) and the security group settings were modified.
    Subnet setting → Enable resource name DNS A record on launch

  2. I am currently trying to figure out the NIC limitation, but things are in a manageable state now.

Hi!

I know I am kind of hijacking this thread, but I’m chasing an error I could not hunt down a few months ago.

I have the following question regarding the simple-scalable deployment mode. I’m not a native English speaker, so I’m not sure whether I misunderstood something or whether I’m onto something.

Second, if you are using simple scalable mode it does not make sense to have writer and reader in one ECS service.

I’m not running my setup on ECS, but within a local Kubernetes cluster. You seem to imply here that it’s not a good idea to run read and write target services on the same node. Did I misunderstand you, or is that correct?

Explanation: I tried to deploy 5 read targets and 5 write targets equally among 5 nodes, which led to some nodes running both a read and a write service.

Thanks in advance!
Sebastian

PS: If I was misled, I’ll delete this message from the thread.

My comment is specific to ECS cluster. You can absolutely run read and writer services on the same node, even on ECS. My comment wasn’t so much that you can’t run them on the same node, rather that they should not be “defined” in the same “ECS service”.

ECS doesn’t have some of the features that K8S has, so it relies on some external AWS services to do some of the things that K8S just does, one of them being service discovery. If two containers are defined in one ECS service, you get one service discovery, meaning you’d be using the same discovery record for both read and write services. This creates a conflict in that you can’t discover the services separately, as well as a scaling problem where you can’t scale them separately.

