Loki component="memberlist TCPTransport" msg="failed to read message type" err=EOF

Hello,

I am trying to deploy three Loki v2.9.4 instances in monolithic mode on my Nomad cluster, using Docker. Each instance runs on a different Nomad node. I am facing some issues with the memberlist component, and any help is much appreciated.
Once I spin up the Loki instances, the log fills with the following warning, which keeps repeating roughly every two seconds:

level=warn ts=2024-02-09T11:32:27.228361615Z caller=tcp_transport.go:253 component="memberlist TCPTransport" msg="failed to read message type" err=EOF remote=172.17.0.1:40984

level=warn ts=2024-02-09T11:32:29.229249009Z caller=tcp_transport.go:253 component="memberlist TCPTransport" msg="failed to read message type" err=EOF remote=172.17.0.1:49542

level=warn ts=2024-02-09T11:32:31.230123383Z caller=tcp_transport.go:253 component="memberlist TCPTransport" msg="failed to read message type" err=EOF remote=172.17.0.1:49556

level=warn ts=2024-02-09T11:32:33.231003539Z caller=tcp_transport.go:253 component="memberlist TCPTransport" msg="failed to read message type" err=EOF remote=172.17.0.1:49572

level=warn ts=2024-02-09T11:32:35.231360847Z caller=tcp_transport.go:253 component="memberlist TCPTransport" msg="failed to read message type" err=EOF remote=172.17.0.1:49574

level=warn ts=2024-02-09T11:32:37.23184231Z caller=tcp_transport.go:253 component="memberlist TCPTransport" msg="failed to read message type" err=EOF remote=172.17.0.1:49586

It looks as if Loki is trying to connect to the Docker host IP address (172.17.0.1, as seen in the remote field) instead of the address of the container.
The members are active and are discovered through Consul and a DNS SRV lookup.

Ring status:

Running configuration

      auth_enabled: false

      querier:
        max_concurrent: 16
        
      query_scheduler:
        max_outstanding_requests_per_tenant: 32768

      ingester:
        chunk_idle_period: 5m
        wal:
          dir: /loki/ingester-wal

      storage_config:
        tsdb_shipper:
          active_index_directory: /loki/tsdb-index
          cache_location: /loki/tsdb-cache
          shared_store: s3

      schema_config:
        configs:
          - from: 2024-01-01
            index:
              period: 24h
              prefix: index_
            object_store: s3
            schema: v12
            store: tsdb

      compactor:
        working_directory: /loki/compactor
        retention_enabled: true

      limits_config:
        enforce_metric_name: false
        reject_old_samples: true
        reject_old_samples_max_age: 168h
        retention_period: 4320h
      
      memberlist:
        node_name: loki0
        join_members:
          - _loki._memberlist.service.consul.io

      common:
        ring:
          kvstore:
            store: memberlist
        instance_addr: 10.41.0.226
        storage:
          s3:
            s3forcepathstyle: true
            bucketnames: "loki"
            endpoint: https://URL
            region: eu-de

I haven’t used Nomad before, so I can’t really tell you how to fix it. From a quick glance, Nomad doesn’t seem to have its own networking layer the way Kubernetes does, so it may be a bit tricky.

Essentially, the entry you have in memberlist needs to either be one entry per instance, or something that can be resolved to contain all instances.

For example, your configuration looks like it points at an SRV record, so you should follow the documentation on DNS service discovery (About Grafana Mimir DNS service discovery | Grafana Mimir documentation) and configure it accordingly. It should look like this (not tested):

join_members:
  - dnssrv+_loki._memberlist.service.consul.io

Also your screenshot is not available.

It seems like we’ve encountered a layer 8 issue here!
As you may be aware, Nomad utilizes health checks through TCP connections to monitor the status of running services. Something like this:

    network {
      port "memberlist" {
        to = 7946
      }
    }
    service {
      name = "loki"
      port = "memberlist"

      check {
        type     = "tcp"
        interval = "2s"
        timeout  = "2s"
      }
    }

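Nomad opens a TCP connection to port 7946 every two seconds and closes it again without sending anything. Loki’s memberlist transport expects a message on that port and doesn’t know what to do with an empty TCP connection, so it logs an EOF warning for each check. That would also explain why the remote address is the Docker host (172.17.0.1) rather than another Loki container.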
Thank you for helping!
I didn’t know about the service discovery via dnssrv+ , that’s going to save me some trouble later.
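
For anyone who lands here later, this is a minimal sketch of why a bare TCP health check shows up as EOF. It is not Loki's code. It assumes, based only on the log message "failed to read message type", that the listener starts by reading a single type byte. The names read_message_type and probe_once are made up for illustration.

```python
import socket
import threading


def read_message_type(conn):
    """Read the leading byte; return None if the peer closed first (EOF)."""
    data = conn.recv(1)
    return data[0] if data else None


def probe_once(payload=b""):
    """Connect to a throwaway local listener, optionally send payload, then close.

    Returns what the listener saw as the message type (None means EOF).
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]
    result = {}

    def serve():
        conn, _ = server.accept()
        with conn:
            conn.settimeout(2)
            result["msg_type"] = read_message_type(conn)

    listener = threading.Thread(target=serve, daemon=True)
    listener.start()

    # The "health check": open the connection, send nothing by default, close.
    probe = socket.create_connection(("127.0.0.1", port), timeout=2)
    if payload:
        probe.sendall(payload)
    probe.close()

    listener.join(timeout=5)
    server.close()
    return result.get("msg_type", "no connection")


if __name__ == "__main__":
    print("empty probe ->", probe_once())
    print("probe with one byte ->", probe_once(b"\x01"))
```

An empty probe makes the listener's read return nothing, which is the EOF case from the log. A client that actually sends a byte gets past that first read.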
