Loki error on port 9095 - error contacting scheduler

I’m developing a Docker Swarm application and I have a problem with Loki.
Sometimes I get this error in the logs:

level=error caller=scheduler_processor.go:87 msg="error contacting scheduler" err="rpc error: code = Unavailable desc = connection error: desc= "transport: Error while dialing dial tcp 10.0.0.47:9095 i/o timeout"" addr=10.0.0.47:9095

I say “sometimes” because the error is intermittent: after some new deploys, seemingly at random, everything runs like a charm.
The strange thing is that port 9095 isn’t used by any service in my swarm, so I really don’t know why Loki says it cannot dial that port.

This is how I defined the service in my docker-compose.yml:

...
mon_loki:
    image: grafana/loki:2.5.0
    hostname: mon_loki
    restart: always
    ports:
      - 3100:3100
    command: -config.file=/etc/loki/local-config.yaml
    volumes:
      - /data/loki/config.yaml:/etc/loki/config.yaml
      - /data/loki:/data/loki
    deploy:
      placement:
        constraints:
          - node.role == worker
    depends_on: 
      - mon_node-exporter
      - mon_cadvisor
    networks:
      - docker
...

This is Loki’s config.yaml:

auth_enabled: false

server:
  http_listen_port: 3100

ingester:
  lifecycler:
    address: 127.0.0.1
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1
    final_sleep: 0s
  chunk_idle_period: 5m
  chunk_retain_period: 30s

schema_config:
  configs:
  - from: 2020-05-15
    store: boltdb
    object_store: filesystem
    schema: v11
    index:
      prefix: index_
      period: 168h

storage_config:
  boltdb:
    directory: /tmp/loki/index

  filesystem:
    directory: /tmp/loki/chunks

limits_config:
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 168h

I was experiencing the same issue on Docker Swarm. In my setup the Loki container is attached to more than one network (the ingress network, where it exposes port 3100, and a private network shared with promtail). The issue seems to be related to setups with more than one network.

When I inspected the Loki configuration (with the -print-config-stderr argument), I noticed that the instance_interface_names setting listed all of my network interfaces, in random order. Using telnet, I found that the scheduler (which listens on port 9095) was bound to only one of those addresses, and it was not the address the processor was trying to connect to.
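
The telnet check described above can also be scripted. The following is only a minimal sketch, assuming Python 3 is available in a container attached to the same networks as Loki. The function names and the example addresses are hypothetical and not part of Loki; only 10.0.0.47 and port 9095 come from the error message in the question.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False


def probe(addresses, port: int = 9095, timeout: float = 3.0) -> dict:
    """Map each candidate address to whether the port accepted a connection."""
    return {addr: can_connect(addr, port, timeout) for addr in addresses}


# Example usage (hypothetical address list: use the container's address
# on each attached network):
#
#   for addr, ok in probe(["127.0.0.1", "10.0.0.47"]).items():
#       print(f"{addr}:9095 -> {'open' if ok else 'unreachable'}")
```

If only one of the container's addresses accepts connections on 9095 while the error message names a different one, that matches the mismatch described above.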

As a workaround, I set the following properties to force the internal traffic between Loki components to use only the loopback interface:

common:
  instance_interface_names:
    - "lo"
  ring:
    instance_interface_names:
      - "lo"

I don’t know if this is the best approach for production environments, and I’m still testing it, but it seems to have solved the issue in my setup.

Thank you @rengenesio, I have also changed the network configuration to solve the problem.