Splitting Loki into read/write targets and scaling on Docker Swarm

Hi,

My production environment runs on a Docker Swarm cluster.
I have been running Loki for a few months as a standalone service with a single replica.

My objectives are simple:

  • use the read/write “split” mode introduced in 2.4.x
  • build a real Loki stack
  • route more logs to Loki
  • be able to scale the Loki services up easily

Storage is provided by a powerful MinIO cluster.

My Swarm stack is:

  loki-read:
    image: grafana/loki:2.4.1
    volumes:
      - type: bind
        source: ./loki.yml
        target: /etc/loki/loki.yml
        read_only: true
    command: "-config.file=/etc/loki/loki.yml -target=read"
    environment:
      TZ: Europe/Paris
    networks:
      internal:
      log-public:
      monitoring-public:
    deploy:
      mode: replicated
      replicas: 2
      resources:
        limits:
          # cpus: "0.50"
          memory: 512M
      placement:
        constraints:
          - "node.role==worker"
      labels:
        prometheus_enable: "false"
        prometheus_scheme: http
        prometheus_port: 3100
        prometheus_path: /metrics
        traefik.enable: "false"

  loki-write:
    image: grafana/loki:2.4.1
    volumes:
      - type: bind
        source: ./loki.yml
        target: /etc/loki/loki.yml
        read_only: true
    command: "-config.file=/etc/loki/loki.yml -target=write"
    environment:
      TZ: Europe/Paris
    networks:
      internal:
      log-public:
      monitoring-public:
    deploy:
      mode: replicated
      replicas: 2
      resources:
        limits:
          # cpus: "0.50"
          memory: 512M
      placement:
        constraints:
          - "node.role==worker"
      labels:
        prometheus_enable: "false"
        prometheus_scheme: http
        prometheus_port: 3100
        prometheus_path: /metrics
        traefik.enable: "false"

My Loki configuration is the same one I use for my current standalone stack, with only the “memberlist” section added:

---
auth_enabled: false
server:
  log_level: warn
  http_listen_address: 0.0.0.0
  http_listen_port: 3100
  grpc_listen_address: 0.0.0.0
  grpc_listen_port: 9095
memberlist:
  join_members:
    - loki-read:7946
    - loki-write:7946
  bind_addr:
    - 0.0.0.0
  bind_port: 7946
  randomize_node_name: false
schema_config:
  configs:
    - from: 2021-08-01
      store: boltdb-shipper
      object_store: s3
      schema: v11
      index:
        prefix: index_
        period: 24h
common:
  path_prefix: /loki
  replication_factor: 1
  storage:
    s3:
      endpoint: minio-gateway:9000
      insecure: true
      bucketnames: loki2-data
      access_key_id: loki
      secret_access_key: a3HSAm9mHQl67QVGH
      s3forcepathstyle: true
  ring:
    instance_addr: 0.0.0.0
    kvstore:
      store: memberlist
ruler:
  storage:
    s3:
      bucketnames: loki2-ruler
ingester:
  lifecycler:
    ring:
      replication_factor: 1
      kvstore:
        store: memberlist
distributor:
  ring:
    kvstore:
      store: memberlist
storage_config:
  boltdb_shipper:
    shared_store: s3
    active_index_directory: /tmp/loki/index
    cache_location: /tmp/loki/boltdb-cache
  boltdb:
    directory: /loki/index
  filesystem:
    directory: /loki/chunks
  index_cache_validity: 10m
  index_queries_cache_config:
    redis:
      endpoint: redis:6379
      timeout: 5s
      db: 1
limits_config:
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 720h
  ingestion_rate_mb: 16
  ingestion_burst_size_mb: 24
  max_entries_limit_per_query: 50000
  max_query_parallelism: 16
chunk_store_config:
  chunk_cache_config:
    redis:
      endpoint: redis:6379
      timeout: 5s
      db: 2
      pool_size: 1000
  write_dedupe_cache_config:
    redis:
      endpoint: redis:6379
      timeout: 5s
      db: 3
      pool_size: 1000
  cache_lookups_older_than: 1h
frontend:
  log_queries_longer_than: 15s
  compress_responses: true
querier:
  query_timeout: 2m
  query_ingesters_within: 2h
table_manager:
  index_tables_provisioning:
    enable_ondemand_throughput_mode: true
    enable_inactive_throughput_on_demand_mode: true
  retention_deletes_enabled: true
  retention_period: 2160h
query_range:
  split_queries_by_interval: 5m
  align_queries_with_step: true
  max_retries: 5
  parallelise_shardable_queries: true
  cache_results: true
  results_cache:
    cache:
      redis:
        endpoint: redis:6379
        timeout: 5s
        db: 0

I have left out the rest of the stack (a Redis instance and an HAProxy load balancer).
Loki won’t start, and here is the log output:

loki2_loki-write.2.ce7ikf37czsv@srv-swarm-worker1.infra.ginhoux.net    | level=warn ts=2021-12-27T19:57:08.174990297Z caller=tcp_transport.go:418 msg="TCPTransport: WriteTo failed" addr=172.18.0.40:7946 err="dial tcp 172.18.0.40:7946: i/o timeout"
loki2_loki-write.1.pms6holr5wq0@srv-swarm-worker3.infra.ginhoux.net    | level=warn ts=2021-12-27T19:57:12.382565809Z caller=tcp_transport.go:418 msg="TCPTransport: WriteTo failed" addr=172.18.0.40:7946 err="dial tcp 172.18.0.40:7946: i/o timeout"
loki2_loki-write.2.ce7ikf37czsv@srv-swarm-worker1.infra.ginhoux.net    | level=warn ts=2021-12-27T19:57:13.176059225Z caller=tcp_transport.go:418 msg="TCPTransport: WriteTo failed" addr=172.18.0.40:7946 err="dial tcp 172.18.0.40:7946: i/o timeout"

loki2_loki-write.1.pms6holr5wq0@srv-swarm-worker3.infra.ginhoux.net    | level=warn ts=2021-12-27T20:17:22.691953024Z caller=logging.go:72 traceID=7516d5e02133fe6a orgID=fake msg="POST /loki/api/v1/push (500) 164.919µs Response: \"at least 1 live replicas required, could only find 0\\n\" ws: false; Connection: close; Content-Length: 611; Content-Type: application/x-protobuf; User-Agent: promtail/; "
loki2_loki-write.1.pms6holr5wq0@srv-swarm-worker3.infra.ginhoux.net    | level=warn ts=2021-12-27T20:17:24.492068312Z caller=logging.go:72 traceID=48bae5efcb1f5f3b orgID=fake msg="POST /loki/api/v1/push (500) 137.403µs Response: \"at least 1 live replicas required, could only find 0\\n\" ws: false; Connection: close; Content-Length: 615; Content-Type: application/x-protobuf; User-Agent: promtail/; "
loki2_loki-read.1.5feukn6lq59a@srv-swarm-worker3.infra.ginhoux.net    | level=warn ts=2021-12-27T19:56:17.306481492Z caller=tcp_transport.go:418 msg="TCPTransport: WriteTo failed" addr=172.18.0.43:7946 err="dial tcp 172.18.0.43:7946: i/o timeout"
loki2_loki-read.1.5feukn6lq59a@srv-swarm-worker3.infra.ginhoux.net    | level=warn ts=2021-12-27T19:56:22.307666647Z caller=tcp_transport.go:418 msg="TCPTransport: WriteTo failed" addr=172.18.0.43:7946 err="dial tcp 172.18.0.43:7946: i/o timeout"
loki2_loki-read.1.5feukn6lq59a@srv-swarm-worker3.infra.ginhoux.net    | level=warn ts=2021-12-27T19:56:27.309023161Z caller=tcp_transport.go:418 msg="TCPTransport: WriteTo failed" addr=172.18.0.43:7946 err="dial tcp 172.18.0.43:7946: i/o timeout"

I have read the GitHub issues and the recently added example (loki/production/simple-scalable at main · grafana/loki · GitHub), which is validated on docker-compose (not Swarm), but I still don’t understand what I am doing wrong.

I think it is because I can’t set the IP address or interface name that the ring advertises, but I am not sure.
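
To make that guess concrete, the fragment below is the kind of setting I have in mind. It is an untested sketch, not a known fix: I am assuming that `instance_interface_names` is accepted under `common.ring` in 2.4.1, and `eth0` is only a placeholder for whichever interface is attached to the overlay network inside the container.

```yaml
common:
  ring:
    # Assumption: take the advertised address from a named interface
    # instead of setting instance_addr to 0.0.0.0.
    # "eth0" is a placeholder; the real interface name has to be checked
    # inside a running container.
    instance_interface_names:
      - eth0
    kvstore:
      store: memberlist
```

If this is the right direction, the same interface choice would presumably need to apply to both the loki-read and loki-write services, since they share this configuration file.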

I need your help, please.

Have a good day