Hi,
My production environment runs on a Docker Swarm cluster.
I have been using Loki for a few months as a standalone service, scaled to 1 replica.
My objectives are simple:
- use the “split” mode introduced in 2.4.x
- create a real Loki stack
- increase the volume of logs routed to Loki
- be able to scale the Loki services up easily
Storage is provided by a powerful MinIO cluster.
My swarm stack is:
loki-read:
  image: grafana/loki:2.4.1
  volumes:
    - type: bind
      source: ./loki.yml
      target: /etc/loki/loki.yml
      read_only: true
  command: "-config.file=/etc/loki/loki.yml -target=read"
  environment:
    TZ: Europe/Paris
  networks:
    internal:
    log-public:
    monitoring-public:
  deploy:
    mode: replicated
    replicas: 2
    resources:
      limits:
        # cpus: "0.50"
        memory: 512M
    placement:
      constraints:
        - "node.role==worker"
    labels:
      prometheus_enable: "false"
      prometheus_scheme: http
      prometheus_port: 3100
      prometheus_path: /metrics
      traefik.enable: "false"
loki-write:
  image: grafana/loki:2.4.1
  volumes:
    - type: bind
      source: ./loki.yml
      target: /etc/loki/loki.yml
      read_only: true
  command: "-config.file=/etc/loki/loki.yml -target=write"
  environment:
    TZ: Europe/Paris
  networks:
    internal:
    log-public:
    monitoring-public:
  deploy:
    mode: replicated
    replicas: 2
    resources:
      limits:
        # cpus: "0.50"
        memory: 512M
    placement:
      constraints:
        - "node.role==worker"
    labels:
      prometheus_enable: "false"
      prometheus_scheme: http
      prometheus_port: 3100
      prometheus_path: /metrics
      traefik.enable: "false"
My Loki configuration is the same one I use for my current standalone stack; I only added the “memberlist” section:
---
auth_enabled: false
server:
  log_level: warn
  http_listen_address: 0.0.0.0
  http_listen_port: 3100
  grpc_listen_address: 0.0.0.0
  grpc_listen_port: 9095
memberlist:
  join_members:
    - loki-read:7946
    - loki-write:7946
  bind_addr:
    - 0.0.0.0
  bind_port: 7946
  randomize_node_name: false
schema_config:
  configs:
    - from: 2021-08-01
      store: boltdb-shipper
      object_store: s3
      schema: v11
      index:
        prefix: index_
        period: 24h
common:
  path_prefix: /loki
  replication_factor: 1
  storage:
    s3:
      endpoint: minio-gateway:9000
      insecure: true
      bucketnames: loki2-data
      access_key_id: loki
      secret_access_key: a3HSAm9mHQl67QVGH
      s3forcepathstyle: true
  ring:
    instance_addr: 0.0.0.0
    kvstore:
      store: memberlist
ruler:
  storage:
    s3:
      bucketnames: loki2-ruler
ingester:
  lifecycler:
    ring:
      replication_factor: 1
      kvstore:
        store: memberlist
distributor:
  ring:
    kvstore:
      store: memberlist
storage_config:
  boltdb_shipper:
    shared_store: s3
    active_index_directory: /tmp/loki/index
    cache_location: /tmp/loki/boltdb-cache
  boltdb:
    directory: /loki/index
  filesystem:
    directory: /loki/chunks
  index_cache_validity: 10m
  index_queries_cache_config:
    redis:
      endpoint: redis:6379
      timeout: 5s
      db: 1
limits_config:
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 720h
  ingestion_rate_mb: 16
  ingestion_burst_size_mb: 24
  max_entries_limit_per_query: 50000
  max_query_parallelism: 16
chunk_store_config:
  chunk_cache_config:
    redis:
      endpoint: redis:6379
      timeout: 5s
      db: 2
      pool_size: 1000
  write_dedupe_cache_config:
    redis:
      endpoint: redis:6379
      timeout: 5s
      db: 3
      pool_size: 1000
  cache_lookups_older_than: 1h
frontend:
  log_queries_longer_than: 15s
  compress_responses: true
querier:
  query_timeout: 2m
  query_ingesters_within: 2h
table_manager:
  index_tables_provisioning:
    enable_ondemand_throughput_mode: true
    enable_inactive_throughput_on_demand_mode: true
  retention_deletes_enabled: true
  retention_period: 2160h
query_range:
  split_queries_by_interval: 5m
  align_queries_with_step: true
  max_retries: 5
  parallelise_shardable_queries: true
  cache_results: true
  results_cache:
    cache:
      redis:
        endpoint: redis:6379
        timeout: 5s
        db: 0
I have not included the rest of this stack (a Redis instance and an HAProxy load balancer).
Loki won’t start, and here is the log output:
loki2_loki-write.2.ce7ikf37czsv@srv-swarm-worker1.infra.ginhoux.net | level=warn ts=2021-12-27T19:57:08.174990297Z caller=tcp_transport.go:418 msg="TCPTransport: WriteTo failed" addr=172.18.0.40:7946 err="dial tcp 172.18.0.40:7946: i/o timeout"
loki2_loki-write.1.pms6holr5wq0@srv-swarm-worker3.infra.ginhoux.net | level=warn ts=2021-12-27T19:57:12.382565809Z caller=tcp_transport.go:418 msg="TCPTransport: WriteTo failed" addr=172.18.0.40:7946 err="dial tcp 172.18.0.40:7946: i/o timeout"
loki2_loki-write.2.ce7ikf37czsv@srv-swarm-worker1.infra.ginhoux.net | level=warn ts=2021-12-27T19:57:13.176059225Z caller=tcp_transport.go:418 msg="TCPTransport: WriteTo failed" addr=172.18.0.40:7946 err="dial tcp 172.18.0.40:7946: i/o timeout"
loki2_loki-write.1.pms6holr5wq0@srv-swarm-worker3.infra.ginhoux.net | level=warn ts=2021-12-27T20:17:22.691953024Z caller=logging.go:72 traceID=7516d5e02133fe6a orgID=fake msg="POST /loki/api/v1/push (500) 164.919µs Response: \"at least 1 live replicas required, could only find 0\\n\" ws: false; Connection: close; Content-Length: 611; Content-Type: application/x-protobuf; User-Agent: promtail/; "
loki2_loki-write.1.pms6holr5wq0@srv-swarm-worker3.infra.ginhoux.net | level=warn ts=2021-12-27T20:17:24.492068312Z caller=logging.go:72 traceID=48bae5efcb1f5f3b orgID=fake msg="POST /loki/api/v1/push (500) 137.403µs Response: \"at least 1 live replicas required, could only find 0\\n\" ws: false; Connection: close; Content-Length: 615; Content-Type: application/x-protobuf; User-Agent: promtail/; "
loki2_loki-read.1.5feukn6lq59a@srv-swarm-worker3.infra.ginhoux.net | level=warn ts=2021-12-27T19:56:17.306481492Z caller=tcp_transport.go:418 msg="TCPTransport: WriteTo failed" addr=172.18.0.43:7946 err="dial tcp 172.18.0.43:7946: i/o timeout"
loki2_loki-read.1.5feukn6lq59a@srv-swarm-worker3.infra.ginhoux.net | level=warn ts=2021-12-27T19:56:22.307666647Z caller=tcp_transport.go:418 msg="TCPTransport: WriteTo failed" addr=172.18.0.43:7946 err="dial tcp 172.18.0.43:7946: i/o timeout"
loki2_loki-read.1.5feukn6lq59a@srv-swarm-worker3.infra.ginhoux.net | level=warn ts=2021-12-27T19:56:27.309023161Z caller=tcp_transport.go:418 msg="TCPTransport: WriteTo failed" addr=172.18.0.43:7946 err="dial tcp 172.18.0.43:7946: i/o timeout"
After reading the GitHub issues and the recently added example (production/simple-scalable in the grafana/loki repository on GitHub), which was validated on docker-compose (not Swarm), I still don’t understand the mistake.
I think it is because I can’t set the IP or interface name that the ring advertises, but I’m not sure.
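To make that guess concrete, here is a sketch of what I would try next. It is untested and rests on several assumptions: that the shared overlay network shows up as eth0 inside the containers (the interface name is hypothetical and may differ from one container to another), that the ring section accepts instance_interface_names in this Loki version, and that memberlist should join through Swarm's tasks.<service> DNS names, which resolve to the individual task IPs rather than to the service's virtual IP.

```yaml
memberlist:
  join_members:
    # Assumption: tasks.<service> resolves to each task's own IP,
    # whereas the plain service name resolves to the Swarm virtual IP.
    - tasks.loki-read:7946
    - tasks.loki-write:7946
  bind_port: 7946
common:
  ring:
    # Assumption: eth0 is the interface attached to the shared overlay network.
    # This would replace instance_addr: 0.0.0.0, which is not an address
    # that other instances can connect to.
    instance_interface_names:
      - eth0
    kvstore:
      store: memberlist
```

As far as I understand, this only changes the address registered in the ring. I have not confirmed whether it also changes the address that memberlist itself advertises for gossip on port 7946, which is where the timeouts in the log occur.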
I need your help, please.
Have a good day