Hi Team,
I am facing an issue where an ingester server in an autoscaling group, when terminated, goes into the LEAVING state in the ring for about a minute, but after that it goes UNHEALTHY and stays unhealthy instead of being removed/forgotten. I have already enabled autoforget_unhealthy: true but am still facing the issue.
I want Loki to automatically forget the terminated node.
Log message for autoforget:
“autoforget have seen 1 unhealthy ingesters out of 2, network may be partioned, skip forgeting ingesters this round”
PS: I am using Consul for the KV store, and the server is getting removed from Consul when it is terminated.
Can you share your entire Loki configuration, please?
I use ASG and auto forget as well; I’ve not seen this problem before.
Hi @tonyswumac, here is the configuration. Please assume that I am replacing {{INSTANCE_IP}} with the local IP of the server.
Also, kindly suggest any changes I can implement to resolve the issue or improve this setup.
auth_enabled: true

server:
  http_listen_port: 3100
  grpc_listen_port: 9095
  grpc_server_max_recv_msg_size: 50000000
  grpc_server_max_send_msg_size: 50000000

schema_config:
  configs:
    - from: 2025-02-01
      store: tsdb
      object_store: s3
      schema: v13
      index:
        prefix: loki-index/index_
        period: 24h
      chunks:
        prefix: loki-chunks/chunk_
        period: 24h

ingester:
  autoforget_unhealthy: true
  lifecycler:
    address: {{INSTANCE_IP}}
    ring:
      kvstore:
        store: consul
        consul:
          host: loki-write.services.in:8500
      replication_factor: 1
  wal:
    enabled: true
    dir: /var/lib/wal
  chunk_idle_period: 6h
  chunk_block_size: 5242880
  max_chunk_age: 12h
  chunk_encoding: zstd
  chunk_target_size: 20000000

common:
  compactor_address: loki-backend.services.in:9095
  path_prefix: /data/loki
  ring:
    kvstore:
      store: consul
      consul:
        host: loki-write.services.in:8500
    instance_addr: {{INSTANCE_IP}}
  replication_factor: 1

limits_config:
  max_line_size: 5MB
  retention_period: 168h
  max_query_lookback: 168h
  max_query_parallelism: 256
  max_query_length: 168h
  ingestion_rate_mb: 200
  ingestion_burst_size_mb: 500
  per_stream_rate_limit: 62428812
  per_stream_rate_limit_burst: 82428812

storage_config:
  aws:
    bucketnames: loki-logs
    region: ap-south-1
    s3forcepathstyle: true
  tsdb_shipper:
    active_index_directory: /data/loki/index
    cache_location: /data/loki/cache
    cache_ttl: 24h

compactor:
  working_directory: /tmp/loki/compactor
  compaction_interval: 5m

frontend:
  compress_responses: true

query_scheduler:
  use_scheduler_ring: true
  scheduler_ring:
    kvstore:
      store: consul
      consul:
        host: loki-write.services.in:8500
    instance_addr: {{INSTANCE_IP}}
Looking at the code, it looks like it will not auto forget if the number of nodes being forgotten is the same as the number of nodes left. See loki/pkg/ingester/ingester.go at 837b70ac78fc1dc3e8d09f0966acb2c303dbbe35 · grafana/loki · GitHub
Try increasing your cluster size (number of ingesters) to 3 and see if that helps.
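Roughly, the safety check behaves like the sketch below. This is a simplified illustration based on the behaviour described above, not the actual code from ingester.go; the function name is made up.

package main

import "fmt"

// Simplified sketch of the autoforget safety check: Loki counts the unhealthy
// ingesters in the ring and skips forgetting them if that would leave no clear
// healthy majority (it then logs "skip forgeting ingesters this round").
func shouldSkipAutoforget(unhealthy, total int) bool {
	healthy := total - unhealthy
	// 1 unhealthy out of 2 means unhealthy == healthy, so Loki assumes a
	// possible network partition and does nothing this round.
	return unhealthy >= healthy
}

func main() {
	fmt.Println(shouldSkipAutoforget(1, 2)) // true: skipped, the ingester stays UNHEALTHY
	fmt.Println(shouldSkipAutoforget(1, 3)) // false: the unhealthy ingester gets forgotten
}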
Ohh… is there a way I can force it to remove unhealthy instances in all cases? I have one primary server and one in autoscaling for write, so when the autoscaling server goes down, 1 ingester becomes unhealthy and 1 stays healthy (the primary), which might be causing this issue. I want the autoscaling server to gracefully leave the ring whenever it terminates. Is there a way I can achieve this?
Update: I just tried with 2 servers in the autoscaling group and terminated one, and it is working as expected.
I want to have 1 server in autoscaling and then set the replication factor to 2. But in that case, when the autoscaling server gets replaced it goes unhealthy, the majority of ingesters are now unhealthy, and the whole cluster is thrown into a waiting state until a majority is healthy again.
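As far as I understand, the ring write quorum is floor(RF/2)+1, so with RF=2 a single unhealthy ingester out of 2 is already enough to block writes. A rough sketch of that arithmetic (not Loki code, just the formula):

package main

import "fmt"

// Rough illustration of the ring write quorum (floor(RF/2)+1): with RF=2 the
// quorum is 2, so one unhealthy ingester out of two blocks writes until it
// becomes healthy again.
func writeQuorum(replicationFactor int) int {
	return replicationFactor/2 + 1
}

func main() {
	for _, rf := range []int{1, 2, 3} {
		quorum := writeQuorum(rf)
		fmt.Printf("RF=%d: writes need %d healthy replicas, tolerate %d failure(s)\n",
			rf, quorum, rf-quorum)
	}
}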
I am not aware of a way to work around that.
Personally, we try not to scale down the ingesters. Unless you always hit the graceful shutdown API, there is a chance of losing logs in the WAL. It can be done, but we decided not to scale the ingesters, for convenience. Also, when it comes to logging, most of the time the ingester traffic won’t differ that much based on the time of day (at least for us).
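If you do decide to scale down, you would want a termination hook that hits that API before the instance goes away. A minimal sketch, assuming the /ingester/shutdown endpoint; the exact path and query parameters vary between Loki versions, so check the HTTP API reference for the release you run.

package main

import (
	"fmt"
	"log"
	"net/http"
	"time"
)

// Sketch of what an ASG termination hook (or systemd ExecStop) could run on
// the instance before it is terminated: ask the local ingester to flush its
// data and leave the ring.
func main() {
	client := &http.Client{Timeout: 10 * time.Minute} // flushing the WAL can take a while

	url := "http://localhost:3100/ingester/shutdown?flush=true&delete_ring_tokens=true"
	resp, err := client.Post(url, "application/json", nil)
	if err != nil {
		log.Fatalf("graceful shutdown request failed: %v", err)
	}
	defer resp.Body.Close()

	fmt.Println("ingester shutdown returned:", resp.Status)
}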
If you have the write and read paths separated, I would recommend just keeping the number of ingesters static (or only allowing it to scale up). And if one big node is too much for your traffic, consider using two smaller nodes instead and always keep at least 2 (this also gives you some redundancy).