Loki ingester stuck unhealthy and not getting auto-forgotten

Hi Team,
I am facing an issue where, when ingester servers in an autoscaling group are terminated, they go into the LEAVING state in the ring for about a minute, but then they turn UNHEALTHY and stay that way instead of being removed/forgotten. I have already set autoforget_unhealthy: true but still see the issue.
I want Loki to automatically forget the terminated node.

Log message for autoforget:
“autoforget have seen 1 unhealthy ingesters out of 2, network may be partioned, skip forgeting ingesters this round”
PS: I am using Consul for the KV store, and the server is removed from Consul when it is terminated.

Can you share your entire Loki configuration, please?

I use an ASG and auto-forget as well; I've not seen this problem before.

Hi @tonyswumac, here is the configuration. Please assume that I am replacing {{INSTANCE_IP}} with the local IP of the server.
Also, kindly suggest any changes I can make to resolve the issue or improve this setup.

auth_enabled: true

server:
  http_listen_port: 3100
  grpc_listen_port: 9095
  grpc_server_max_recv_msg_size: 50000000
  grpc_server_max_send_msg_size: 50000000

schema_config:
  configs:
    - from: 2025-02-01
      store: tsdb
      object_store: s3
      schema: v13
      index:
        prefix: loki-index/index_
        period: 24h
      chunks:
        prefix: loki-chunks/chunk_
        period: 24h

ingester:
  autoforget_unhealthy: true
  lifecycler:
    address: {{INSTANCE_IP}}
    ring:
      kvstore:
        store: consul
        consul:
          host: loki-write.services.in:8500
      replication_factor: 1
  wal:
    enabled: true
    dir: /var/lib/wal
  chunk_idle_period: 6h
  chunk_block_size: 5242880
  max_chunk_age: 12h
  chunk_encoding: zstd
  chunk_target_size: 20000000

common:
  compactor_address: loki-backend.services.in:9095
  path_prefix: /data/loki
  ring:
    kvstore:
      store: consul
      consul:
        host: loki-write.services.in:8500
    instance_addr: {{INSTANCE_IP}}
    replication_factor: 1

limits_config:
  max_line_size: 5MB
  retention_period: 168h
  max_query_lookback: 168h
  max_query_parallelism: 256
  max_query_length: 168h
  ingestion_rate_mb: 200
  ingestion_burst_size_mb: 500
  per_stream_rate_limit: 62428812
  per_stream_rate_limit_burst: 82428812

storage_config:
  aws:
    bucketnames: loki-logs
    region: ap-south-1
    s3forcepathstyle: true
  tsdb_shipper:
    active_index_directory: /data/loki/index
    cache_location: /data/loki/cache
    cache_ttl: 24h

compactor:
  working_directory: /tmp/loki/compactor
  compaction_interval: 5m

frontend:
  compress_responses: true

query_scheduler:
  use_scheduler_ring: true
  scheduler_ring:
    kvstore:
      store: consul
      consul:
        host: loki-write.services.in:8500
    instance_addr: {{INSTANCE_IP}}

Looking at the code, it looks like it will not auto-forget if the number of nodes being forgotten is the same as the number of nodes left. See loki/pkg/ingester/ingester.go at 837b70ac78fc1dc3e8d09f0966acb2c303dbbe35 · grafana/loki · GitHub

Try increasing your cluster size (or ingester count) to 3 and see if that helps.
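For anyone else reading this, here is a minimal sketch in Go of why 1 unhealthy out of 2 gets skipped. This is only a paraphrase of the guard described above, not the actual Loki source; the function name is made up for illustration.

package main

import "fmt"

// shouldForget paraphrases the auto-forget guard: unhealthy ingesters are only
// forgotten when they are strictly fewer than the healthy ones left; otherwise
// a network partition is assumed and the round is skipped.
func shouldForget(unhealthy, total int) bool {
	healthy := total - unhealthy
	if unhealthy >= healthy {
		fmt.Printf("skipping: %d unhealthy out of %d, possible partition\n", unhealthy, total)
		return false
	}
	return true
}

func main() {
	fmt.Println(shouldForget(1, 2)) // false: the 2-ingester case from this thread
	fmt.Println(shouldForget(1, 3)) // true: with 3 ingesters a single unhealthy one is forgotten
}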


Ohh… is there a way I can force it to remove unhealthy instances in all cases? I have one primary server and one in an autoscaling group for write, so when the autoscaling server goes down, 1 ingester becomes unhealthy and 1 stays healthy (the primary), which is likely what triggers this check. I want the autoscaling server to gracefully leave the ring whenever it terminates. Is there a way to achieve this?

Update: I just tried with 2 instances in the autoscaling group and terminated one, and it works as expected.

I want to have 1 server in the autoscaling group and set the replication factor to 2, but then when the autoscaling server gets replaced it goes unhealthy, the majority of ingesters are now unhealthy, and the whole cluster ends up waiting for a majority to become healthy again.

I am not aware of a way to work around that.

Personally we try not to scale down the ingesters. Unless you always hit the graceful shutdown API, there is a chance of losing logs in the WAL. It can be done, but we decided not to scale the ingesters for convenience. Also, when it comes to logging, ingester traffic usually doesn't differ that much by time of day (at least for us).
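As a rough sketch of what hitting that API from a termination hook could look like, assuming your Loki version exposes POST /ingester/shutdown (the flush, delete_ring_tokens, and terminate query parameters shown here are assumptions; check the HTTP API reference for your version):

package main

import (
	"fmt"
	"log"
	"net/http"
)

func main() {
	// Assumed to run on the instance being terminated, so localhost:3100 matches
	// the http_listen_port from the config above. Verify the query parameters
	// against the API docs for your Loki version before relying on this.
	url := "http://localhost:3100/ingester/shutdown?flush=true&delete_ring_tokens=true&terminate=true"
	resp, err := http.Post(url, "", nil)
	if err != nil {
		log.Fatalf("graceful shutdown request failed: %v", err)
	}
	defer resp.Body.Close()
	fmt.Println("ingester shutdown requested, status:", resp.Status)
}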

If you have the write and read paths separated, I would recommend keeping the number of ingesters static (or only allowing it to scale up). And if one big node is too much for your traffic, consider using two smaller nodes instead and always keep at least 2 (this also gives you some redundancy).