Hi Team,
I am facing an issue where, when an ingester server in the autoscaling group is terminated, it goes into the LEAVING state in the ring for about a minute, but then it becomes UNHEALTHY and stays unhealthy instead of being removed/forgotten. I have already set autoforget_unhealthy: true but am still facing the issue.
I want Loki to automatically forget the terminated node.
Log message for autoforget:
“autoforget have seen 1 unhealthy ingesters out of 2, network may be partioned, skip forgeting ingesters this round”
PS: I am using Consul for the KV store, and the server is removed from Consul when terminated.
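For context, this is a minimal sketch of the ring settings involved, assuming a Consul KV store at a hypothetical address consul.service:8500 (adjust host, port, and replication factor to your environment):

```yaml
ingester:
  autoforget_unhealthy: true      # forget ingesters that stay unhealthy in the ring
  lifecycler:
    ring:
      replication_factor: 1       # assumption for this sketch; see discussion below
      kvstore:
        store: consul
        consul:
          host: consul.service:8500   # hypothetical Consul address
```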
Hi @tonyswumac, here is the configuration; please assume that I am replacing {{INSTANCE_IP}} with the local IP of the server.
Also, kindly suggest any changes I can make to resolve the issue or improve this setup.
Ohh… is there a way I can force it to remove unhealthy instances in all cases? I have one primary server and one in autoscaling for the write path, so when the autoscaling server goes down, one ingester is unhealthy and one is healthy (the primary), which might be causing this issue. I want the autoscaling server to gracefully leave the ring whenever it terminates. Is there a way I can achieve this?
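One option (a sketch under assumptions, not a definitive setup): have the instance call Loki's POST /ingester/shutdown endpoint before it terminates, e.g. from an ASG lifecycle hook or a systemd ExecStop. The LOKI_ADDR default and the DRY_RUN guard are illustrative assumptions:

```shell
#!/bin/sh
# Sketch: gracefully remove this ingester from the ring before the
# instance terminates (e.g. run from an ASG termination lifecycle hook).
LOKI_ADDR="${LOKI_ADDR:-http://localhost:3100}"   # assumed local Loki address
SHUTDOWN_URL="${LOKI_ADDR}/ingester/shutdown"

if [ "${DRY_RUN:-1}" = "1" ]; then
  # Dry run: only show what would be called.
  echo "would POST ${SHUTDOWN_URL}"
else
  # Flushes in-memory data and unregisters the ingester from the ring.
  curl -s -X POST "${SHUTDOWN_URL}"
fi
```

With this in place the instance leaves the ring cleanly instead of lingering as UNHEALTHY, so autoforget never has to kick in for a planned termination.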
Update: I just tried with 2 in the autoscaling group and terminated one; it works as expected.
I want to have 1 server in the autoscaling group with a replication factor of 2, but when the autoscaling server gets replaced it goes unhealthy, the majority of ingesters are then unhealthy, and the whole cluster goes into a waiting state until a majority is healthy again.
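That matches the ring's quorum arithmetic. A quick illustration of why RF=2 across 2 ingesters tolerates no failures (the plain quorum formula, not Loki's actual code):

```python
def quorum(replication_factor: int) -> int:
    """Writes need floor(RF/2) + 1 healthy replicas to succeed."""
    return replication_factor // 2 + 1

# With RF=2 the quorum is 2, so one unhealthy ingester out of 2 is
# already a lost majority and writes stall until it recovers.
print(quorum(2))  # 2 -> zero failures tolerated with 2 ingesters
print(quorum(3))  # 2 -> one failure tolerated with 3 ingesters
```

This is why odd replica counts (RF=3 across at least 3 ingesters) are the usual way to survive the loss of a single node.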
Personally, we try not to scale down the ingesters. Unless you always hit the graceful-shutdown API, there is a chance of losing logs from the WAL. It can be done, but we decided not to scale the ingesters for convenience. Also, when it comes to logging, ingester traffic usually doesn't differ that much by time of day (at least for us).
If you have the write and read paths separated, I would recommend keeping the number of ingesters static (or only allowing it to scale up). And if one big node is too much for your traffic, consider using two smaller nodes instead and always keep at least 2 (this also gives you some redundancy).