I have Loki giving me this warning:
level=warn ts=2023-10-12T17:24:00.469737712Z caller=logging.go:123
msg="POST /loki/api/v1/push (500) 859.661µs
Response: \"at least 1 live replicas required, could only find 0
- unhealthy instances: 10.254.0.86:65300\\n\"
X-Internal-Remote-Address: 22.214.171.124; "
I have three instances running. Notice the unhealthy instances: 10.254.0.86 part. That IP was used by a previous instance of Loki, but that instance has since been terminated. Why are the current instances of Loki still looking for old instances? And more importantly, how can I tell them to stop looking for the old instance and only look at the current instances?
If I visit the /ready endpoint, I see the message:
Ingester not ready: instance 10.254.0.56:63552 past heartbeat timeout
And in the logs I see this entry:
level=warn ts=2023-10-12T18:23:29.021351537Z caller=lifecycler.go:291
msg="found an existing instance(s) with a problem in the ring, this
instance cannot become ready until this problem is resolved.
The /ring http endpoint on the distributor (or single binary)
provides visibility into the ring."
err="instance 10.254.0.56:63552 past heartbeat timeout"
If I go to the /distributor/ring endpoint, I can see the “Ring Status” page. I click the “Forget” button on all the instances, but after a refresh they come back, and the logs still complain about unhealthy instances that no longer exist.
Thanks to this issue I found the ingester.autoforget_unhealthy: true configuration parameter. After I enabled it, these entries appeared in the logs:
level=info ts=2023-10-12T18:45:09.898482094Z caller=ingester.go:390 msg="autoforget removed ingester old-loki-instance-001 from the ring because it was not healthy after 1m0s"
level=info ts=2023-10-12T18:45:09.898511469Z caller=ingester.go:390 msg="autoforget removed ingester old-loki-instance-002 from the ring because it was not healthy after 1m0s"
level=info ts=2023-10-12T18:45:09.898519995Z caller=ingester.go:390 msg="autoforget removed ingester old-loki-instance-003 from the ring because it was not healthy after 1m0s"
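For reference, the setting behind the autoforget entries above goes in the ingester block of the Loki config file. This is only a minimal sketch of that fragment, assuming a YAML config with every other setting left as it was. The comment describes the behaviour as I understand it from the log messages, not a full account of what the option does:

```yaml
# Fragment of the Loki configuration file; all other settings omitted.
ingester:
  # Assumption based on the log output: ring members that stay unhealthy
  # past the heartbeat timeout (1m0s in the entries above) are removed
  # from the ring automatically, without clicking "Forget" by hand.
  autoforget_unhealthy: true
```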
The frustrating thing is that none of those showed up at the
Is there another/better way to see and manually “forget” instances that go unhealthy and never return?