Hi! We’re facing an issue with loki-write and want to know whether the ‘autoforget_unhealthy’ flag could help us. A few days ago we ran into problems in our Kubernetes cluster when we lost a worker node that was running a loki-write pod. The errors indicated that the lost instance was still in the ring, and the only way to resolve it was to open ‘/ring’ and manually forget it.
While researching the Loki documentation, we came across the ‘autoforget_unhealthy’ flag, and we have some questions about its recommended usage, especially in the context of a StatefulSet installation of loki-write.
We ran some tests with this flag in our dev environment, stopping the loki-write container directly on the EC2 instance. The instance was successfully removed from the ring, and when the container came back up it rejoined the ring, so everything worked fine.
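For reference, this is roughly how we enabled it in dev, via the chart’s loki.structuredConfig passthrough (the key name comes from the Loki ingester configuration reference; as far as we can tell the equivalent CLI flag is -ingester.autoforget-unhealthy, which could go into write.extraArgs instead):

loki:
  structuredConfig:
    ingester:
      autoforget_unhealthy: true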
Do you have any advice? Any recommendation would be greatly appreciated!
We have the following Loki config:
# Configuration for the write pod(s)
write:
  # -- Number of replicas for the write
  replicas: 5
  autoscaling:
    # -- Enable autoscaling for the write.
    enabled: true
    # -- Minimum autoscaling replicas for the write.
    minReplicas: 3
    # -- Maximum autoscaling replicas for the write.
    maxReplicas: 8
    # -- Target CPU utilisation percentage for the write.
    targetCPUUtilizationPercentage: 80
    # -- Target memory utilization percentage for the write.
    targetMemoryUtilizationPercentage: 75
    # -- Behavior policies while scaling.
    behavior:
      # -- see https://github.com/grafana/loki/blob/main/docs/sources/operations/storage/wal.md#how-to-scale-updown for scaledown details
      scaleUp:
        policies:
          - type: Pods
            value: 1
            periodSeconds: 60
      scaleDown:
        policies:
          - type: Pods
            value: 1
            periodSeconds: 60
        stabilizationWindowSeconds: 300
  image:
    # -- The Docker registry for the write image. Overrides `loki.image.registry`
    registry: null
    # -- Docker image repository for the write image. Overrides `loki.image.repository`
    repository: null
    # -- Docker image tag for the write image. Overrides `loki.image.tag`
    tag: null
  # -- The name of the PriorityClass for write pods
  priorityClassName: null
  # -- Annotations for write StatefulSet
  annotations: {}
  # -- Annotations for write pods
  podAnnotations: {}
  # -- Additional labels for each `write` pod
  podLabels: {}
  # -- Additional selector labels for each `write` pod
  selectorLabels: {}
  # -- Labels for ingester service
  serviceLabels: {}
  # -- Comma-separated list of Loki modules to load for the write
  targetModule: "write"
  # -- Additional CLI args for the write
  extraArgs: []
  # -- Environment variables to add to the write pods
  extraEnv: []
  # -- Environment variables from secrets or configmaps to add to the write pods
  extraEnvFrom: []
  # -- Lifecycle for the write container
  lifecycle: {}
  # -- The default /flush_shutdown preStop hook is recommended as part of the ingester
  # scaledown process so it's added to the template by default when autoscaling is enabled,
  # but it's disabled to optimize rolling restarts in instances that will never be scaled
  # down or when using chunks storage with WAL disabled.
  # https://github.com/grafana/loki/blob/main/docs/sources/operations/storage/wal.md#how-to-scale-updown
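  # For context, this is a sketch of the preStop hook we believe the chart
  # templates in when autoscaling is enabled (the /flush_shutdown path comes
  # from the note above; the endpoint prefix and port name are our assumption):
  # lifecycle:
  #   preStop:
  #     httpGet:
  #       path: /ingester/flush_shutdown
  #       port: http-metrics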
  # -- Init containers to add to the write pods
  initContainers: []
  # -- Volume mounts to add to the write pods
  extraVolumeMounts: []
  # -- Volumes to add to the write pods
  extraVolumes: []
  # -- volumeClaimTemplates to add to StatefulSet
  extraVolumeClaimTemplates: []
  # -- Resource requests and limits for the write
  resources:
    limits:
      cpu: 2
      memory: 8Gi
    requests:
      cpu: 700m
      memory: 6Gi
  # -- Grace period to allow the write to shutdown before it is killed. Especially for the ingester,
  # this must be increased. It must be long enough so writes can be gracefully shutdown flushing/transferring
  # all data and to successfully leave the member ring on shutdown.
  terminationGracePeriodSeconds: 300
  # -- Affinity for write pods. Passed through `tpl` and, thus, to be configured as string
  # @default -- Hard node and soft zone anti-affinity
  affinity: |
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              {{- include "loki.writeSelectorLabels" . | nindent 10 }}
          topologyKey: kubernetes.io/hostname
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: node-type.nebula.despegar.com/nbl-platform
                operator: Exists
  # -- Node selector for write pods
  nodeSelector: {}
  # -- Tolerations for write pods
  tolerations:
    - key: node-type.nebula.despegar.com/nbl-platform
      operator: Exists
      effect: NoSchedule
  # -- The default is to deploy all pods in parallel.
  podManagementPolicy: "Parallel"
  persistence:
    # -- Enable StatefulSetAutoDeletePVC feature
    enableStatefulSetAutoDeletePVC: false
    # -- Size of persistent disk
    size: 200Gi
    # -- Storage class to be used.
    # If defined, storageClassName: <storageClass>.
    # If set to "-", storageClassName: "", which disables dynamic provisioning.
    # If empty or set to null, no storageClassName spec is
    # set, choosing the default provisioner (gp2 on AWS, standard on GKE, AWS, and OpenStack).
    storageClass: gp3
    # -- Selector for persistent disk
    selector: null