I currently have a downed pod that was created through a Helm install of Loki. The Helm install created a StatefulSet that manages the Loki pod instances, and the pod is stuck in CrashLoopBackOff ("back-off restarting failed container").
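For context, this is roughly how the chart was installed and how I have been inspecting the failing pod. The release name, namespace, and values file below match my setup, but the chart name is from memory, so treat it as approximate:

# Approximate install command (the chart name is from memory):
helm repo add grafana https://grafana.github.io/helm-charts
helm upgrade --install loki grafana/loki-stack -n monitoring -f loki-values.yaml

# How I have been inspecting the failing pod:
kubectl -n monitoring get pod loki-0
kubectl -n monitoring describe pod loki-0
kubectl -n monitoring logs loki-0 --previous

This is the log from the loki-0 pod: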
level=info ts=2022-07-05T19:10:11.52783126Z caller=main.go:94 msg="Starting Loki" version="(version=2.4.2, branch=HEAD, revision=525040a32)"
level=info ts=2022-07-05T19:10:11.528004697Z caller=modules.go:573 msg="RulerStorage is not configured in single binary mode and will not be started."
level=info ts=2022-07-05T19:10:11.528533942Z caller=server.go:260 http=[::]:3100 grpc=[::]:9095 msg="server listening on addresses"
level=warn ts=2022-07-05T19:10:11.52905735Z caller=experimental.go:19 msg="experimental feature in use" feature="In-memory (FIFO) cache"
level=info ts=2022-07-05T19:10:11.529660842Z caller=table_manager.go:241 msg="loading table index_19128"
level=error ts=2022-07-05T19:10:11.529969171Z caller=table.go:491 msg="failed to open file /data/loki/boltdb-shipper-active/index_19128/1652736600. Please fix or remove this file." err="file size too small"
unexpected fault address 0x7f9b138b7008
fatal error: fault
[signal SIGBUS: bus error code=0x2 addr=0x7f9b138b7008 pc=0x17d347e]
goroutine 1 [running]:
runtime.throw({0x223d9b7, 0x7f9b13b54468})
/usr/local/go/src/runtime/panic.go:1198 +0x71 fp=0xc00068bd70 sp=0xc00068bd40 pc=0x435851
runtime.sigpanic()
/usr/local/go/src/runtime/signal_unix.go:732 +0x125 fp=0xc00068bdc0 sp=0xc00068bd70 pc=0x44bc05
go.etcd.io/bbolt.(*Cursor).search(0xc00068bf08, {0x38d92b0, 0x5, 0x5}, 0xc00068bea0)
/src/loki/vendor/go.etcd.io/bbolt/cursor.go:249 +0x5e fp=0xc00068be58 sp=0xc00068bdc0 pc=0x17d347e
This is the loki-values.yaml file:
loki:
image:
pullPolicy: IfNotPresent
repository: grafana/loki
tag: 2.4.2
persistence:
enabled: true
size: 12Gi
storageClassName: do-block-storage
config:
ingester:
wal:
enabled: false # After upgrade to 2.4.1, we encountered a problem "mkdir wal: read-only file system". Tried creating /data/loki/wal directory manually but it still did not work.
compactor:
working_directory: /data/loki/retention
shared_store: filesystem
compaction_interval: 10m
retention_enabled: true
retention_delete_delay: 2h
retention_delete_worker_count: 150
limits_config:
retention_period: 336h # 2 weeks
schema_config:
configs:
- from: "2021-11-26"
store: boltdb-shipper
object_store: filesystem
schema: v11
index:
period: 24h
prefix: index_
storage_config:
boltdb:
directory: /data/loki/index
boltdb_shipper:
active_index_directory: /data/loki/boltdb-shipper-active
cache_location: /data/loki/boltdb-shipper-cache
cache_ttl: 24h
shared_store: filesystem
filesystem:
directory: /data/loki/chunks
tolerations:
- effect: NoExecute
key: environment
operator: Equal
value: prod
promtail:
tolerations:
- effect: NoExecute
key: environment
operator: Equal
value: prod
I tried to run a shell command to delete the affected directory /data/loki/boltdb-shipper-active/index_19128/ by adding command arguments to the pod's YAML, but I was not allowed to modify the pod's container spec; the edit was rejected with an error about a key name that could not be matched. I'm guessing this is because no spec with the keyword lifecycle was specified in the original StatefulSet for the supplied config to match.
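For reference, this is roughly the kind of edit I attempted. It is a reconstruction of the idea rather than the exact patch I applied, and the path of the Loki binary inside the image is my assumption:

# A reconstruction of what I tried: override the loki container's command so it
# removes the broken index directory before starting Loki. This assumes the image
# has a shell and that the Loki binary is at /usr/bin/loki; the change was rejected.
kubectl -n monitoring patch pod loki-0 --type strategic -p '
spec:
  containers:
    - name: loki
      command:
        - sh
        - -c
        - rm -rf /data/loki/boltdb-shipper-active/index_19128 && exec /usr/bin/loki -config.file=/etc/loki/loki.yaml
'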
This is the loki-0 pod YAML:
apiVersion: v1
kind: Pod
metadata:
name: loki-0
generateName: loki-
namespace: monitoring
uid: 30d23988-b124-4a5a-910b-d7b494ecfb34
resourceVersion: '176526416'
creationTimestamp: '2022-07-04T20:51:50Z'
labels:
app: loki
controller-revision-hash: loki-687c55fc5
name: loki
release: loki
statefulset.kubernetes.io/pod-name: loki-0
annotations:
checksum/config: f658e8a0ef515ab2e874b194df8f08c7fd5fc3e8f9f6128943b577fe5d503628
prometheus.io/port: http-metrics
prometheus.io/scrape: 'true'
ownerReferences:
- apiVersion: apps/v1
kind: StatefulSet
name: loki
uid: 729aeecb-2495-4e1a-b0a3-2a7cdfe5ebdd
controller: true
blockOwnerDeletion: true
managedFields:
- manager: kube-controller-manager
operation: Update
apiVersion: v1
time: '2022-07-04T20:51:50Z'
fieldsType: FieldsV1
fieldsV1:
f:metadata:
f:annotations:
.: {}
f:checksum/config: {}
f:prometheus.io/port: {}
f:prometheus.io/scrape: {}
f:generateName: {}
f:labels:
.: {}
f:app: {}
f:controller-revision-hash: {}
f:name: {}
f:release: {}
f:statefulset.kubernetes.io/pod-name: {}
f:ownerReferences:
.: {}
k:{"uid":"729aeecb-2495-4e1a-b0a3-2a7cdfe5ebdd"}:
.: {}
f:apiVersion: {}
f:blockOwnerDeletion: {}
f:controller: {}
f:kind: {}
f:name: {}
f:uid: {}
f:spec:
f:affinity: {}
f:containers:
k:{"name":"loki"}:
.: {}
f:args: {}
f:image: {}
f:imagePullPolicy: {}
f:livenessProbe:
.: {}
f:failureThreshold: {}
f:httpGet:
.: {}
f:path: {}
f:port: {}
f:scheme: {}
f:initialDelaySeconds: {}
f:periodSeconds: {}
f:successThreshold: {}
f:timeoutSeconds: {}
f:name: {}
f:ports:
.: {}
k:{"containerPort":3100,"protocol":"TCP"}:
.: {}
f:containerPort: {}
f:name: {}
f:protocol: {}
f:readinessProbe:
.: {}
f:failureThreshold: {}
f:httpGet:
.: {}
f:path: {}
f:port: {}
f:scheme: {}
f:initialDelaySeconds: {}
f:periodSeconds: {}
f:successThreshold: {}
f:timeoutSeconds: {}
f:resources: {}
f:securityContext:
.: {}
f:readOnlyRootFilesystem: {}
f:terminationMessagePath: {}
f:terminationMessagePolicy: {}
f:volumeMounts:
.: {}
k:{"mountPath":"/data"}:
.: {}
f:mountPath: {}
f:name: {}
k:{"mountPath":"/etc/loki"}:
.: {}
f:mountPath: {}
f:name: {}
f:dnsPolicy: {}
f:enableServiceLinks: {}
f:hostname: {}
f:restartPolicy: {}
f:schedulerName: {}
f:securityContext:
.: {}
f:fsGroup: {}
f:runAsGroup: {}
f:runAsNonRoot: {}
f:runAsUser: {}
f:serviceAccount: {}
f:serviceAccountName: {}
f:subdomain: {}
f:terminationGracePeriodSeconds: {}
f:tolerations: {}
f:volumes:
.: {}
k:{"name":"config"}:
.: {}
f:name: {}
f:secret:
.: {}
f:defaultMode: {}
f:secretName: {}
k:{"name":"storage"}:
.: {}
f:name: {}
f:persistentVolumeClaim:
.: {}
f:claimName: {}
- manager: kubelet
operation: Update
apiVersion: v1
time: '2022-07-04T20:52:20Z'
fieldsType: FieldsV1
fieldsV1:
f:status:
f:conditions:
k:{"type":"ContainersReady"}:
.: {}
f:lastProbeTime: {}
f:lastTransitionTime: {}
f:message: {}
f:reason: {}
f:status: {}
f:type: {}
k:{"type":"Initialized"}:
.: {}
f:lastProbeTime: {}
f:lastTransitionTime: {}
f:status: {}
f:type: {}
k:{"type":"Ready"}:
.: {}
f:lastProbeTime: {}
f:lastTransitionTime: {}
f:message: {}
f:reason: {}
f:status: {}
f:type: {}
f:containerStatuses: {}
f:hostIP: {}
f:phase: {}
f:podIP: {}
f:podIPs:
.: {}
k:{"ip":"10.244.1.222"}:
.: {}
f:ip: {}
f:startTime: {}
selfLink: /api/v1/namespaces/monitoring/pods/loki-0
status:
phase: Running
conditions:
- type: Initialized
status: 'True'
lastProbeTime: null
lastTransitionTime: '2022-07-04T20:51:50Z'
- type: Ready
status: 'False'
lastProbeTime: null
lastTransitionTime: '2022-07-04T20:51:50Z'
reason: ContainersNotReady
message: 'containers with unready status: [loki]'
- type: ContainersReady
status: 'False'
lastProbeTime: null
lastTransitionTime: '2022-07-04T20:51:50Z'
reason: ContainersNotReady
message: 'containers with unready status: [loki]'
- type: PodScheduled
status: 'True'
lastProbeTime: null
lastTransitionTime: '2022-07-04T20:51:50Z'
hostIP: 10.114.0.2
podIP: 10.244.1.222
podIPs:
- ip: 10.244.1.222
startTime: '2022-07-04T20:51:50Z'
containerStatuses:
- name: loki
state:
waiting:
reason: CrashLoopBackOff
message: >-
back-off 5m0s restarting failed container=loki
pod=loki-0_monitoring(30d23988-b124-4a5a-910b-d7b494ecfb34)
lastState:
terminated:
exitCode: 2
reason: Error
startedAt: '2022-07-05T19:15:19Z'
finishedAt: '2022-07-05T19:15:19Z'
containerID: >-
containerd://7545732da1d82ef04a34de13edbf3d512b256751d63fdb9b4889316bac09ffda
ready: false
restartCount: 267
image: docker.io/grafana/loki:2.4.2
imageID: >-
docker.io/grafana/loki@sha256:b3af8ead67d7e80fec05029f783784df897e92b6dba31fe4b33ab4ea3e989573
containerID: >-
containerd://7545732da1d82ef04a34de13edbf3d512b256751d63fdb9b4889316bac09ffda
started: false
qosClass: BestEffort
spec:
volumes:
- name: storage
persistentVolumeClaim:
claimName: storage-loki-0
- name: config
secret:
secretName: loki
defaultMode: 420
- name: kube-api-access-r2rz6
projected:
sources:
- serviceAccountToken:
expirationSeconds: 3607
path: token
- configMap:
name: kube-root-ca.crt
items:
- key: ca.crt
path: ca.crt
- downwardAPI:
items:
- path: namespace
fieldRef:
apiVersion: v1
fieldPath: metadata.namespace
defaultMode: 420
containers:
- name: loki
image: grafana/loki:2.4.2
args:
- '-config.file=/etc/loki/loki.yaml'
ports:
- name: http-metrics
containerPort: 3100
protocol: TCP
resources: {}
volumeMounts:
- name: config
mountPath: /etc/loki
- name: storage
mountPath: /data
- name: kube-api-access-r2rz6
readOnly: true
mountPath: /var/run/secrets/kubernetes.io/serviceaccount
livenessProbe:
httpGet:
path: /ready
port: http-metrics
scheme: HTTP
initialDelaySeconds: 45
timeoutSeconds: 1
periodSeconds: 10
successThreshold: 1
failureThreshold: 3
readinessProbe:
httpGet:
path: /ready
port: http-metrics
scheme: HTTP
initialDelaySeconds: 45
timeoutSeconds: 1
periodSeconds: 10
successThreshold: 1
failureThreshold: 3
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
imagePullPolicy: IfNotPresent
securityContext:
readOnlyRootFilesystem: true
restartPolicy: Always
terminationGracePeriodSeconds: 4800
dnsPolicy: ClusterFirst
serviceAccountName: loki
serviceAccount: loki
nodeName: dev2-us7j8
securityContext:
runAsUser: 10001
runAsGroup: 10001
runAsNonRoot: true
fsGroup: 10001
hostname: loki-0
subdomain: loki-headless
affinity: {}
schedulerName: default-scheduler
tolerations:
- key: environment
operator: Equal
value: prod
effect: NoExecute
- key: node.kubernetes.io/not-ready
operator: Exists
effect: NoExecute
tolerationSeconds: 300
- key: node.kubernetes.io/unreachable
operator: Exists
effect: NoExecute
tolerationSeconds: 300
priority: 0
enableServiceLinks: true
preemptionPolicy: PreemptLowerPriority
Will deleting the affected file or directory resolve the issue, or is the crash caused by something other than file corruption?
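If deleting it is the right fix, the cleanup I have in mind would look roughly like the sketch below: stop Loki, mount its PVC in a throwaway pod, delete the broken index directory, and bring Loki back up. The cleanup pod name and the busybox image are placeholders I picked, not something that exists in my cluster:

# Scale Loki down so nothing is writing to the volume.
kubectl -n monitoring scale statefulset loki --replicas=0

# Throwaway pod that mounts the same PVC (storage-loki-0) and removes the broken directory.
kubectl -n monitoring apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: loki-pvc-cleanup   # placeholder name
spec:
  restartPolicy: Never
  containers:
    - name: cleanup
      image: busybox:1.35  # any small image with a shell would do
      command: ["sh", "-c", "rm -rf /data/loki/boltdb-shipper-active/index_19128"]
      volumeMounts:
        - name: storage
          mountPath: /data
  volumes:
    - name: storage
      persistentVolumeClaim:
        claimName: storage-loki-0
EOF

# After the cleanup pod completes, remove it and scale Loki back up.
kubectl -n monitoring delete pod loki-pvc-cleanup
kubectl -n monitoring scale statefulset loki --replicas=1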
Disclaimer: I am quite new to Kubernetes and YAML configuration, but I am learning!