Deleting a corrupted directory for boltdb (file too small)

I currently have a Loki pod that is down. It was created by a helm install of loki, which set up a StatefulSet that controls the loki pod instances. The pod is stuck in a "back-off restarting failed container" state (CrashLoopBackOff). This is the log from the loki-0 pod:

level=info ts=2022-07-05T19:10:11.52783126Z caller=main.go:94 msg="Starting Loki" version="(version=2.4.2, branch=HEAD, revision=525040a32)"
level=info ts=2022-07-05T19:10:11.528004697Z caller=modules.go:573 msg="RulerStorage is not configured in single binary mode and will not be started."
level=info ts=2022-07-05T19:10:11.528533942Z caller=server.go:260 http=[::]:3100 grpc=[::]:9095 msg="server listening on addresses"
level=warn ts=2022-07-05T19:10:11.52905735Z caller=experimental.go:19 msg="experimental feature in use" feature="In-memory (FIFO) cache"
level=info ts=2022-07-05T19:10:11.529660842Z caller=table_manager.go:241 msg="loading table index_19128"
level=error ts=2022-07-05T19:10:11.529969171Z caller=table.go:491 msg="failed to open file /data/loki/boltdb-shipper-active/index_19128/1652736600. Please fix or remove this file." err="file size too small"
unexpected fault address 0x7f9b138b7008
fatal error: fault
[signal SIGBUS: bus error code=0x2 addr=0x7f9b138b7008 pc=0x17d347e]

goroutine 1 [running]:
runtime.throw({0x223d9b7, 0x7f9b13b54468})
	/usr/local/go/src/runtime/panic.go:1198 +0x71 fp=0xc00068bd70 sp=0xc00068bd40 pc=0x435851
runtime.sigpanic()
	/usr/local/go/src/runtime/signal_unix.go:732 +0x125 fp=0xc00068bdc0 sp=0xc00068bd70 pc=0x44bc05
go.etcd.io/bbolt.(*Cursor).search(0xc00068bf08, {0x38d92b0, 0x5, 0x5}, 0xc00068bea0)
	/src/loki/vendor/go.etcd.io/bbolt/cursor.go:249 +0x5e fp=0xc00068be58 sp=0xc00068bdc0 pc=0x17d347e

This is the loki-values.yaml file:

loki:
  image:
    pullPolicy: IfNotPresent
    repository: grafana/loki
    tag: 2.4.2
  persistence:
    enabled: true
    size: 12Gi
    storageClassName: do-block-storage
  config:
    ingester:
      wal:
        enabled: false # After upgrade to 2.4.1, we encountered a problem "mkdir wal: read-only file system".  Tried creating /data/loki/wal directory manually but it still did not work.
    compactor:
      working_directory: /data/loki/retention
      shared_store: filesystem
      compaction_interval: 10m
      retention_enabled: true
      retention_delete_delay: 2h
      retention_delete_worker_count: 150
    limits_config:
      retention_period: 336h # 2 weeks
    schema_config:
      configs:
      - from: "2021-11-26"
        store: boltdb-shipper
        object_store: filesystem
        schema: v11
        index:
          period: 24h
          prefix: index_
    storage_config:
      boltdb:
        directory: /data/loki/index
      boltdb_shipper:
        active_index_directory: /data/loki/boltdb-shipper-active
        cache_location: /data/loki/boltdb-shipper-cache
        cache_ttl: 24h
        shared_store: filesystem
      filesystem:
        directory: /data/loki/chunks
  tolerations:
  - effect: NoExecute
    key: environment
    operator: Equal
    value: prod
promtail:
  tolerations:
  - effect: NoExecute
    key: environment
    operator: Equal
    value: prod

I tried to run a bash command to delete the affected directory /data/loki/boltdb-shipper-active/index_19128/ by adding command arguments to the pod's YAML configuration. However, I was not allowed to modify the pod's container spec; the change was rejected with an error. The error was related to a key name that couldn't be matched. I'm guessing this is because no lifecycle key is specified in the original StatefulSet spec to match the config I supplied.

loki-0 pod yaml:

apiVersion: v1
kind: Pod
metadata:
  name: loki-0
  generateName: loki-
  namespace: monitoring
  uid: 30d23988-b124-4a5a-910b-d7b494ecfb34
  resourceVersion: '176526416'
  creationTimestamp: '2022-07-04T20:51:50Z'
  labels:
    app: loki
    controller-revision-hash: loki-687c55fc5
    name: loki
    release: loki
    statefulset.kubernetes.io/pod-name: loki-0
  annotations:
    checksum/config: f658e8a0ef515ab2e874b194df8f08c7fd5fc3e8f9f6128943b577fe5d503628
    prometheus.io/port: http-metrics
    prometheus.io/scrape: 'true'
  ownerReferences:
    - apiVersion: apps/v1
      kind: StatefulSet
      name: loki
      uid: 729aeecb-2495-4e1a-b0a3-2a7cdfe5ebdd
      controller: true
      blockOwnerDeletion: true
  managedFields:
    - manager: kube-controller-manager
      operation: Update
      apiVersion: v1
      time: '2022-07-04T20:51:50Z'
      fieldsType: FieldsV1
      fieldsV1:
        f:metadata:
          f:annotations:
            .: {}
            f:checksum/config: {}
            f:prometheus.io/port: {}
            f:prometheus.io/scrape: {}
          f:generateName: {}
          f:labels:
            .: {}
            f:app: {}
            f:controller-revision-hash: {}
            f:name: {}
            f:release: {}
            f:statefulset.kubernetes.io/pod-name: {}
          f:ownerReferences:
            .: {}
            k:{"uid":"729aeecb-2495-4e1a-b0a3-2a7cdfe5ebdd"}:
              .: {}
              f:apiVersion: {}
              f:blockOwnerDeletion: {}
              f:controller: {}
              f:kind: {}
              f:name: {}
              f:uid: {}
        f:spec:
          f:affinity: {}
          f:containers:
            k:{"name":"loki"}:
              .: {}
              f:args: {}
              f:image: {}
              f:imagePullPolicy: {}
              f:livenessProbe:
                .: {}
                f:failureThreshold: {}
                f:httpGet:
                  .: {}
                  f:path: {}
                  f:port: {}
                  f:scheme: {}
                f:initialDelaySeconds: {}
                f:periodSeconds: {}
                f:successThreshold: {}
                f:timeoutSeconds: {}
              f:name: {}
              f:ports:
                .: {}
                k:{"containerPort":3100,"protocol":"TCP"}:
                  .: {}
                  f:containerPort: {}
                  f:name: {}
                  f:protocol: {}
              f:readinessProbe:
                .: {}
                f:failureThreshold: {}
                f:httpGet:
                  .: {}
                  f:path: {}
                  f:port: {}
                  f:scheme: {}
                f:initialDelaySeconds: {}
                f:periodSeconds: {}
                f:successThreshold: {}
                f:timeoutSeconds: {}
              f:resources: {}
              f:securityContext:
                .: {}
                f:readOnlyRootFilesystem: {}
              f:terminationMessagePath: {}
              f:terminationMessagePolicy: {}
              f:volumeMounts:
                .: {}
                k:{"mountPath":"/data"}:
                  .: {}
                  f:mountPath: {}
                  f:name: {}
                k:{"mountPath":"/etc/loki"}:
                  .: {}
                  f:mountPath: {}
                  f:name: {}
          f:dnsPolicy: {}
          f:enableServiceLinks: {}
          f:hostname: {}
          f:restartPolicy: {}
          f:schedulerName: {}
          f:securityContext:
            .: {}
            f:fsGroup: {}
            f:runAsGroup: {}
            f:runAsNonRoot: {}
            f:runAsUser: {}
          f:serviceAccount: {}
          f:serviceAccountName: {}
          f:subdomain: {}
          f:terminationGracePeriodSeconds: {}
          f:tolerations: {}
          f:volumes:
            .: {}
            k:{"name":"config"}:
              .: {}
              f:name: {}
              f:secret:
                .: {}
                f:defaultMode: {}
                f:secretName: {}
            k:{"name":"storage"}:
              .: {}
              f:name: {}
              f:persistentVolumeClaim:
                .: {}
                f:claimName: {}
    - manager: kubelet
      operation: Update
      apiVersion: v1
      time: '2022-07-04T20:52:20Z'
      fieldsType: FieldsV1
      fieldsV1:
        f:status:
          f:conditions:
            k:{"type":"ContainersReady"}:
              .: {}
              f:lastProbeTime: {}
              f:lastTransitionTime: {}
              f:message: {}
              f:reason: {}
              f:status: {}
              f:type: {}
            k:{"type":"Initialized"}:
              .: {}
              f:lastProbeTime: {}
              f:lastTransitionTime: {}
              f:status: {}
              f:type: {}
            k:{"type":"Ready"}:
              .: {}
              f:lastProbeTime: {}
              f:lastTransitionTime: {}
              f:message: {}
              f:reason: {}
              f:status: {}
              f:type: {}
          f:containerStatuses: {}
          f:hostIP: {}
          f:phase: {}
          f:podIP: {}
          f:podIPs:
            .: {}
            k:{"ip":"10.244.1.222"}:
              .: {}
              f:ip: {}
          f:startTime: {}
  selfLink: /api/v1/namespaces/monitoring/pods/loki-0
status:
  phase: Running
  conditions:
    - type: Initialized
      status: 'True'
      lastProbeTime: null
      lastTransitionTime: '2022-07-04T20:51:50Z'
    - type: Ready
      status: 'False'
      lastProbeTime: null
      lastTransitionTime: '2022-07-04T20:51:50Z'
      reason: ContainersNotReady
      message: 'containers with unready status: [loki]'
    - type: ContainersReady
      status: 'False'
      lastProbeTime: null
      lastTransitionTime: '2022-07-04T20:51:50Z'
      reason: ContainersNotReady
      message: 'containers with unready status: [loki]'
    - type: PodScheduled
      status: 'True'
      lastProbeTime: null
      lastTransitionTime: '2022-07-04T20:51:50Z'
  hostIP: 10.114.0.2
  podIP: 10.244.1.222
  podIPs:
    - ip: 10.244.1.222
  startTime: '2022-07-04T20:51:50Z'
  containerStatuses:
    - name: loki
      state:
        waiting:
          reason: CrashLoopBackOff
          message: >-
            back-off 5m0s restarting failed container=loki
            pod=loki-0_monitoring(30d23988-b124-4a5a-910b-d7b494ecfb34)
      lastState:
        terminated:
          exitCode: 2
          reason: Error
          startedAt: '2022-07-05T19:15:19Z'
          finishedAt: '2022-07-05T19:15:19Z'
          containerID: >-
            containerd://7545732da1d82ef04a34de13edbf3d512b256751d63fdb9b4889316bac09ffda
      ready: false
      restartCount: 267
      image: docker.io/grafana/loki:2.4.2
      imageID: >-
        docker.io/grafana/loki@sha256:b3af8ead67d7e80fec05029f783784df897e92b6dba31fe4b33ab4ea3e989573
      containerID: >-
        containerd://7545732da1d82ef04a34de13edbf3d512b256751d63fdb9b4889316bac09ffda
      started: false
  qosClass: BestEffort
spec:
  volumes:
    - name: storage
      persistentVolumeClaim:
        claimName: storage-loki-0
    - name: config
      secret:
        secretName: loki
        defaultMode: 420
    - name: kube-api-access-r2rz6
      projected:
        sources:
          - serviceAccountToken:
              expirationSeconds: 3607
              path: token
          - configMap:
              name: kube-root-ca.crt
              items:
                - key: ca.crt
                  path: ca.crt
          - downwardAPI:
              items:
                - path: namespace
                  fieldRef:
                    apiVersion: v1
                    fieldPath: metadata.namespace
        defaultMode: 420
  containers:
    - name: loki
      image: grafana/loki:2.4.2
      args:
        - '-config.file=/etc/loki/loki.yaml'
      ports:
        - name: http-metrics
          containerPort: 3100
          protocol: TCP
      resources: {}
      volumeMounts:
        - name: config
          mountPath: /etc/loki
        - name: storage
          mountPath: /data
        - name: kube-api-access-r2rz6
          readOnly: true
          mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      livenessProbe:
        httpGet:
          path: /ready
          port: http-metrics
          scheme: HTTP
        initialDelaySeconds: 45
        timeoutSeconds: 1
        periodSeconds: 10
        successThreshold: 1
        failureThreshold: 3
      readinessProbe:
        httpGet:
          path: /ready
          port: http-metrics
          scheme: HTTP
        initialDelaySeconds: 45
        timeoutSeconds: 1
        periodSeconds: 10
        successThreshold: 1
        failureThreshold: 3
      terminationMessagePath: /dev/termination-log
      terminationMessagePolicy: File
      imagePullPolicy: IfNotPresent
      securityContext:
        readOnlyRootFilesystem: true
  restartPolicy: Always
  terminationGracePeriodSeconds: 4800
  dnsPolicy: ClusterFirst
  serviceAccountName: loki
  serviceAccount: loki
  nodeName: dev2-us7j8
  securityContext:
    runAsUser: 10001
    runAsGroup: 10001
    runAsNonRoot: true
    fsGroup: 10001
  hostname: loki-0
  subdomain: loki-headless
  affinity: {}
  schedulerName: default-scheduler
  tolerations:
    - key: environment
      operator: Equal
      value: prod
      effect: NoExecute
    - key: node.kubernetes.io/not-ready
      operator: Exists
      effect: NoExecute
      tolerationSeconds: 300
    - key: node.kubernetes.io/unreachable
      operator: Exists
      effect: NoExecute
      tolerationSeconds: 300
  priority: 0
  enableServiceLinks: true
  preemptionPolicy: PreemptLowerPriority

Will deleting the affected file or directory resolve the issue, or is the problem unrelated to corruption?

Disclaimer: I am quite new to Kubernetes and YAML configuration, but I am learning!

Bumping this up. Is there an alternative way to delete a directory on the persistent volume if the pod has failed?
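
For anyone landing here with the same question, one approach that may work is to mount the same PersistentVolumeClaim in a short-lived helper pod and clean up from there. The manifest below is only a sketch under assumptions: the pod name loki-volume-debug and the busybox image are hypothetical and not from this thread, while the claim name, namespace, user/group IDs and toleration are copied from the loki-0 pod yaml above. Since the volume is block storage and probably cannot be attached to two nodes at once, it also assumes Loki has been stopped first.

```yaml
# Sketch only: a throwaway pod that mounts the Loki data volume.
# "loki-volume-debug" and "busybox:1.36" are assumptions, not from the thread.
# Assumes the StatefulSet was scaled down beforehand, for example:
#   kubectl scale statefulset loki -n monitoring --replicas=0
apiVersion: v1
kind: Pod
metadata:
  name: loki-volume-debug
  namespace: monitoring
spec:
  restartPolicy: Never
  # Same IDs as the loki-0 pod, so files on the volume stay accessible.
  securityContext:
    runAsUser: 10001
    runAsGroup: 10001
    fsGroup: 10001
  # Same custom toleration as the loki-0 pod, in case the nodes are tainted.
  tolerations:
    - key: environment
      operator: Equal
      value: prod
      effect: NoExecute
  containers:
    - name: shell
      image: busybox:1.36
      # Keep the container alive for an hour so it can be exec'd into.
      command: ["sleep", "3600"]
      volumeMounts:
        - name: storage
          mountPath: /data
  volumes:
    - name: storage
      persistentVolumeClaim:
        claimName: storage-loki-0
```

After applying it with kubectl apply -f, something like kubectl exec -it loki-volume-debug -n monitoring -- sh should give a shell where the file named in the error (/data/loki/boltdb-shipper-active/index_19128/1652736600) can be inspected and then moved aside or removed. Afterwards, delete the helper pod and scale the StatefulSet back to 1 replica. The log message itself says "Please fix or remove this file", but whether removing it fully clears the crash is not confirmed in this thread.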
