Mimir Ingesters failing on "no space left on device"

  • What Grafana version and what operating system are you using?
    Helm chart mimir-distributed with mimir tag 2.0.0 on EKS, K8s version 1.21

  • What are you trying to achieve?
    Trying to reinstall Mimir and get the ingesters running

  • How are you trying to achieve it?
    Using Helm to reinstall Mimir

  • What happened?
    All ingesters go into CrashLoopBackOff

  • What did you expect to happen?
    Ingesters should start and run successfully

  • Can you copy/paste the configuration(s) that you are having problems with?

serviceAccount:
  create: false
  name: s3-full

minio:
  enabled: false

store_gateway:
  replicas: 1
  sharding_ring:
    kvstore:
      store: memberlist

ingester:
  nodeSelector:
    role: metrics
  tolerations:
    - key: role
      operator: Equal
      value: metrics
      effect: NoSchedule

mimir:
  # -- Config file for Grafana Mimir, enables templates. Needs to be copied in full for modifications.
  config: |
    {{- if not .Values.enterprise.enabled -}}
    multitenancy_enabled: false
    {{- end }}
    limits:
      ingestion_rate: 40000
      max_global_series_per_user: 1000000
      max_global_series_per_metric: 0
    activity_tracker:
      filepath: /data/metrics-activity.log
    alertmanager:
      data_dir: '/data'
      enable_api: true
      external_url: '/alertmanager'
    alertmanager_storage:
      backend: s3
      s3:
        endpoint: s3.${region}.amazonaws.com
        bucket_name: ${alertmanager_bucket}
        region: ${region}
    frontend_worker:
      frontend_address: {{ template "mimir.fullname" . }}-query-frontend-headless.{{ .Release.Namespace }}.svc:{{ include "mimir.serverGrpcListenPort" . }}
    ruler:
      enable_api: true
      rule_path: '/data'
      alertmanager_url: dnssrvnoa+http://_http-metrics._tcp.{{ template "mimir.fullname" . }}-alertmanager-headless.{{ .Release.Namespace }}.svc.{{ .Values.global.clusterDomain }}/alertmanager
    server:
      grpc_server_max_recv_msg_size: 104857600
      grpc_server_max_send_msg_size: 104857600
      grpc_server_max_concurrent_streams: 1000
    frontend:
      log_queries_longer_than: 10s
      align_queries_with_step: true
    compactor:
      data_dir: "/data"
    ingester:
      ring:
        final_sleep: 0s
        num_tokens: 512
    ingester_client:
      grpc_client_config:
        max_recv_msg_size: 104857600
        max_send_msg_size: 104857600
    runtime_config:
      file: /var/{{ include "mimir.name" . }}/runtime.yaml
    memberlist:
      abort_if_cluster_join_fails: false
      compression_enabled: false
      join_members:
      - {{ include "mimir.fullname" . }}-gossip-ring
    # This configures how the store-gateway synchronizes blocks stored in the bucket. It uses Minio by default for getting started (configured via flags) but this should be changed for production deployments.
    blocks_storage:
      backend: s3
      tsdb:
        dir: /data/tsdb
        wal_compression_enabled: true
        retention_period: 4h
      bucket_store:
        sync_dir: /data/tsdb-sync
        {{- if .Values.memcached.enabled }}
        chunks_cache:
          backend: memcached
          memcached:
            addresses: dns+{{ .Release.Name }}-memcached.{{ .Release.Namespace }}.svc:11211
            max_item_size: {{ .Values.memcached.maxItemMemory }}
        {{- end }}
        {{- if index .Values "memcached-metadata" "enabled" }}
        metadata_cache:
          backend: memcached
          memcached:
            addresses: dns+{{ .Release.Name }}-memcached-metadata.{{ .Release.Namespace }}.svc:11211
            max_item_size: {{ (index .Values "memcached-metadata").maxItemMemory }}
        {{- end }}
        {{- if index .Values "memcached-queries" "enabled" }}
        index_cache:
          backend: memcached
          memcached:
            addresses: dns+{{ .Release.Name }}-memcached-queries.{{ .Release.Namespace }}.svc:11211
            max_item_size: {{ (index .Values "memcached-queries").maxItemMemory }}
        {{- end }}
      s3:
        endpoint: s3.${region}.amazonaws.com
        bucket_name: ${metrics_bucket}
        region: ${region}
    ruler_storage:
      backend: s3
      s3:
        endpoint: s3.${region}.amazonaws.com
        bucket_name: ${ruler_bucket}
        region: ${region}
    {{- if .Values.enterprise.enabled }}
    multitenancy_enabled: true
    admin_api:
      leader_election:
        enabled: true
        ring:
          kvstore:
            store: "memberlist"
    {{- if .Values.minio.enabled }}
    admin_client:
      storage:
        type: s3
        s3:
          endpoint: {{ .Release.Name }}-minio.{{ .Release.Namespace }}.svc:9000
          bucket_name: enterprise-metrics-admin
          access_key_id: {{ .Values.minio.accessKey }}
          secret_access_key: {{ .Values.minio.secretKey }}
          insecure: true
    {{- end }}
    auth:
      type: enterprise
    cluster_name: "{{ .Release.Name }}"
    license:
      path: "/license/license.jwt"
    {{- if .Values.gateway.useDefaultProxyURLs }}
    gateway:
      proxy:
        default:
          url: http://{{ template "mimir.fullname" . }}-admin-api.{{ .Release.Namespace }}.svc:{{ include "mimir.serverHttpListenPort" . }}
        admin_api:
          url: http://{{ template "mimir.fullname" . }}-admin-api.{{ .Release.Namespace }}.svc:{{ include "mimir.serverHttpListenPort" . }}
        alertmanager:
          url: http://{{ template "mimir.fullname" . }}-alertmanager.{{ .Release.Namespace }}.svc:{{ include "mimir.serverHttpListenPort" . }}
        compactor:
          url: http://{{ template "mimir.fullname" . }}-compactor.{{ .Release.Namespace }}.svc:{{ include "mimir.serverHttpListenPort" . }}
        distributor:
          url: http://{{ template "mimir.fullname" . }}-distributor.{{ .Release.Namespace }}.svc:{{ include "mimir.serverHttpListenPort" . }}
        ingester:
          url: http://{{ template "mimir.fullname" . }}-ingester.{{ .Release.Namespace }}.svc:{{ include "mimir.serverHttpListenPort" . }}
        query_frontend:
          url: http://{{ template "mimir.fullname" . }}-query-frontend.{{ .Release.Namespace }}.svc:{{ include "mimir.serverHttpListenPort" . }}
        ruler:
          url: http://{{ template "mimir.fullname" . }}-ruler.{{ .Release.Namespace }}.svc:{{ include "mimir.serverHttpListenPort" . }}
        store_gateway:
          url: http://{{ template "mimir.fullname" . }}-store-gateway.{{ .Release.Namespace }}.svc:{{ include "mimir.serverHttpListenPort" . }}
    {{- end }}
    instrumentation:
      enabled: true
      distributor_client:
        address: 'dns:///{{ template "mimir.fullname" . }}-distributor.{{ .Release.Namespace }}.svc:{{ include "mimir.serverGrpcListenPort" . }}'
    {{- end }}

  • Did you receive any errors in the Grafana UI or in related logs? If so, please tell us exactly what they were.
level=error ts=2022-05-31T19:08:57.640268904Z caller=ingester.go:1599 msg="unable to open TSDB" err="failed to open TSDB: /data/tsdb/anonymous: open /data/tsdb/anonymous/wal/00000282: no space left on device" user=anonymous
level=error ts=2022-05-31T19:08:57.640311897Z caller=ingester.go:1675 msg="error while opening existing TSDBs" err="unable to open TSDB for user anonymous: failed to open TSDB: /data/tsdb/anonymous: open /data/tsdb/anonymous/wal/00000282: no space left on device"
level=error ts=2022-05-31T19:08:57.64039023Z caller=mimir.go:471 msg="module failed" module=ingester-service err="invalid service state: Failed, expected: Running, failure: opening existing TSDBs: unable to open TSDB for user anonymous: failed to open TSDB: /data/tsdb/anonymous: open /data/tsdb/anonymous/wal/00000282: no space left on device"
  • Did you follow any online instructions? If so, what is the URL?
    Installed mimir-distributed using Helm

  • Notes
    I have tried reinstalling Mimir and limiting it to K8s nodes with large volumes, but I’m still getting no space left on device errors. I also tried blowing away the contents of my bucket but to no avail. I am using IAM profiles to authenticate to S3, which works fine. I’m not getting any errors about shipping logs to S3. I don’t understand how Mimir could still be running up against a “no space left on device” error when it’s on a fresh node and a fresh install.

The error "unable to open TSDB: /data/tsdb/anonymous: open /data/tsdb/anonymous/wal/00000282: no space left on device" refers to the PersistentVolume mounted at /data. Could you please confirm whether this volume still has free space?
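
A quick way to check is to run df inside one of the ingester pods (the namespace and pod name below are just examples, adjust them to your release; df is available in the Alpine-based Mimir images):

kubectl exec -n mimir mimir-ingester-0 -- df -h /data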

Regarding the question of how this can still happen after a reinstall: is it possible that the PersistentVolume mounted at /data was not deleted during the reinstall? In that case it could still be full of the data that was ingested before the reinstall.
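
You can check whether the claims from the previous install are still around, for example (the namespace is an assumption):

kubectl get pvc -n mimir
kubectl get pv | grep mimir

If the ingester PVCs predate the reinstall, the old WAL data is still sitting on them.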

Thank you so much @maurostettler! I didn't realize the persistent volume defaults in the main Helm chart are 2Gi, which is far too small. I've upsized them to 100Gi with the following values:

ingester:
  persistentVolume:
    size: 100Gi

compactor:
  persistentVolume:
    size: 100Gi

store_gateway:
  persistentVolume:
    size: 100Gi
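
For reference, I applied the change with a regular Helm upgrade; the release name, namespace, and values file below are placeholders for my setup:

helm upgrade mimir grafana/mimir-distributed -n mimir -f custom-values.yaml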

I had to manually delete the PVs and PVCs before the upgrade would go through. Helm doesn't clean up the PersistentVolumes or PersistentVolumeClaims on uninstall, which explains why a reinstall didn't make the error go away.
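
In case it helps anyone else, the cleanup was roughly the following; the namespace and PVC names are examples (the claims follow the usual <claim>-<statefulset>-<ordinal> pattern), so check what kubectl get pvc returns before deleting anything:

kubectl get pvc -n mimir
kubectl delete pvc -n mimir storage-mimir-ingester-0 storage-mimir-ingester-1 storage-mimir-ingester-2
# released PVs only need manual deletion if their reclaim policy is Retain
kubectl delete pv <pv-name>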

Thank you!