Mimir Ingesters failing on "no space left on device"

  • What Grafana version and what operating system are you using?
    Helm chart mimir-distributed with mimir tag 2.0.0 on EKS, K8s version 1.21

  • What are you trying to achieve?
    Trying to reinstall Mimir and get the ingesters running

  • How are you trying to achieve it?
    Using Helm to reinstall Mimir

  • What happened?
    All ingesters go into CrashLoopBackOff

  • What did you expect to happen?
    Ingesters should start and run successfully

  • Can you copy/paste the configuration(s) that you are having problems with?

serviceAccount:
  create: false
  name: s3-full

minio:
  enabled: false

store_gateway:
  replicas: 1
  sharding_ring:
    kvstore:
      store: memberlist

ingester:
  nodeSelector:
    role: metrics
  tolerations:
    - key: role
      operator: Equal
      value: metrics
      effect: NoSchedule

mimir:
  # -- Config file for Grafana Mimir, enables templates. Needs to be copied in full for modifications.
  config: |
    {{- if not .Values.enterprise.enabled -}}
    multitenancy_enabled: false
    {{- end }}
    limits:
      ingestion_rate: 40000
      max_global_series_per_user: 1000000
      max_global_series_per_metric: 0
    activity_tracker:
      filepath: /data/metrics-activity.log
    alertmanager:
      data_dir: '/data'
      enable_api: true
      external_url: '/alertmanager'
    alertmanager_storage:
      backend: s3
      s3:
        endpoint: s3.${region}.amazonaws.com
        bucket_name: ${alertmanager_bucket}
        region: ${region}
    frontend_worker:
      frontend_address: {{ template "mimir.fullname" . }}-query-frontend-headless.{{ .Release.Namespace }}.svc:{{ include "mimir.serverGrpcListenPort" . }}
    ruler:
      enable_api: true
      rule_path: '/data'
      alertmanager_url: dnssrvnoa+http://_http-metrics._tcp.{{ template "mimir.fullname" . }}-alertmanager-headless.{{ .Release.Namespace }}.svc.{{ .Values.global.clusterDomain }}/alertmanager
    server:
      grpc_server_max_recv_msg_size: 104857600
      grpc_server_max_send_msg_size: 104857600
      grpc_server_max_concurrent_streams: 1000
    frontend:
      log_queries_longer_than: 10s
      align_queries_with_step: true
    compactor:
      data_dir: "/data"
    ingester:
      ring:
        final_sleep: 0s
        num_tokens: 512
    ingester_client:
      grpc_client_config:
        max_recv_msg_size: 104857600
        max_send_msg_size: 104857600
    runtime_config:
      file: /var/{{ include "mimir.name" . }}/runtime.yaml
    memberlist:
      abort_if_cluster_join_fails: false
      compression_enabled: false
      join_members:
      - {{ include "mimir.fullname" . }}-gossip-ring
    # This configures how the store-gateway synchronizes blocks stored in the bucket. It uses Minio by default for getting started (configured via flags) but this should be changed for production deployments.
    blocks_storage:
      backend: s3
      tsdb:
        dir: /data/tsdb
        wal_compression_enabled: true
        retention_period: 4h
      bucket_store:
        sync_dir: /data/tsdb-sync
        {{- if .Values.memcached.enabled }}
        chunks_cache:
          backend: memcached
          memcached:
            addresses: dns+{{ .Release.Name }}-memcached.{{ .Release.Namespace }}.svc:11211
            max_item_size: {{ .Values.memcached.maxItemMemory }}
        {{- end }}
        {{- if index .Values "memcached-metadata" "enabled" }}
        metadata_cache:
          backend: memcached
          memcached:
            addresses: dns+{{ .Release.Name }}-memcached-metadata.{{ .Release.Namespace }}.svc:11211
            max_item_size: {{ (index .Values "memcached-metadata").maxItemMemory }}
        {{- end }}
        {{- if index .Values "memcached-queries" "enabled" }}
        index_cache:
          backend: memcached
          memcached:
            addresses: dns+{{ .Release.Name }}-memcached-queries.{{ .Release.Namespace }}.svc:11211
            max_item_size: {{ (index .Values "memcached-queries").maxItemMemory }}
        {{- end }}
      s3:
        endpoint: s3.${region}.amazonaws.com
        bucket_name: ${metrics_bucket}
        region: ${region}
    ruler_storage:
      backend: s3
      s3:
        endpoint: s3.${region}.amazonaws.com
        bucket_name: ${ruler_bucket}
        region: ${region}
    {{- if .Values.enterprise.enabled }}
    multitenancy_enabled: true
    admin_api:
      leader_election:
        enabled: true
        ring:
          kvstore:
            store: "memberlist"
    {{- if .Values.minio.enabled }}
    admin_client:
      storage:
        type: s3
        s3:
          endpoint: {{ .Release.Name }}-minio.{{ .Release.Namespace }}.svc:9000
          bucket_name: enterprise-metrics-admin
          access_key_id: {{ .Values.minio.accessKey }}
          secret_access_key: {{ .Values.minio.secretKey }}
          insecure: true
    {{- end }}
    auth:
      type: enterprise
    cluster_name: "{{ .Release.Name }}"
    license:
      path: "/license/license.jwt"
    {{- if .Values.gateway.useDefaultProxyURLs }}
    gateway:
      proxy:
        default:
          url: http://{{ template "mimir.fullname" . }}-admin-api.{{ .Release.Namespace }}.svc:{{ include "mimir.serverHttpListenPort" . }}
        admin_api:
          url: http://{{ template "mimir.fullname" . }}-admin-api.{{ .Release.Namespace }}.svc:{{ include "mimir.serverHttpListenPort" . }}
        alertmanager:
          url: http://{{ template "mimir.fullname" . }}-alertmanager.{{ .Release.Namespace }}.svc:{{ include "mimir.serverHttpListenPort" . }}
        compactor:
          url: http://{{ template "mimir.fullname" . }}-compactor.{{ .Release.Namespace }}.svc:{{ include "mimir.serverHttpListenPort" . }}
        distributor:
          url: http://{{ template "mimir.fullname" . }}-distributor.{{ .Release.Namespace }}.svc:{{ include "mimir.serverHttpListenPort" . }}
        ingester:
          url: http://{{ template "mimir.fullname" . }}-ingester.{{ .Release.Namespace }}.svc:{{ include "mimir.serverHttpListenPort" . }}
        query_frontend:
          url: http://{{ template "mimir.fullname" . }}-query-frontend.{{ .Release.Namespace }}.svc:{{ include "mimir.serverHttpListenPort" . }}
        ruler:
          url: http://{{ template "mimir.fullname" . }}-ruler.{{ .Release.Namespace }}.svc:{{ include "mimir.serverHttpListenPort" . }}
        store_gateway:
          url: http://{{ template "mimir.fullname" . }}-store-gateway.{{ .Release.Namespace }}.svc:{{ include "mimir.serverHttpListenPort" . }}
    {{- end }}
    instrumentation:
      enabled: true
      distributor_client:
        address: 'dns:///{{ template "mimir.fullname" . }}-distributor.{{ .Release.Namespace }}.svc:{{ include "mimir.serverGrpcListenPort" . }}'
    {{- end }}

  • Did you receive any errors in the Grafana UI or in related logs? If so, please tell us exactly what they were.
level=error ts=2022-05-31T19:08:57.640268904Z caller=ingester.go:1599 msg="unable to open TSDB" err="failed to open TSDB: /data/tsdb/anonymous: open /data/tsdb/anonymous/wal/00000282: no space left on device" user=anonymous
level=error ts=2022-05-31T19:08:57.640311897Z caller=ingester.go:1675 msg="error while opening existing TSDBs" err="unable to open TSDB for user anonymous: failed to open TSDB: /data/tsdb/anonymous: open /data/tsdb/anonymous/wal/00000282: no space left on device"
level=error ts=2022-05-31T19:08:57.64039023Z caller=mimir.go:471 msg="module failed" module=ingester-service err="invalid service state: Failed, expected: Running, failure: opening existing TSDBs: unable to open TSDB for user anonymous: failed to open TSDB: /data/tsdb/anonymous: open /data/tsdb/anonymous/wal/00000282: no space left on device"
  • Did you follow any online instructions? If so, what is the URL?
    Installed mimir-distributed using Helm

  • Notes
    I have tried reinstalling Mimir and limiting it to K8s nodes with large volumes, but I’m still getting no space left on device errors. I also tried blowing away the contents of my bucket but to no avail. I am using IAM profiles to authenticate to S3, which works fine. I’m not getting any errors about shipping logs to S3. I don’t understand how Mimir could still be running up against a “no space left on device” error when it’s on a fresh node and a fresh install.

The error "unable to open TSDB: /data/tsdb/anonymous: open /data/tsdb/anonymous/wal/00000282: no space left on device" refers to the PersistentVolume mounted at /data. Could you please confirm whether this volume still has free space?
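
A quick way to check is to run df inside one of the ingester pods (the namespace and pod name below are just examples, adjust them to your release; df is available in the Alpine-based Mimir images):

kubectl exec -n mimir mimir-ingester-0 -- df -h /data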

Regarding the question of how this can still happen after a reinstall: is it possible that the PersistentVolume mounted at /data was not deleted during the reinstall? In that case it could still be full of the data that was ingested before the reinstall.
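
You can check whether the claims from the previous install are still around, for example (the namespace is an assumption):

kubectl get pvc -n mimir
kubectl get pv | grep mimir

If the ingester PVCs predate the reinstall, the old WAL data is still sitting on them.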

Thank you so much @maurostettler! I didn't realize the persistent volume defaults in the main Helm chart are 2Gi, which is far too small. I've upsized them to 100Gi with the following values:

ingester:
  persistentVolume:
    size: 100Gi

compactor:
  persistentVolume:
    size: 100Gi

store_gateway:
  persistentVolume:
    size: 100Gi
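
For reference, I applied the change with a regular Helm upgrade; the release name, namespace, and values file below are placeholders for my setup:

helm upgrade mimir grafana/mimir-distributed -n mimir -f custom-values.yaml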

I had to manually delete the PVs and PVCs before the upgrade would go through. Helm doesn't clean up the PersistentVolumes or PersistentVolumeClaims on uninstall, which explains why a reinstall didn't make the error go away.
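
In case it helps anyone else, the cleanup was roughly the following; the namespace and PVC names are examples (the claims follow the usual <claim>-<statefulset>-<ordinal> pattern), so check what kubectl get pvc returns before deleting anything:

kubectl get pvc -n mimir
kubectl delete pvc -n mimir storage-mimir-ingester-0 storage-mimir-ingester-1 storage-mimir-ingester-2
# released PVs only need manual deletion if their reclaim policy is Retain
kubectl delete pv <pv-name>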

Thank you!