Logs disappearing

I’m deploying Loki (distributed mode) along with Tempo, Grafana, Prometheus, Promtail, etc. using Helm.

I think I picked the right options for deploying the loki-distributed chart, but something is obviously not working properly. I can see logs in Grafana, but after about an hour they disappear. Up until this morning, Grafana would just show no results; now I am also seeing this error in the Grafana UI and in my querier pod:

level=error ts=2022-01-31T21:37:05.641170853Z caller=batch.go:699 msg="error fetching chunks" err="open /var/loki/chunks/ZmFrZS9lYmIzMWQ1NzU5ZmNhNGYzOjE3ZWIxZWUxODEzOjE3ZWIxZWUxODE0OjkyZDhlMTNm: no such file or directory"
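
Incidentally, that file name looks like a base64-encoded chunk key (the fake/ prefix being the default tenant when auth_enabled is false). Decoding it:

    echo 'ZmFrZS9lYmIzMWQ1NzU5ZmNhNGYzOjE3ZWIxZWUxODEzOjE3ZWIxZWUxODE0OjkyZDhlMTNm' | base64 -d
    # fake/ebb31d5759fca4f3:17eb1ee1813:17eb1ee1814:92d8e13f

So the querier is looking for a specific chunk on its local filesystem and not finding it.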

I exec’d into the querier pods and checked /var/loki/chunks; it is empty in all 3 of them (I’m running a 6-node cluster).
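
For reference, the check I ran on each querier looked roughly like this (exact pod names approximated):

    kubectl exec -it obs-loki-querier-0 -n obs -- ls -la /var/loki/chunks
    kubectl exec -it obs-loki-querier-1 -n obs -- ls -la /var/loki/chunks
    kubectl exec -it obs-loki-querier-2 -n obs -- ls -la /var/loki/chunks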

The documentation for this chart leaves a lot to be desired, so I was guessing at a lot of these values; I just enabled persistence wherever it was an option. I don’t know if that made sense, because the default was, strangely, false everywhere.

What I am attempting to do is keep logs for 14 days (336h), hence my limits_config and compactor settings below.

I have an umbrella chart (a chart of charts) for this; these are the Loki settings in my values.yaml:

loki:
  loki:
    config: |
      auth_enabled: false

      server:
        http_listen_port: 3100

      distributor:
        ring:
          kvstore:
            store: memberlist

      memberlist:
        join_members:
          - {{ include "loki.fullname" . }}-memberlist

      ingester:
        lifecycler:
          ring:
            kvstore:
              store: memberlist
            replication_factor: 1
        chunk_idle_period: 30m
        chunk_block_size: 262144
        chunk_encoding: snappy
        chunk_retain_period: 1m
        max_transfer_retries: 0
        wal:
          dir: /var/loki/wal

      limits_config:
        enforce_metric_name: false
        reject_old_samples: true
        reject_old_samples_max_age: 168h
        max_cache_freshness_per_query: 10m
        retention_period: 336h

      {{- if .Values.loki.schemaConfig}}
      schema_config:
      {{- toYaml .Values.loki.schemaConfig | nindent 2}}
      {{- end}}
      storage_config:
        boltdb_shipper:
          active_index_directory: /var/loki/index
          cache_location: /var/loki/cache
          cache_ttl: 168h
          shared_store: filesystem
          index_gateway_client:
            server_address: dns:///obs-loki-index-gateway:9095
        filesystem:
          directory: /var/loki/chunks

      chunk_store_config:
        max_look_back_period: 0s

      table_manager:
        retention_deletes_enabled: false
        retention_period: 0s

      query_range:
        align_queries_with_step: true
        max_retries: 5
        split_queries_by_interval: 15m
        cache_results: true
        results_cache:
          cache:
            enable_fifocache: true
            fifocache:
              max_size_items: 1024
              validity: 24h

      frontend_worker:
        frontend_address: {{ include "loki.queryFrontendFullname" . }}:9095

      frontend:
        log_queries_longer_than: 5s
        compress_responses: true
        tail_proxy_url: http://{{ include "loki.querierFullname" . }}:3100

      compactor:
        working_directory: /data/retention
        shared_store: filesystem
        compaction_interval: 10m
        retention_enabled: true
        retention_delete_delay: 2h
        retention_delete_worker_count: 150

      ruler:
        storage:
          type: local
          local:
            directory: /etc/loki/rules
        ring:
          kvstore:
            store: memberlist
        rule_path: /tmp/loki/scratch
        alertmanager_url: https://alertmanager.xx
        external_url: https://alertmanager.xx

  ingester:
    replicas: 3
    persistence:
      # -- Enable creating PVCs which is required when using boltdb-shipper
      enabled: true
      # -- Size of persistent disk
      size: 10Gi
      # -- Storage class to be used.
      # If defined, storageClassName: <storageClass>.
      # If set to "-", storageClassName: "", which disables dynamic provisioning.
      # If empty or set to null, no storageClassName spec is
      # set, choosing the default provisioner (gp2 on AWS, standard on GKE, AWS, and OpenStack).
      storageClass: null
  
  distributor:
    replicas: 3
  
  querier:
    replicas: 3
    persistence:
      # -- Enable creating PVCs for the querier cache
      enabled: true
      # -- Size of persistent disk
      size: 10Gi
      # -- Storage class to be used.
      # If defined, storageClassName: <storageClass>.
      # If set to "-", storageClassName: "", which disables dynamic provisioning.
      # If empty or set to null, no storageClassName spec is
      # set, choosing the default provisioner (gp2 on AWS, standard on GKE, AWS, and OpenStack).
      storageClass: null
    extraVolumes:
    - name: bolt-db
      emptyDir: {}
    extraVolumeMounts:
    - name: bolt-db
      mountPath: /var/loki

  ruler:
    enabled: false
    replicas: 1

  indexGateway:
    enabled: true
    replicas: 3
    persistence:
      # -- Enable creating PVCs which is required when using boltdb-shipper
      enabled: true
      # -- Size of persistent disk
      size: 10Gi
      # -- Storage class to be used.
      # If defined, storageClassName: <storageClass>.
      # If set to "-", storageClassName: "", which disables dynamic provisioning.
      # If empty or set to null, no storageClassName spec is
      # set, choosing the default provisioner (gp2 on AWS, standard on GKE, AWS, and OpenStack).
      storageClass: null

  queryFrontend:
    replicas: 3

  gateway:
    replicas: 3

  compactor:
    enabled: true
    persistence:
      # -- Enable creating PVCs for the compactor
      enabled: true
      # -- Size of persistent disk
      size: 10Gi
      # -- Storage class to be used.
      # If defined, storageClassName: <storageClass>.
      # If set to "-", storageClassName: "", which disables dynamic provisioning.
      # If empty or set to null, no storageClassName spec is
      # set, choosing the default provisioner (gp2 on AWS, standard on GKE, AWS, and OpenStack).
      storageClass: null
    serviceAccount:
      create: true
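
For completeness, the umbrella chart is installed roughly like this (release name taken from the obs-loki-* service names; the chart path is just illustrative):

    helm upgrade --install obs ./observability -n obs -f values.yaml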

This is the Loki ConfigMap as deployed on the cluster:

apiVersion: v1
data:
  config.yaml: |
    auth_enabled: false

    server:
      http_listen_port: 3100

    distributor:
      ring:
        kvstore:
          store: memberlist

    memberlist:
      join_members:
        - obs-loki-memberlist

    ingester:
      lifecycler:
        ring:
          kvstore:
            store: memberlist
          replication_factor: 1
      chunk_idle_period: 30m
      chunk_block_size: 262144
      chunk_encoding: snappy
      chunk_retain_period: 1m
      max_transfer_retries: 0
      wal:
        dir: /var/loki/wal

    limits_config:
      enforce_metric_name: false
      reject_old_samples: true
      reject_old_samples_max_age: 168h
      max_cache_freshness_per_query: 10m
      retention_period: 336h
    schema_config:
      configs:
      - from: "2020-09-07"
        index:
          period: 24h
          prefix: loki_index_
        object_store: filesystem
        schema: v11
        store: boltdb-shipper
    storage_config:
      boltdb_shipper:
        active_index_directory: /var/loki/index
        cache_location: /var/loki/cache
        cache_ttl: 168h
        shared_store: filesystem
        index_gateway_client:
          server_address: dns:///obs-loki-index-gateway:9095
      filesystem:
        directory: /var/loki/chunks

    chunk_store_config:
      max_look_back_period: 0s

    table_manager:
      retention_deletes_enabled: false
      retention_period: 0s

    query_range:
      align_queries_with_step: true
      max_retries: 5
      split_queries_by_interval: 15m
      cache_results: true
      results_cache:
        cache:
          enable_fifocache: true
          fifocache:
            max_size_items: 1024
            validity: 24h

    frontend_worker:
      frontend_address: obs-loki-query-frontend:9095

    frontend:
      log_queries_longer_than: 5s
      compress_responses: true
      tail_proxy_url: http://obs-loki-querier:3100

    compactor:
      working_directory: /data/retention
      shared_store: filesystem
      compaction_interval: 10m
      retention_enabled: true
      retention_delete_delay: 2h
      retention_delete_worker_count: 150

    ruler:
      storage:
        type: local
        local:
          directory: /etc/loki/rules
      ring:
        kvstore:
          store: memberlist
      rule_path: /tmp/loki/scratch
      alertmanager_url: https://alertmanager.xx
      external_url: https://alertmanager.xx

This part doesn’t quite seem right, but I am not sure what the correct value should be:

    compactor:
      working_directory: /data/retention

That said, the retention period is 336h (14 days), so I don’t think the compactor is to blame for these missing logs.
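
If the compactor volume is actually mounted somewhere other than /data (I have not verified where the chart mounts it), I assume working_directory needs to move under the persisted path instead, e.g. something like:

    compactor:
      working_directory: /var/loki/compactor
      shared_store: filesystem
      retention_enabled: true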

Why are my logs disappearing? Do I need to run the ruler? My traces are also disappearing in Tempo but I don’t know if that’s related…

Update 1:
I tried changing my queriers to not use persistent volumes, which I think just means they keep an in-memory cache instead of an on-disk cache. That doesn’t seem to have helped so far.
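
Concretely, the change was just this in the querier section of my values (everything else unchanged):

    querier:
      replicas: 3
      persistence:
        enabled: false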

Update 2:
I just tested a log search over the last 12 hours for a particular namespace, {namespace="obs"}, which should include the logs from the various Loki components. I get an error in Grafana that says:

Query error
open /var/loki/chunks/ZmFrZS9hMWVjYThlZDA0OGZkM2NjOjE3ZWIzMGE0ODVmOjE3ZWIzNzgyNTYyOjg2MDI4Nzgw: no such file or directory

So I checked my ingester pods; I have 3 running:

obs-loki-ingester-0
obs-loki-ingester-1
obs-loki-ingester-2

I then ran kubectl exec -it obs-loki-ingester-1 -n obs -- ls -la /var/loki/chunks/ZmFrZS9hMWVjYThlZDA0OGZkM2NjOjE3ZWIzMGE0ODVmOjE3ZWIzNzgyNTYyOjg2MDI4Nzgw (substituting each pod name in turn), and obs-loki-ingester-1 does indeed have this file. So something is broken somewhere that keeps Loki from finding it…
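
In other words, roughly this loop, with the same chunk path each time:

    for i in 0 1 2; do
      kubectl exec obs-loki-ingester-$i -n obs -- \
        ls -la /var/loki/chunks/ZmFrZS9hMWVjYThlZDA0OGZkM2NjOjE3ZWIzMGE0ODVmOjE3ZWIzNzgyNTYyOjg2MDI4Nzgw
    done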

I have Grafana pointed at my querier service, http://obs-loki-querier.obs:3100, as a data source. Is that the wrong place to point it? Should the data source be pointing at obs-loki-query-frontend.obs (the query-frontend service) instead?
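
In Grafana provisioning terms, the current data source is equivalent to this (sketch):

    apiVersion: 1
    datasources:
      - name: Loki
        type: loki
        access: proxy
        url: http://obs-loki-querier.obs:3100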

Update 3:
I tried pointing my Grafana data source at obs-loki-query-frontend.obs (the query-frontend service), but then the “Test” button under Data Sources fails… so that doesn’t seem to be the fix.

At this point I think I have determined that using filesystem storage is NOT an option for Loki in distributed mode.
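
If filesystem storage really is the problem, I assume the fix is to point boltdb-shipper and the chunks at an object store instead. A minimal sketch of what I think would change, assuming a MinIO/S3-compatible endpoint (bucket name, endpoint, and credentials are placeholders):

    storage_config:
      boltdb_shipper:
        active_index_directory: /var/loki/index
        cache_location: /var/loki/cache
        shared_store: s3
      aws:
        s3: s3://ACCESS_KEY:SECRET_KEY@minio.obs:9000/loki
        s3forcepathstyle: true

    compactor:
      shared_store: s3

plus object_store: s3 in the schema_config entry.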

Reference: Storage | Grafana Labs