Loki Helm: loki-write: read-only filesystem + readiness probe 503

Description:

After switching from loki-stack to the grafana/loki Helm chart in SimpleScalable mode, I’m facing multiple issues that make the deployment unreliable. The setup is not straightforward, and I’m encountering errors that prevent proper log ingestion and retrieval.

loki.storage.object_store doesn't get filled in: I can't see my object_store settings in the compiled config when I dump it with:
k exec -it loki-write-0 -- loki -print-config-stderr -config.file=/etc/loki/config/config.yaml > loki.out.yaml
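
For reference, this is roughly what I would expect the compiled config to contain under common.storage when the chart values are applied (a minimal sketch; the key names come from Loki's s3_storage_config documentation, the values are placeholders for my OVH setup), but in my dump this section is not filled in:

common:
  storage:
    s3:
      endpoint: https://s3.gra.io.cloud.ovh.net/
      region: gra
      bucketnames: <my-bucket>
      access_key_id: <access-key>
      secret_access_key: <secret-key>
      s3forcepathstyle: true
      insecure: false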

  • loki-write-* pods:
    level=error ts=2025-05-16T09:53:06.250813733Z caller=flush.go:261 component=ingester loop=5 org_id=fake msg="failed to flush" retries=1 err="failed to flush chunks:
    store put chunk: mkdir fake: read-only file system,

    num_chunks: 20, labels: {app="web-app", container="web-app", filename="/var/log/pods/pqp-squad-rd-sns-topic-detection_sns-topic-detection-web-app-64d9d4f448-zc7gq_b01db329-8b99-4b83-b624-19e0f027cb14/web-app/0.log", instance="sns-topic-detection", job="pqp-squad-rd-sns-topic-detection/web-app", namespace="pqp-squad-rd-sns-topic-detection",
    node_name="nodepool-pqp-preproduction-b21-node-d89ece", pod="sns-topic-detection-web-app-64d9d4f448-zc7gq",
    service_name="web-app", stream="stdout"}"
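
The "mkdir fake: read-only file system" part looks to me like the ingester is falling back to a filesystem object store with a relative path (it tries to create the "fake" tenant directory in its working directory) instead of using S3 or a directory under the writable /var/loki PVC mount. For comparison, a minimal sketch of what I would expect the compiled filesystem section to look like if it were rooted under /var/loki (same paths as the filesystem block in my values below):

common:
  storage:
    filesystem:
      chunks_directory: /var/loki/chunks
      rules_directory: /var/loki/rules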

loki-write-0 0/1 Running 3 (76m ago) 77m
loki-write-1 0/1 Running 0 77m
loki-write-2 0/1 Running 0 6m35s

Warning Unhealthy 98s (x32 over 6m18s) kubelet Readiness probe failed: HTTP probe failed with statuscode: 503

- Cannot retrieve older logs.

  • Read-only filesystem: the ingesters cannot write chunks to /var/loki/chunks due to a read-only filesystem, even though the volume is mounted read-write.
  • Cannot retrieve older logs: logs older than 2 days are not accessible in Grafana, despite the retention period being set to 720 hours (30 days).
  • Default configuration issue: the default gRPC message size is too low to handle logs from containers like Traefik. These settings should be included by default in the Helm chart (see the sketch after this list):
    grpc_server_max_recv_msg_size: 104857600 # 100 MB
    grpc_server_max_send_msg_size: 104857600 # 100 MB
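
For what it's worth, this is where those gRPC settings ended up in my Helm values (a sketch assuming the chart passes loki.server straight through to Loki's server block; same values as above):

loki:
  server:
    grpc_server_max_recv_msg_size: 104857600 # 100 MB
    grpc_server_max_send_msg_size: 104857600 # 100 MB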

To Reproduce

  1. Deploy the grafana/loki Helm chart with the provided configuration (see below) and an S3 bucket.
  2. Send logs to Loki using Grafana Alloy.
  3. Check the logs of loki-write-* pods for errors (e.g., read-only filesystem).
  4. Try querying logs from a Traefik container in Grafana for a period longer than 2 days.
  5. Run: k exec -it loki-write-0 -- loki -print-config-stderr -config.file=/etc/loki/config/config.yaml > loki.out.yaml and look at the object_store section.

Expected behavior

  1. No errors in loki-write-* pod logs.
  2. Ability to query logs from the entire retention period (720 hours) without issues.
  3. Default gRPC settings that support large log entries from containers like Traefik.

Environment:

  • Infrastructure: Kubernetes with 10 nodes (4 CPU, 15 Gi RAM each).
  • k8s:1.31.1
  • Deployment tool: helm
  • grafana/loki (latest)
  • grafana/alloy (latest)
  • kube-prometheus-stack (latest)
  • S3 provider: OVH (AWS S3-compatible bucket)

Alloy config

# alloy.yml
# Variables for deploying Grafana Alloy

alloy_vars:
  crds:
    # -- Install the CRDs for monitoring
    create: true

  ## Alloy settings
  alloy:
    configMap:
      # -- Create a new ConfigMap for the configuration file
      create: true
      # -- Name of the existing ConfigMap to use if create is false
      name: alloy-config
      # -- Content of the ConfigMap. Supports templating via 'tpl'
      content: |-
        livedebugging {
          enabled = true
        }

        logging {
          level  = "info"
          format = "logfmt"
        }

        discovery.kubernetes "kubernetes_pods" {
          role = "pod"
        }

        discovery.kubernetes "nodes" {
          role = "node"
        }

        discovery.kubernetes "services" {
          role = "service"
        }

        discovery.kubernetes "endpoints" {
          role = "endpoints"
        }

        discovery.kubernetes "endpointslices" {
          role = "endpointslice"
        }

        discovery.kubernetes "ingresses" {
          role = "ingress"
        }

        // -- Kubernetes service discovery configuration
        discovery.relabel "kubernetes_pods" {
            targets = discovery.kubernetes.kubernetes_pods.targets

            rule {
                source_labels = ["__meta_kubernetes_pod_controller_name"]
                regex         = "([0-9a-z-.]+?)(-[0-9a-f]{8,10})?"
                target_label  = "__tmp_controller_name"
            }

            rule {
                source_labels = ["__meta_kubernetes_pod_label_app_kubernetes_io_name", "__meta_kubernetes_pod_label_app", "__tmp_controller_name", "__meta_kubernetes_pod_name"]
                regex         = "^;*([^;]+)(;.*)?$"
                target_label  = "app"
            }

            rule {
                source_labels = ["__meta_kubernetes_pod_label_app_kubernetes_io_instance", "__meta_kubernetes_pod_label_instance"]
                regex         = "^;*([^;]+)(;.*)?$"
                target_label  = "instance"
            }

            rule {
                source_labels = ["__meta_kubernetes_pod_label_app_kubernetes_io_component", "__meta_kubernetes_pod_label_component"]
                regex         = "^;*([^;]+)(;.*)?$"
                target_label  = "component"
            }

            rule {
                source_labels = ["__meta_kubernetes_pod_node_name"]
                target_label  = "node_name"
            }

            rule {
                source_labels = ["__meta_kubernetes_namespace"]
                target_label  = "namespace"
            }

            rule {
                source_labels = ["namespace", "app"]
                separator     = "/"
                target_label  = "job"
            }

            rule {
                source_labels = ["__meta_kubernetes_pod_name"]
                target_label  = "pod"
            }

            rule {
                source_labels = ["__meta_kubernetes_pod_container_name"]
                target_label  = "container"
            }

            rule {
                source_labels = ["__meta_kubernetes_pod_uid", "__meta_kubernetes_pod_container_name"]
                separator     = "/"
                target_label  = "__path__"
                replacement   = "/var/log/pods/*$1/*.log"
            }

            rule {
                source_labels = ["__meta_kubernetes_pod_annotationpresent_kubernetes_io_config_hash", "__meta_kubernetes_pod_annotation_kubernetes_io_config_hash", "__meta_kubernetes_pod_container_name"]
                separator     = "/"
                regex         = "true/(.*)"
                target_label  = "__path__"
                replacement   = "/var/log/pods/*$1/*.log"
            }

            rule {
                source_labels = ["__meta_kubernetes_namespace"]
                regex         = "kube-system|kube-node-lease|kube-public|cattle-fleet-system|cattle-impersonation-system|cattle-system|cert-manager|velero|monitoring|kyverno|trivy-operator"
                action        = "drop"
            }
        }

        local.file_match "kubernetes_pods" {
            path_targets = discovery.relabel.kubernetes_pods.output
        }

        // -- Log source configuration for Kubernetes pods
        loki.process "kubernetes_pods" {
            forward_to = [loki.write.default.receiver]

            stage.cri { }

            stage.drop {
                expression = ".*(\\/health|\\/metrics|\\/ping).*"
            }
        }

        loki.source.file "kubernetes_pods" {
            targets               = local.file_match.kubernetes_pods.targets
            forward_to            = [loki.process.kubernetes_pods.receiver]
            legacy_positions_file = "/run/promtail/positions.yaml"
        }

        loki.write "default" {
                endpoint {
                        url = "http://loki-write:3100/loki/api/v1/push"
                }
                external_labels = {}
            }

    clustering:
      # -- Deploy Alloy in clustered mode to enable load distribution
      enabled: true
    # -- Minimum stability level of components to enable
    # Must be one of: "experimental", "public-preview", or "generally-available"
    stabilityLevel: "generally-available"

    extraPorts:
      - name: otlp-grpc
        port: 4317
        targetPort: 4317
        protocol: TCP
      - name: otlp-http
        port: 4318
        targetPort: 4318
        protocol: TCP

    mounts:
      # -- Mount /var/log from the host into the container for log collection.
      varlog: true
      # -- Mount /var/lib/docker/containers from the host into the container for log
      # collection.
      dockercontainers: false

    resources: {}

  serviceMonitor:
    enabled: false
    # -- Additional labels for the serviceMonitor
    additionalLabels:
      release: prometheus
    # -- Scrape interval. Uses the Prometheus default interval if not set
    interval: ""
    # -- Metric relabeling configuration applied after scraping, but before ingestion
    # Référence: https://github.com/prometheus-operator/prometheus-operator/blob/main/Documentation/api.md#relabelconfig
    metricRelabelings: []
    # - action: keep
    #   regex: 'kube_(daemonset|deployment|pod|namespace|node|statefulset).+'
    #   sourceLabels: [__name__]

    # -- Relabeling configuration applied before scraping
    # Référence: https://github.com/prometheus-operator/prometheus-operator/blob/main/Documentation/api.md#relabelconfig
    relabelings: []
    # - sourceLabels: [__meta_kubernetes_pod_node_name]
    #   separator: ;
    #   regex: ^(.*)$
    #   targetLabel: nodename
    #   replacement: $1
    #   action: replace

Loki (Ansible-templated) config:

# loki.yml
# Variables for the Loki deployment
# useful links: https://grafana.com/docs/loki/latest/configure/ ; https://grafana.com/docs/loki/latest/setup/migrate/migrate-to-tsdb/
# To view the Loki configuration from the ConfigMap: kubectl -n loki get configmap loki -o yaml
# To view the full compiled Loki configuration: kubectl exec -it loki-write-0 -n loki -- loki -print-config-stderr -config.file=/etc/loki/config/config.yaml
loki_vars:
  # Architecture --------------------
  deploymentMode: SimpleScalable
  write:
    replicas: "{{ loki_write_replicas | default(3) }}"
    resources: "{{ loki_write_resources | default('{}') }}"
    persistence:
      storageClass: csi-cinder-high-speed
      size: "{{ loki_writepvc_size | default('20Gi') }}"
  read:
    replicas: "{{ loki_read_replicas | default(3) }}"
    resources: "{{ loki_read_resources | default('{}') }}"
    persistence:
      storageClass: csi-cinder-high-speed
      size: "{{ loki_readpvc_size | default('20Gi') }}"
  backend:
    replicas: "{{ loki_backend_replicas | default(3) }}"
    resources: "{{ loki_backend_resources | default('{}') }}"
    persistence:
      storageClass: csi-cinder-high-speed
      size: "{{ loki_backendpvc_size | default('20Gi') }}"
  chunksCache:
    enabled: true
    replicas: "{{ loki_chunkscache_replicas | default(1) }}"
    resources: "{{ loki_chunks_cache_resources | default('{}') }}" # By default a safe memory limit will be requested based on allocatedMemory value (floor (* 1.2 allocatedMemory))
    batchSize: 4
    parallelism: 5
    timeout: 2000ms
    allocatedMemory: "{{ loki_chunks_cache_allocatedmemory | default(1024) }}" # in MB, default is 8192
    maxItemMemory: 5
    writebackSizeLimit: 500MB
    writebackBuffer: 500000
    writebackParallelism: 1
    persistence:
      enabled: false
      storageClass: csi-cinder-high-speed
      storageSize: "{{ loki_chunkspvc_size | default('20Gi') }}"
  gateway:
    enabled: true
  # The Loki canary pushes logs to and queries from this Loki installation to test that it is working correctly
  lokiCanary:
    enabled: true
  # Loki configuration --------------------
  # https://grafana.com/docs/loki/latest/configure/
  loki:
    image:
      tag: 3.4.2
    tracing:
      enabled: true
    auth_enabled: false # Disables multi-tenancy; the tenant ID will be "fake". For multi-tenancy, an "X-Scope-OrgID" header must be used https://grafana.com/docs/loki/latest/operations/multi-tenancy/
    commonConfig:
      replication_factor: 3 # Must be the same as the number of Loki replicas (integer)
    # Stockage --------------------
    storage:
      bucketNames:
        # -- Names of the buckets used for data storage
        # -- These names are used when storing data in S3
        # When deploying Loki using S3 Storage DO NOT use the default bucket names; chunk, ruler and admin. Choose a unique name for each bucket!
        chunks: "{{ loki_chunks_bucketnames | default('loki') }}"
        ruler: "{{ loki_ruler_bucketnames | default('loki') }}"
        admin: "{{ loki_admin_bucketnames | default('loki') }}"
      type: s3
      s3:
        s3: "{{ loki_s3 | default('null') }}" # s3://access_key:secret_access_key@custom_endpoint/bucket_name
        endpoint: "{{ loki_s3_endpoint | default('https://s3.gra.io.cloud.ovh.net/') }}"
        region: "{{ loki_s3_region | default('gra') }}"
        secretAccessKey: "{{ loki_s3_secretaccesskey | default('') }}"
        accessKeyId: "{{ loki_s3_accesskeyid | default('') }}"
        signatureVersion: "{{ loki_s3_signatureversion | default('v4') }}"
        s3ForcePathStyle: "{{ loki_s3_s3forcepathstyle | default('true') }}"
        insecure: false
        http_config: {}
        # -- Check https://grafana.com/docs/loki/latest/configure/#s3_storage_config for more info on how to provide a backoff_config
        backoff_config: {}
        disable_dualstack: false
      filesystem:
        chunks_directory: /var/loki/chunks
        rules_directory: /var/loki/rules
        admin_api_directory: /var/loki/admin
      object_store:
        # Type of object store. Valid options are: s3, gcs, azure
        type: s3
        # Optional prefix for storage keys
        storage_prefix: null
        # S3 configuration (when type is "s3")
        s3:
          endpoint: "{{ loki_s3_endpoint | default('https://s3.gra.io.cloud.ovh.net/') }}"
          region: "{{ loki_s3_region | default('gra') }}"
          access_key_id: "{{ loki_s3_accesskeyid | default('') }}"
          secret_access_key: "{{ loki_s3_secretaccesskey | default('') }}"
          insecure: false
          # Optional server-side encryption configuration
          sse: {}
          # Optional HTTP client configuration
          http: {}
    schemaConfig:
      configs:
        - from: "2025-04-16"
          store: tsdb
          object_store: filesystem
          schema: v13
          index:
            prefix: index_
            period: 24h
        - from: "2025-05-11"
          object_store: s3
          store: tsdb
          schema: v13
          index:
            prefix: index_
            period: 24h
    ingester:
      wal:
        enabled: true
        dir: /var/loki/wal
      chunk_encoding: snappy
      # The targeted _uncompressed_ size in bytes of a chunk block. When this threshold is exceeded, the head block is cut and compressed inside the chunk. default = 262144
      chunk_block_size: 262144 # 256 KiB
      # How long chunks should sit in-memory with no updates before being flushed if they don't hit the max block size. default = 30m
      chunk_idle_period: 5m
      # How long chunks should be retained in-memory after they've been flushed. default = 0s
      chunk_retain_period: 0s
    compactor:
      compaction_interval: 10m
      retention_delete_delay: 2h
      retention_delete_worker_count: 150
      retention_enabled: true
      delete_request_store: s3
    ingester_client:
      grpc_client_config:
        max_recv_msg_size: 104857600 # 100 MB
        max_send_msg_size: 104857600 # 100 MB
    limits_config:
      allow_structured_metadata: true
      max_cache_freshness_per_query: 10m
      max_query_parallelism: 32 # default 32
      tsdb_max_query_parallelism: 128 # default 128
      max_entries_limit_per_query: 5000
      max_query_lookback: "{{ loki_retention_period | default('720h') }}" # must match retention_period
      retention_period: "{{ loki_retention_period | default('720h') }}" # default 30 days
      query_timeout: 300s
      reject_old_samples: true
      reject_old_samples_max_age: 168h # default 7 days
      split_queries_by_interval: 1h
      volume_enabled: true
    querier:
      extra_query_delay: 250ms
      max_concurrent: 10 # Default is 4, if you have enough memory and CPU you can increase, reduce if OOMing
    query_scheduler:
      max_outstanding_requests_per_tenant: 4096
    server:
      http_listen_port: 3100
      grpc_listen_port: 9095
      http_server_read_timeout: 600s
      http_server_write_timeout: 600s
      grpc_server_max_recv_msg_size: 104857600 # 100 MB
      grpc_server_max_send_msg_size: 104857600 # 100 MB
      grpc_server_max_concurrent_streams: 1000
    pattern_ingester:
      enabled: true
  # Monitoring Loki metrics  --------------------
  monitoring:
    dashboards:
      enabled: true
    serviceMonitor:
      enabled: true
      labels:
        release: prometheus

Any help or recommendations would be appreciated because I have no idea what to change.
I don't understand why this Loki configuration is so unstable.

loki-stack was so easy to use and this one is a pain.

The priority is to get my loki-write pods running again.

I should also mention that when I upgrade the release, the loki-write pods stay in Terminating for a very long time; I usually have to run: k delete pod loki-write-x --force.
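
On the slow termination: as far as I understand, the chart deliberately gives the write pods a long termination grace period so the ingesters can flush their chunks before shutting down; since flushing fails here, they presumably hang until that grace period expires. A sketch of where I think this knob lives in the values (the key name and the 300s default are assumptions on my part, not something I verified in the chart):

write:
  # assumed default; lowering it would only hide the symptom, flushes should succeed instead
  terminationGracePeriodSeconds: 300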

I think I resolved my problem by deleting this schema entry:

        - from: "2025-04-16"
          store: tsdb
          object_store: filesystem
          schema: v13
          index:
            prefix: index_
            period: 24h
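
With that entry removed, the schema_config only contains the S3-backed period, roughly like this (a sketch of what remains in my values; note that any chunks written under the deleted filesystem period are presumably no longer queryable, which I assume is why logs older than the switch date were missing anyway):

    schemaConfig:
      configs:
        - from: "2025-05-11"
          store: tsdb
          object_store: s3
          schema: v13
          index:
            prefix: index_
            period: 24h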