Data unavailable every other day

Hello there and happy new year :christmas_tree:

I discovered that every other day Loki data is not available; I can't say when it started. You can see this on the screenshot. This is not a Grafana bug, as I confirmed the same issue with the `logcli` command.
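For reference, the check outside Grafana was along these lines; the gateway address and stream selector below are placeholders rather than my exact values:

```bash
# Point logcli at the Loki gateway (address is an assumption based on the
# Helm release/namespace) and query a window that shows up empty in Grafana.
export LOKI_ADDR=http://observability-loki-gateway.observability.svc.cluster.local
logcli query '{namespace="observability"}' \
  --from="2025-12-30T00:00:00Z" \
  --to="2025-12-31T00:00:00Z" \
  --limit=20
```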

The setup is running on Kubernetes and deployed using Helm. I have the exact same setup on another cluster and get no issues there. Here is the config.yaml:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  annotations:
    meta.helm.sh/release-name: observability-loki
    meta.helm.sh/release-namespace: observability
  creationTimestamp: "2025-05-30T10:15:32Z"
  labels:
    app.kubernetes.io/instance: observability-loki
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: loki
    app.kubernetes.io/version: 3.6.3
    helm.sh/chart: loki-6.49.0
    helm.toolkit.fluxcd.io/name: loki
    helm.toolkit.fluxcd.io/namespace: observability
  name: loki
  namespace: observability
data:
  config.yaml: |
    auth_enabled: false
    bloom_build:
      builder:
        planner_address: ""
      enabled: false
    bloom_gateway:
      client:
        addresses: ""
      enabled: false
    common:
      compactor_grpc_address: 'observability-loki.observability.svc.cluster.local:9095'
      path_prefix: /var/loki
      storage:
        s3:
          access_key_id: xxxx
          bucketnames: loki-k8s
          endpoint: https://s3.reg.hosting.net/
          http_config:
            insecure_skip_verify: false
          insecure: false
          region: gra
          s3forcepathstyle: true
          secret_access_key: xxxxxxx
    compactor:
      compaction_interval: 4h
      delete_request_store: s3
      retention_delete_delay: 2h
      retention_enabled: true
    frontend:
      max_outstanding_per_tenant: 10000
      scheduler_address: ""
      tail_proxy_url: ""
    frontend_worker:
      scheduler_address: ""
    index_gateway:
      mode: simple
    limits_config:
      max_cache_freshness_per_query: 10m
      query_timeout: 300s
      reject_old_samples: true
      reject_old_samples_max_age: 168h
      retention_period: 365d
      split_queries_by_interval: 24h
      volume_enabled: true
    memberlist:
      join_members:
      - observability-loki-memberlist.observability.svc.cluster.local
    pattern_ingester:
      enabled: false
    query_range:
      align_queries_with_step: true
    query_scheduler:
      max_outstanding_requests_per_tenant: 10000
    ruler:
      storage:
        s3:
          access_key_id: xxxxxxx
          bucketnames: loki-k8s
          endpoint: https://s3.reg.hosting.net/
          http_config:
            insecure_skip_verify: false
          insecure: false
          region: gra
          s3forcepathstyle: true
          secret_access_key: xxxxxxxxxx
        type: s3
      wal:
        dir: /var/loki/ruler-wal
    runtime_config:
      file: /etc/loki/runtime-config/runtime-config.yaml
    schema_config:
      configs:
      - from: "2022-01-11"
        index:
          period: 24h
          prefix: loki_index_
        object_store: s3
        schema: v12
        store: boltdb-shipper
      - from: "2024-10-25"
        index:
          period: 24h
          prefix: loki_index_
        object_store: s3
        schema: v13
        store: tsdb
    server:
      grpc_listen_port: 9095
      http_listen_port: 3100
      http_server_read_timeout: 600s
      http_server_write_timeout: 600s
    storage_config:
      bloom_shipper:
        working_directory: /var/loki/data/bloomshipper
      boltdb_shipper:
        index_gateway_client:
          server_address: ""
      hedging:
        at: 250ms
        max_per_second: 20
        up_to: 3
      tsdb_shipper:
        index_gateway_client:
          server_address: ""
      use_thanos_objstore: false
    tracing:
      enabled: false
```

The data is stored on S3-compatible storage. You can see that the last period has more data than usual, but it will disappear soon, as it does every day. The S3 storage looks OK; at least it has the same structure as a working setup.

How can I debug this issue? There are so many logs that I can't tell which ones could be relevant to the issue, and at the same time I can't get logs for the periods where there is a hole…

If I filter out `level=info` I get entries like these:

```
2026-01-02 10:49:52.235 error level=error ts=2026-01-02T09:49:52.194116437Z caller=errors.go:26 org_id=fake message="closing iterator" error="context canceled"
2026-01-02 10:49:52.180 error level=error ts=2026-01-02T09:49:52.103852212Z caller=errors.go:26 org_id=fake message="closing iterator" error="context canceled"
2026-01-02 10:49:52.180 error ts=2026-01-02T09:49:52.103822423Z caller=spanlogger.go:152 user=fake level=error msg="failed downloading chunks" err="failed to load chunk 'fake/388474d1f8d49819/19b7d5e365c:19b7dcc4df4:b7530eb1': failed to get s3 object: operation error S3: GetObject, https response error StatusCode: 0, RequestID: , HostID: , canceled, context canceled"
2026-01-02 10:49:52.180 error level=error ts=2026-01-02T09:49:52.103776274Z caller=parallel_chunk_fetch.go:74 msg="error fetching chunks" err="failed to load chunk 'fake/388474d1f8d49819/19b7d5e365c:19b7dcc4df4:b7530eb1': failed to get s3 object: operation error S3: GetObject, https response error StatusCode: 0, RequestID: , HostID: , canceled, context canceled"
2026-01-02 10:49:40.398 received a duplicate entry for ts 1767347379784683107
2026-01-02 10:49:40.280 error level=error ts=2026-01-02T09:49:40.27393468Z caller=errors.go:26 org_id=fake message="closing iterator" error="context canceled"
2026-01-02 10:49:40.280 error ts=2026-01-02T09:49:40.272538717Z caller=spanlogger.go:152 user=fake level=error msg="failed downloading chunks" err="context canceled"
2026-01-02 10:49:40.280 error level=error ts=2026-01-02T09:49:40.272455118Z caller=parallel_chunk_fetch.go:74 msg="error fetching chunks" err="context canceled"
2026-01-02 10:49:40.180 error ts=2026-01-02T09:49:40.157035328Z caller=spanlogger.go:152 user=fake level=error msg="failed downloading chunks" err="context canceled"
2026-01-02 10:49:40.180 error level=error ts=2026-01-02T09:49:40.156982489Z caller=parallel_chunk_fetch.go:74 msg="error fetching chunks" err="context canceled"
2026-01-02 10:49:40.017 received a duplicate entry for ts 1767347379410973117
2026-01-02 10:49:36.332 error level=error ts=2026-01-02T09:49:36.280439344Z caller=errors.go:26 org_id=fake message="closing iterator" error="context canceled"
2026-01-02 10:49:36.332 error ts=2026-01-02T09:49:36.280405645Z caller=spanlogger.go:152 user=fake level=error msg="failed downloading chunks" err="failed to load chunk 'fake/94c32ce365c88b7b/19b7d62ea87:19b7dd0d340:ce03313e': failed to get s3 object: operation error S3: GetObject, https response error StatusCode: 0, RequestID: , HostID: , canceled, context canceled"
2026-01-02 10:49:36.332 error level=error ts=2026-01-02T09:49:36.280376955Z caller=parallel_chunk_fetch.go:74 msg="error fetching chunks" err="failed to load chunk 'fake/94c32ce365c88b7b/19b7d62ea87:19b7dd0d340:ce03313e': failed to get s3 object: operation error S3: GetObject, https response error StatusCode: 0, RequestID: , HostID: , canceled, context canceled"
2026-01-02 10:49:27.032 error level=error ts=2026-01-02T09:49:26.933560987Z caller=errors.go:26 org_id=fake message="closing iterator" error="context canceled"
2026-01-02 10:49:27.032 error ts=2026-01-02T09:49:26.933523168Z caller=spanlogger.go:152 user=fake level=error msg="failed downloading chunks" err="failed to load chunk 'fake/6dbb2d6fade8d1ee/19b7d5d5321:19b7dcbb10a:6df80198': failed to get s3 object: operation error S3: GetObject, https response error StatusCode: 0, RequestID: , HostID: , canceled, context canceled"
2026-01-02 10:49:27.032 error level=error ts=2026-01-02T09:49:26.933476459Z caller=parallel_chunk_fetch.go:74 msg="error fetching chunks" err="failed to load chunk 'fake/6dbb2d6fade8d1ee/19b7d5d5321:19b7dcbb10a:6df80198': failed to get s3 object: operation error S3: GetObject, https response error StatusCode: 0, RequestID: , HostID: , canceled, context canceled"
2026-01-02 10:49:23.680 error level=error ts=2026-01-02T09:49:23.626296112Z caller=errors.go:26 org_id=fake message="closing iterator" error="context canceled"
2026-01-02 10:49:23.632 error level=error ts=2026-01-02T09:49:23.59358619Z caller=errors.go:26 org_id=fake message="closing iterator" error="context canceled"
2026-01-02 10:49:11.432 error level=error ts=2026-01-02T09:49:11.393932758Z caller=errors.go:26 org_id=fake message="closing iterator" error="context canceled"
2026-01-02 10:49:02.132 error level=error ts=2026-01-02T09:49:02.071983983Z caller=errors.go:26 org_id=fake message="closing iterator" error="context canceled"
2026-01-02 10:48:52.296 error level=error ts=2026-01-02T09:48:52.195231151Z caller=errors.go:26 org_id=fake message="closing iterator" error="context canceled"
2026-01-02 10:48:52.195 error level=error ts=2026-01-02T09:48:52.095284457Z caller=errors.go:26 org_id=fake message="closing iterator" error="context canceled"
2026-01-02 10:48:52.195 error ts=2026-01-02T09:48:52.095262027Z caller=spanlogger.go:152 user=fake level=error msg="failed downloading chunks" err="context canceled"
2026-01-02 10:48:52.195 error level=error ts=2026-01-02T09:48:52.095225838Z caller=parallel_chunk_fetch.go:74 msg="error fetching chunks" err="context canceled"
```
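(These were pulled with something like the command below; the label selector is an assumption based on the chart's standard labels.)

```bash
# Keep only warning/error lines from all Loki pods in the namespace.
kubectl -n observability logs -l app.kubernetes.io/name=loki --tail=2000 \
  | grep -v 'level=info'
```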

I checked one of the chunks that the error log entries report as failing to load, `fake/388474d1f8d49819/19b7d5e365c:19b7dcc4df4:b7530eb1`, but I can find it on the storage:

```
mc ls bucket_loki/loki-k8s-prod/fake/388474d1f8d49819/19b7d5e365c:19b7dcc4df4:b7530eb1
[2026-01-02 09:21:40 CET]  99KiB STANDARD 19b7d5e365c:19b7dcc4df4:b7530eb1
```

I don’t really understand the write/read path for the data and how it is processed by the various components. Does someone have any clues?

thanks

Which storage? S3?

Also, when you say it’s missing every other day, does it rotate? For example, in your graph you have data from 12/28 to 12/29; when tomorrow comes, does that range still have data, or does it now go missing?

There are several possibilities. Here is what I would do:

  1. Verify your data is actually written into S3.
  2. Lower your querier count to 1 and see if the issue persists (see the sketch right after this list).
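For example, something along these lines (the `mc` alias, bucket path, and workload name are placeholders for your setup):

```bash
# 1. Check that fresh chunks keep landing in the bucket; with auth disabled
#    the tenant is "fake", so the layout is fake/<fingerprint>/<chunk>.
mc ls --recursive bucket_loki/loki-k8s/fake/ | tail -20

# 2. Temporarily run a single querier/reader and re-test the query.
#    Resource kind and name are guesses; adjust to your deployment.
kubectl -n observability scale statefulset/observability-loki --replicas=1
```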

Also, suspiciously, you are splitting queries by 24h, so another possibility is that you are pointing your read path at just one reader instead of the query frontend.

Hi @tonyswumac

Yes, S3, as stated in the message; it is not AWS but storage powered by Ceph. We never had issues with it before today, and as I said, other instances using the same configuration and storage infrastructure are working fine. A ticket was opened with the provider to check the bucket, though.

From my investigation it does not. I compared two screenshots of the same stream:

taken yesterday at 8 a.m.

same time range but seen as of now

You can see that the part starting around 00:00 eventually disappears, and it happens the same way every other day.

  1. How many readers do you have? If you lower the readers to 1 container, does it show the same issue?
  2. How much log volume do you have, roughly?

The error is pretty clear: it cannot download the chunk, for some reason. I would lower the reader count to 1 and see if it works. Also, try exec’ing into a reader container and see if you can get the chunk from inside the container, to rule out any potential network issue (a rough sketch is below).
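For example (the pod name is illustrative, and the Loki image may not ship a shell or wget, in which case `kubectl debug` with an ephemeral container works too). An unauthenticated request will usually come back as 403 AccessDenied, which is fine; what you are ruling out is timeouts, DNS failures, or TLS errors:

```bash
# Open a shell in one of the Loki pods (name is a guess).
kubectl -n observability exec -it observability-loki-0 -- sh

# From inside the container: can we reach the S3 endpoint at all?
# A 403 error from the server is acceptable; connection resets or timeouts are not.
wget -O /dev/null https://s3.reg.hosting.net/loki-k8s/
```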

We use the singleBinary Helm deployment, with 3 Loki replicas and 2 loki-gateway replicas. I assume the reader is on the Loki pods, right? I will scale Loki down to one and see if it changes anything.

We have a retention of 365d on the S3 storage, currently with 1,241,985 objects and a size of 290.23 GiB.

If you are using single binary, then each Loki instance has all components. I would lower it down to just 1 container and see if the problem persists.

Hi

When scaling to 1, Loki does not work anymore; I get:

```
caller=logging.go:144 orgID=fake msg="GET /loki/api/v1/labels?start=1767939272260000000&end=1767942872260000000 (500) 17.617347649s Response: \"too many unhealthy instances in the ring\""
caller=logging.go:144 orgID=fake msg="GET /loki/api/v1/labels?start=1767939272260000000&end=1767942872260000000 (500) 17.408910041s Response: \"too many unhealthy instances in the ring\""
```

It must have at least 2 replicas.
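As far as I understand, that is the ring replication factor rather than a storage requirement: the chart's single-binary mode defaults to a replication factor of 3 (I believe), so a single replica can never reach quorum. If you really want to test with one replica, a values override along these lines should do it (untested sketch; the value path is taken from the upstream grafana/loki chart):

```yaml
# Hypothetical Helm values override: with a replication factor of 1,
# a single Loki replica can serve the ring on its own.
loki:
  commonConfig:
    replication_factor: 1
```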

I started the investigation again from the beginning, and I found that the logs in another namespace show a different pattern: the holes in the chart change after each refresh.

See below for an example; I refreshed every 2 seconds between the screenshots.

So the initial statement “data missing every other day” is not quite true.

For now, I’ll get back to the S3 storage provider, because I don’t see any other root cause.

Does feel like a storage problem.

While you wait for the provider to get back to you, if you have a metrics platform (Prometheus or InfluxDB, doesn’t matter), try to collect metrics from each Loki container. There are metrics on S3 storage operations that will give you the return code, the request time, and other details, which might give you a better idea of what the problem is.
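For example, something along these lines to eyeball them on one pod (the pod name is illustrative, and the exact metric names vary a bit between Loki versions; in the versions I have looked at there is an S3 request-duration histogram with operation and status-code labels):

```bash
# Port-forward one Loki pod and look at its S3 client metrics.
kubectl -n observability port-forward pod/observability-loki-0 3100:3100 &
sleep 2
curl -s http://localhost:3100/metrics | grep -i 's3_request_duration'
```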
