Logs are gone after flushing from the ingester

Hi,

I am running a single-binary Loki 3.2.1 with a filesystem TSDB store, and my logs gradually disappear (within a few hours, roughly matching max_chunk_age) once they are flushed from the ingester. I can no longer access them through the Grafana dashboard, even though no errors are logged.

I came across this issue, which seems to be exactly what I am experiencing. Is there any workaround currently available, apart from downgrading to 2.8.4?

My config:

target: all,write
auth_enabled: false

server:
  http_listen_port: 3100
  grpc_listen_port: 9096
  log_level: info
  grpc_server_max_concurrent_streams: 1000

common:
  instance_addr: 
  path_prefix: /var/lib/loki
  instance_interface_names:
    - eth0
  storage:
    filesystem:
      chunks_directory: /var/lib/loki/chunks
      rules_directory: /var/lib/loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory
    instance_enable_ipv6: true

ingester: 
  chunk_encoding: zstd
  max_chunk_age: 6h
  chunk_idle_period: 3h
  chunk_target_size: 16777216

ingester_rf1:
  enabled: false

schema_config:
  configs:
    - from: 2020-10-24
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h

frontend:
  encoding: protobuf

compactor:
  working_directory: /var/lib/loki/retention
  compaction_interval: 1h
  retention_enabled: false
  retention_delete_delay: 24h
  retention_delete_worker_count: 150  
  delete_request_store: filesystem

table_manager:
  retention_period: 365d

limits_config:
  retention_period: 365d
  retention_stream: []
  max_query_parallelism: 16
  discover_log_levels: false

From this image it can clearly be seen that old logs gradually disappear. This corresponds to the flush events in the Loki log:

If I decrease max_chunk_age, the logs disappear faster.

A couple of questions for you:

  1. You are running a single instance of Loki, correct?
  2. What is the filesystem based on? Docker host mounted volume? EFS? Some other storage solution?

Also, this is likely not related to your problem, but you probably want to disable the table manager and just use the compactor.
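
For illustration, this is roughly what compactor-driven retention could look like, reusing the values already in your posted config (only retention_enabled changes, and the table_manager block is dropped entirely):

compactor:
  working_directory: /var/lib/loki/retention
  compaction_interval: 1h
  retention_enabled: true             # compactor applies retention instead of the table manager
  retention_delete_delay: 24h
  retention_delete_worker_count: 150
  delete_request_store: filesystem

limits_config:
  retention_period: 365d              # the compactor deletes data older than this

With retention_enabled: true, the compactor handles deletion based on limits_config.retention_period, so the table_manager block is no longer needed.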

  1. You are running a single instance of Loki, correct?

Correct.

  1. What is the filesystem based on? Docker host mounted volume? EFS? Some other storage solution?

Plain ZFS, assigned by a Proxmox VE LXC container.

This problem occurred before the table_manager setting was in place.

I’d say double check and make sure chunks are actually written to your permanent storage (and make sure /var/lib/loki is mounted from outside the container).

If your logs are disappearing after flushing, then they aren’t being written to your chunk storage, or they’re being written to the wrong place.

I’d say double check and make sure chunks are actually written to your permanent storage (and make sure /var/lib/loki is mounted from outside the container).

An LXC container is similar to a VM: everything is persistent. Loki is not running in a one-off container like Docker.

If your logs are disappearing after flushing, then they aren’t being written to your chunk storage, or they’re being written to the wrong place.

I can find the index files and chunk files in the expected place, and the directory size is actually increasing. Unless there is a separate path setting for the querier, I’d say it behaves as expected.

  1. What is the latest timestamp for your index and chunk files written to the filesystem?
  2. Try changing target: all,write to target: all (see the snippet below)
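
For the second point, that would just be the top-level target line, with everything else left as posted:

target: all  # instead of target: all,write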

Does your Loki container produce any errors from the querier or query-frontend when querying older logs?

After tinkering with the config file here and there, followed by a hard crash of the entire VM, the problem seems to be fixed.

It looks like the culprit was instance_addr / instance_interface_names in the common section, which I shouldn’t have touched when using the inmemory ring store.
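
In case anyone else hits this, the common section with those two settings removed looks roughly like this (everything else unchanged from the config above):

common:
  path_prefix: /var/lib/loki
  storage:
    filesystem:
      chunks_directory: /var/lib/loki/chunks
      rules_directory: /var/lib/loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory
    instance_enable_ipv6: true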

Also

Does your Loki container produce any errors from the querier or query-frontend when querying older logs?

No errors are indicated. There are, however, lots of debug log lines about mock.go Get - deadline exceeded.