Logs not available after only a few hours

Hello all,

I’m having an odd problem where my logs become unavailable in Grafana at somewhat fixed points in time. For example, today the logs from before 6 AM vanished. Yesterday it was the logs from before 1 PM, and before that, the logs from before 1 AM.

It may be happening roughly every 6 hours, but I don’t see why it would.
My current config is below.

Did anyone see similar behavior?

loki:
  auth_enabled: false # when true, this enables multi-tenancy
  commonConfig:
    replication_factor: 2 # Determines how many nodes are required; since we only have 2 AKS nodes, set to a maximum of 2.
  storage:
    bucketNames: # These refer to the containers in Azure storage
      chunks: loki-chunks
      ruler: loki-ruler # < not used at this point
      admin: loki-admin # < not used at this point
    type: azure
    azure:
      accountKey: ${AZURE_ACCOUNT_KEY}
      accountName: ${AZURE_ACCOUNT_NAME}
      requestTimeout: 30s

  query_scheduler:
    max_outstanding_requests_per_tenant: 4096

  frontend:
    max_outstanding_per_tenant: 4096

  limits_config:
    ingestion_rate_mb: 16
    ingestion_burst_size_mb: 32
    max_query_parallelism: 32
    reject_old_samples: false
    reject_old_samples_max_age: 1w # <- ignored when old samples are not rejected
    split_queries_by_interval: 30m
    deletion_mode: filter-and-delete
    # Retention settings -> https://grafana.com/docs/loki/latest/operations/storage/retention/
    retention_period: 744h # Global setting for how long logs are kept before deletion
    retention_stream: # per-stream overrides of the global retention
      - selector: '{namespace=~"loki|monitoring"}'
        priority: 1
        period: 72h

  compactor:
    retention_enabled: true # Required for retention and for deletion of log entries via the HTTP API


gateway:
  basicAuth:
    enabled: true
    existingSecret: loki-gateway-credentials

test:
  enabled: false

monitoring:
  selfMonitoring:
    enabled: false
    grafanaAgent:
      installOperator: false
  lokiCanary:
    enabled: false
  serviceMonitor:
    enabled: false

read:
  extraEnvFrom:
    - secretRef:
        name: loki-azure-credentials
  extraArgs:
    - "-config.expand-env=true"
  replicas: 2
  resources:
    requests:
      cpu: 0.05
      memory: 6Gi
    limits:
      memory: 6Gi


write:
  extraEnvFrom:
    - secretRef:
        name: loki-azure-credentials
  extraArgs:
    - "-config.expand-env=true"
  replicas: 2
  resources:
    requests:
      cpu: 0.05
      memory: 3Gi
    limits:
      memory: 3Gi

backend:
  extraEnvFrom:
    - secretRef:
        name: loki-azure-credentials
  extraArgs:
    - "-config.expand-env=true"
  replicas: 2
  resources:
    requests:
      cpu: 0.05
      memory: 512Mi
    limits:
      memory: 512Mi

Edit: It turns out the PVCs of the write pods were full. This might have been caused by repeatedly installing and uninstalling the Helm chart without always clearing the object store. It’s still a mystery to me, but the logs have stayed retained since I completely cleared everything and did a clean reinstall of the Helm chart.
Now the question I couldn’t find an answer to: should the PVCs fill up at all, or does their content get recycled at some point? The PVCs were 10 GB, and a service generated a spike of 23 GB of logs, which may have overwhelmed the nodes. The observed behavior could be a side effect of that, but I’m really just guessing here.
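In case it helps: my write PVCs seem to come from the chart’s default volume claims, and I’m assuming their size can be raised via write.persistence in the Helm values. A minimal sketch of what I’m trying next (the write.persistence.size key is my assumption about the chart, and 30Gi is just a guess based on the 23 GB spike, not a recommendation):

write:
  persistence:
    # Size of the volume claim backing each write pod; in my install the default appeared to be 10Gi
    size: 30Gi

I haven’t verified that this is the right knob for the simple scalable deployment, so treat it as an assumption rather than a fix.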

Best regards
Thomas
