Hello all,
I’m having an odd problem where my logs become unavailable in Grafana at somewhat fixed points in time. For example, today the logs from before 6 AM vanished. Yesterday, the logs from before 1 PM vanished, and before that, the logs from before 1 AM.
It may be happening every 6 hours or so, but I don’t see why that would be.
My current config is below.
Has anyone seen similar behavior?
loki:
  auth_enabled: false # when true, this enables multi-tenancy
  commonConfig:
    replication_factor: 2 # determines how many nodes are required; since we only have 2 AKS nodes, set to max. 2
  storage:
    bucketNames: # these refer to the containers in Azure storage
      chunks: loki-chunks
      ruler: loki-ruler # <- not used at this point
      admin: loki-admin # <- not used at this point
    type: azure
    azure:
      accountKey: ${AZURE_ACCOUNT_KEY}
      accountName: ${AZURE_ACCOUNT_NAME}
      requestTimeout: 30s
  query_scheduler:
    max_outstanding_requests_per_tenant: 4096
  frontend:
    max_outstanding_per_tenant: 4096
  limits_config:
    ingestion_rate_mb: 16
    ingestion_burst_size_mb: 32
    max_query_parallelism: 32
    reject_old_samples: false
    reject_old_samples_max_age: 1w # <- ignored when old samples are not rejected
    split_queries_by_interval: 30m
    deletion_mode: filter-and-delete
    # Retention settings -> https://grafana.com/docs/loki/latest/operations/storage/retention/
    retention_period: 744h # global setting for when old logs are deleted
    retention_stream: # individual settings for streams
      - selector: '{namespace=~"loki|monitoring"}'
        priority: 1
        period: 72h
  compactor:
    retention_enabled: true # allows deletion of records via the HTTP API
gateway:
  basicAuth:
    enabled: true
    existingSecret: loki-gateway-credentials
test:
  enabled: false
monitoring:
  selfMonitoring:
    enabled: false
    grafanaAgent:
      installOperator: false
  lokiCanary:
    enabled: false
  serviceMonitor:
    enabled: false
read:
  extraEnvFrom:
    - secretRef:
        name: loki-azure-credentials
  extraArgs:
    - "-config.expand-env=true"
  replicas: 2
  resources:
    requests:
      cpu: 0.05
      memory: 6Gi
    limits:
      memory: 6Gi
write:
  extraEnvFrom:
    - secretRef:
        name: loki-azure-credentials
  extraArgs:
    - "-config.expand-env=true"
  replicas: 2
  resources:
    requests:
      cpu: 0.05
      memory: 3Gi
    limits:
      memory: 3Gi
backend:
  extraEnvFrom:
    - secretRef:
        name: loki-azure-credentials
  extraArgs:
    - "-config.expand-env=true"
  replicas: 2
  resources:
    requests:
      cpu: 0.05
      memory: 512Mi
    limits:
      memory: 512Mi
Edit: It turns out the PVCs of the write pods were full. This might have been caused by repeatedly installing and uninstalling the Helm chart without always clearing the object store. It’s still a mystery to me, but the logs have been retained ever since I completely cleared everything and did a clean reinstall of the Helm chart.
Now the question I couldn’t find an answer to: should the PVCs fill up at all, or does their content get recycled at some point? The PVCs were 10 GB, and one service generated a spike of 23 GB of logs, which may have overwhelmed the nodes. The observed behavior could be a side effect of that, but I’m really just guessing here.
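In case it helps anyone running into the same thing, the mitigation I’m looking at is simply giving the write pods more room, since (as far as I understand) the WAL and any chunks not yet flushed to object storage live on that PVC. This is only a sketch against the values above; I’m assuming the chart exposes the PVC size as write.persistence.size (10Gi seems to be the default in recent chart versions), so please double-check against your chart version:

write:
  persistence:
    # Assumed key for the write StatefulSet's data PVC (default appears to be 10Gi).
    # The WAL and not-yet-flushed chunks are stored here, so it needs headroom for ingest spikes.
    size: 30Gi

That still doesn’t answer whether the content on the volume is supposed to be recycled on its own, though.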
Best regards
Thomas