Hi everyone,
I’m running into a recurring issue with Loki (v3.3.0) in a multi-tenant setup. We store over 600GB of logs across multiple tenants, with 6 months of retention.
When running the following query on one of the tenants:
```
count_over_time({env="prod-1"}[15m])
```
…the loki-read pod gets OOM-killed, even though we have an 8GB memory limit set on that container.
This happens fairly consistently, even with short time windows (e.g., 15 minutes), and particularly when using count_over_time.
Our setup:
- Deployed via Helm in simple scalable mode
- 3x loki-read, 3x loki-write, 3x loki-backend (chunk cache)
- Storage: TSDB index, chunks in Azure Blob Storage
- Index period: 24h
- Total log volume: ~600GB over 6 months
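For context, the relevant parts of our config look roughly like this (account and container names are placeholders, and I’ve trimmed everything unrelated):

```yaml
schema_config:
  configs:
    - from: "2024-06-01"
      store: tsdb
      object_store: azure
      schema: v13
      index:
        prefix: index_
        period: 24h

storage_config:
  azure:
    account_name: <storage-account>   # placeholder
    container_name: loki-chunks       # placeholder
    use_managed_identity: true
```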
My questions:
- What memory-related settings should I tune so this type of query can complete successfully? (There’s a sketch of what I’ve been trying after this list.)
- Is there any way to limit the amount of data count_over_time pulls into memory?
- Would a narrower stream selector (more label matchers) help, or does count_over_time still pull too much into memory regardless? (Example of what I mean below.)
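On the first question, here’s the kind of tuning I’ve been experimenting with so far. The values are guesses rather than recommendations, so please correct me if any of these are counterproductive:

```yaml
limits_config:
  split_queries_by_interval: 30m  # smaller splits should mean less memory per subquery
  max_query_parallelism: 16       # cap fan-out across the read pods
  max_query_series: 5000          # fail fast instead of OOMing on high-cardinality results
  query_timeout: 2m

querier:
  max_concurrent: 4               # fewer concurrent subqueries per loki-read pod
```

And on the third question, by “narrower” I mean adding more matchers to the stream selector, e.g.:

```
count_over_time({env="prod-1", app="checkout"} |= "error" [15m])
```

where `app="checkout"` and the `|= "error"` line filter are just examples of how I’d try to cut down the matched streams.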
Any best practices for query optimization or memory tuning in long-retention, multi-tenant setups would be much appreciated.
Thanks in advance!