Hi everyone,
We’re running into an issue where Loki is not responding to simple, targeted queries over a short time period (15 minutes).
The Problem
A basic query like {namespace="xxx", component="yyy"} consistently fails.
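In case it helps with reproducing this, here is a rough sketch of how the same query could be sent straight to Loki's HTTP API, which takes Grafana out of the picture and shows how long the request actually runs before it ends. The host name is a placeholder and the helper names are made up for illustration. The sketch relies only on the /loki/api/v1/query_range endpoint and its query, start, end, and limit parameters.

```python
import json
import time
import urllib.parse
import urllib.request

# Assumption: placeholder address, replace with the real gateway or query-frontend URL.
LOKI_URL = "http://loki-gateway.example.internal"


def build_query_range_url(base_url, logql, window_minutes=15, now_ns=None, limit=100):
    """Build a query_range URL covering the last `window_minutes` minutes."""
    if now_ns is None:
        now_ns = time.time_ns()
    # Loki accepts start/end as Unix timestamps in nanoseconds.
    start_ns = now_ns - window_minutes * 60 * 1_000_000_000
    params = urllib.parse.urlencode({
        "query": logql,
        "start": start_ns,
        "end": now_ns,
        "limit": limit,
    })
    return f"{base_url.rstrip('/')}/loki/api/v1/query_range?{params}"


def run_query(url, timeout_s=120):
    """Send the request and return (elapsed seconds, short outcome string)."""
    started = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            body = json.load(resp)
            outcome = f"HTTP {resp.status}, {len(body['data']['result'])} result(s)"
    except OSError as exc:
        # URLError, HTTPError, and socket timeouts are all OSError subclasses.
        outcome = f"failed: {exc}"
    return time.monotonic() - started, outcome


# Example call (hypothetical host, so it is left commented out):
#   url = build_query_range_url(LOKI_URL, '{namespace="xxx", component="yyy"}')
#   elapsed, outcome = run_query(url)
#   print(f"{elapsed:.1f}s -> {outcome}")
```

Using a client timeout longer than Grafana's should make it possible to tell whether the query eventually returns or never completes at all.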
Errors We’re Seeing
Grafana Frontend Error:
Get "http://loki-gateway...": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Loki Gateway Logs:
Access logs show HTTP 499 (client closed request) status codes for the query requests.
Querier Logs (RPC and scheduler errors):
level=error ts=... component=querier msg="error processing requests from scheduler" err="rpc error: code = Canceled desc = context canceled"
level=error ts=... component=querier org_id=fake msg="error notifying scheduler about finished query" err=EOF
Research & Context
We’ve reviewed some GitHub issues that seem related, especially since we have also occasionally seen Resource exhausted errors:
https://github.com/grafana/loki/issues/6568
https://github.com/grafana/loki/issues/7649
Our setup seems to be struggling even though the queries are simple and cover a narrow time window. We are looking for advice on what to investigate next. Could this be a bottleneck in the querier/scheduler, a resource allocation problem, or a specific configuration setting we should tune?
Any help or suggestions would be greatly appreciated!