We run a large Loki environment, collecting ~40TB of logs per day, and users often query 24 hours of logs at a time, which can return a log volume graph reporting billions of messages. Obviously this is time-consuming and expensive, so we have our environment set up to scale up and down dynamically while users are running these queries.
One of our biggest pain points is when the Grafana “Log Volume” graph simply doesn’t load, or loads only a tiny sliver of data and never fully populates. There is no feedback to the user about why this happens, and it erodes their trust in the system.
In a recent situation where this was occurring, I dug through the logs and found that our query-frontend pods were reporting:
ts=2024-07-18T18:34:50.379647467Z caller=spanlogger.go:109 middleware=QueryShard.astMapperware level=warn msg="failed mapping AST" err="context canceled" query="sum by (level) (count_over_time({k8s_namespace_name=\"istio-gateways\"} | drop __error__[1m]))"
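(Formatted for readability, the query from that log line is the log volume query Grafana generates, assuming the label name is k8s_namespace_name and was just split by a terminal line wrap in the original paste:

sum by (level) (
  count_over_time({k8s_namespace_name="istio-gateways"} | drop __error__[1m])
)
)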
I don’t get much other data beyond that, though. What can cause these queries to fail like this? Which component would be canceling the context at that point? Could it be Grafana itself?