Failed to load log volume - "failed mapping AST" messages?

We run a large Loki environment, collecting ~40TB of logs per day, and users often query 24 hours of logs at a time, which can return a log volume graph that reports billions of messages. Obviously this is time-consuming and expensive, so we have our environment set up to scale up and down dynamically when users are making these queries.

One of our biggest pain points is when the Grafana “Log Volume” graph simply doesn’t load, or loads a tiny sliver of data but never fully populates. There is no feedback to the user as to why this happens, and it makes them trust the system less.

In a recent situation where this was occurring, I dug through the logs and found that our query-frontend pods were reporting:

ts=2024-07-18T18:34:50.379647467Z caller=spanlogger.go:109 middleware=QueryShard.astMapperware level=warn msg="failed mapping AST" err="context canceled" query="sum by (level) (count_over_time({k8s_namespace_name=\"istio-gateways\"} | drop __error__[1m]))"
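
Unescaped, the query in that log line matches the log volume query Grafana issues alongside the main query:

    sum by (level) (count_over_time({k8s_namespace_name="istio-gateways"} | drop __error__[1m]))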

I don’t really get any other data though… what can cause these to fail like this? What component would be canceling the context at that point? Could it be Grafana itself?


I don't really have a lot of experience operating a big Loki cluster (we are probably at 2TB of logs per day). I've found that if you only scale up readers when requests come in, it's often too late, so we try to estimate our daytime (business-hour) usage, scale to maybe 80% of that ahead of time, and then scale down outside business hours; it makes for a better user experience.
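
For what it's worth, the scheduling side of this can be as simple as a small script run from a CronJob that patches the read Deployment's replica count. A rough sketch, assuming a simple scalable deployment named loki-read in a loki namespace and made-up replica numbers (swap in whatever your setup actually uses):

    # Sketch: pre-scale Loki readers for business hours, scale down after.
    # Assumptions (adjust to your environment): a deployment "loki-read" in
    # namespace "loki", business hours 08:00-18:00 Mon-Fri, and an estimated
    # business-hour peak of 30 read replicas.
    from datetime import datetime
    from kubernetes import client, config

    PEAK_READ_REPLICAS = 30        # estimated business-hour peak (assumption)
    OFF_HOURS_REPLICAS = 5         # baseline outside business hours (assumption)
    BUSINESS_HOURS = range(8, 18)  # assumed business window

    def desired_replicas(now: datetime) -> int:
        # ~80% of the estimated peak during business hours, baseline otherwise
        if now.weekday() < 5 and now.hour in BUSINESS_HOURS:
            return int(PEAK_READ_REPLICAS * 0.8)
        return OFF_HOURS_REPLICAS

    def main() -> None:
        config.load_incluster_config()  # use load_kube_config() outside the cluster
        apps = client.AppsV1Api()
        apps.patch_namespaced_deployment_scale(
            name="loki-read",           # hypothetical deployment name
            namespace="loki",           # hypothetical namespace
            body={"spec": {"replicas": desired_replicas(datetime.now())}},
        )

    if __name__ == "__main__":
        main()

Run something like that every half hour or so and the readers are already warm before people start querying, instead of scaling up after the first slow query.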

You might also consider posting your question in the Slack channel; there are many people there with more experience than I have.