Loki times out when querying larger data sets

tomasjansson · March 22, 2023, 9:58am

We have ingested our nginx logs into Loki so we can analyze the request time. With small volumes everything works as expected, but with larger volumes queries time out. Initially we hit an error along the lines of “too large series”, but after we raised that limit we now get timeouts instead.

What we think is strange is that even if we change the time range to something really small we still have some problems. The range we’ve tested only yields around 500 rows, but the following query still fails:

count(rate({app="app-name"} |= "request_time" | pattern `<_> "<request>", <_> "<status>", <_> "<user_agent>", <_> <request_time> }` | request_time < 0.3 [5m]))

It feels like we are doing something wrong to run into this. I would expect it to be possible to get a result for 24 hours, which would amount to around 1.5-2M log entries… but maybe my expectations are too high?

tonyswumac · March 22, 2023, 3:44pm

There are some other posts on this forum regarding optimization, might be worth a read. Also some information on what your existing setup looks like would help.

tomasjansson · March 22, 2023, 9:11pm

Thank you for your reply Tony! I did try to search for it, but couldn’t find anything that helped, which is why I posted. We’ve tried everything here: Watch: 5 tips for improving Grafana Loki query performance | Grafana Labs.

The setup is Loki running in a Kubernetes cluster. I don’t have the config available but can get it if needed. We are ingesting the HTTP access log from an nginx app using Promtail, and that works fine. Querying the log also works fine, even over the full set of 24 hours. The problem is when I try to add rate and count.

I really don’t understand why, since even when we limit the time range so that it includes 500 entries or so without count, it still fails when applying count to these 500 rows. It feels like we’ve missed something conceptual about how Loki works when querying the data.

tonyswumac · March 22, 2023, 9:31pm

First I’d try to narrow down why it fails to even return a small set of data. Logs from querier should hopefully provide some insight. Since querying logs directly works fine, I’d guess it’s likely not related to resources.

Also, your query looks incorrect. You’d either do count or rate, not both.

tomasjansson · March 22, 2023, 10:08pm

I think you can do count and rate at the same time? Doesn’t rate split the logs into “batches”, so that I then count within those batches? At least that is the idea. The rate option was also something I got from the query builder when I selected count.

I’ve tried to narrow it down, but failed to find the real cause… that’s why I’m posting here, since I’m not sure how to go about debugging this behavior.

tonyswumac · March 22, 2023, 11:09pm

While I don’t think that’s the reason it’s failing, I also don’t think that’s what you are looking for.

From doc:

rate(unwrapped-range): calculates per second rate of the sum of all values in the specified interval.

This will return a number of series, depending on how many streams your selector resolves to. If you add count() on top, you are really counting the number of “series” returned by rate, not the number of logs.

If you are looking to count the number of log lines, I’d recommend checking out the count_over_time function. Also, try running your query one level at a time, and check the querier / query frontend (if you have one) to see what errors you get.
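To make the difference between counting series and counting log lines concrete, here is a small Python sketch. It is a simplified model and not Loki's implementation: the stream labels, the `pod` label, and the timestamps are made up for illustration, and the window is assumed to cover (end - window, end]. It mimics what count(rate(...)) and sum(count_over_time(...)) would report for the same 500 entries.

```python
from typing import Dict, List

# Hypothetical model: stream labels -> entry timestamps in seconds.
Streams = Dict[str, List[float]]


def count_over_time(streams: Streams, end: float, window: float) -> Dict[str, int]:
    """Entries per stream inside (end - window, end]; streams with none drop out."""
    result = {}
    for labels, stamps in streams.items():
        n = sum(1 for t in stamps if end - window < t <= end)
        if n:
            result[labels] = n
    return result


def rate(streams: Streams, end: float, window: float) -> Dict[str, float]:
    """Entries per second per stream over the same window."""
    counts = count_over_time(streams, end, window)
    return {labels: n / window for labels, n in counts.items()}


# Made-up data: two streams that together hold 500 entries within 5 minutes.
streams = {
    '{app="app-name", pod="a"}': [float(i) for i in range(1, 301)],  # 300 entries
    '{app="app-name", pod="b"}': [float(i) for i in range(1, 201)],  # 200 entries
}
window = 300.0  # [5m]
end = 300.0

# count(rate(...)) counts the series that rate returned.
series_count = len(rate(streams, end, window))
# sum(count_over_time(...)) adds up the log lines across all series.
line_count = sum(count_over_time(streams, end, window).values())

print(series_count, line_count)  # 2 500
```

In this model the count over rate is 2 (one per stream), while the sum of count_over_time is 500, which is the number the original question seems to be after.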

tomasjansson · March 23, 2023, 7:37am

The way you describe count vs count_over_time makes sense. However, it doesn’t completely match what I see in the result. If count with rate is counting the number of series generated by rate, shouldn’t that yield a single number for the whole selected range? We actually get results that we can plot in the graph.

Really appreciate your answers! Thank you!

tonyswumac · March 23, 2023, 4:21pm

If you do count on top of rate you should get a single number on the range, unless the number of series changes during the range (it can happen if your label selector isn’t exhaustive).
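A rough way to picture this is the Python sketch below. It assumes a range query evaluates the expression once per step over the window (t - window, t], and it uses made-up streams and timestamps (the `pod` label is hypothetical). One stream stops logging partway through, so the series count drops between steps.

```python
from typing import Dict, List


def series_with_entries(streams: Dict[str, List[float]], end: float, window: float) -> int:
    """Number of streams with at least one entry in (end - window, end].

    This stands in for count(rate(... [window])) evaluated at time `end`.
    """
    return sum(
        1
        for stamps in streams.values()
        if any(end - window < t <= end for t in stamps)
    )


# Made-up data: pod "a" logs steadily, pod "b" stops after 80 seconds.
streams = {
    '{app="loki", pod="a"}': [10, 70, 130, 190, 250, 310, 370, 430],
    '{app="loki", pod="b"}': [20, 80],
}
window = 120.0
steps = [120.0, 240.0, 360.0, 480.0]

# One value per step, which is what ends up plotted in the graph.
counts = [series_with_entries(streams, t, window) for t in steps]
print(counts)  # [2, 1, 1, 1]
```

Under these assumptions the graph would show 2 at the first step and 1 afterwards; with a stable set of streams it would be a flat line.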

From my own test:

Query: rate({app="loki",aws_account_alias="<MY_ACCOUNT>"} [10m])

Query: count(rate({app="loki",aws_account_alias="<MY_ACCOUNT>"} [10m]))

system · March 22, 2024, 4:21pm

This topic was automatically closed 365 days after the last reply. New replies are no longer allowed.
