Performance issue

I have a small logging system for DNS queries and I tried to build a metrics dashboard (top queried hosts and so on), but I ran into really poor performance and behaviour that is unclear to me.
My Loki stack was started from this compose file, without any changes: https://github.com/grafana/loki/blob/master/production/docker/docker-compose-ha-memberlist.yaml

To push logs I am using the Docker driver: Docker driver client | Grafana Loki documentation

time logcli query '{q_hostname="dns-cache01",q_message="CLIENT_QUERY",q_type="A"}' --since=50m --limit 10 -q --stats
Ingester.TotalReached 3
Ingester.TotalChunksMatched 115
Ingester.TotalBatches 0
Ingester.TotalLinesSent 0
Ingester.HeadChunkBytes 42 kB
Ingester.HeadChunkLines 437
Ingester.DecompressedBytes 0 B
Ingester.DecompressedLines 0
Ingester.CompressedBytes 19 MB
Ingester.TotalDuplicates 0
Store.TotalChunksRef 0
Store.TotalChunksDownloaded 0
Store.ChunksDownloadTime 0s
Store.HeadChunkBytes 0 B
Store.HeadChunkLines 0
Store.DecompressedBytes 0 B
Store.DecompressedLines 0
Store.CompressedBytes 0 B
Store.TotalDuplicates 0
Summary.BytesProcessedPerSecond 7.0 MB
Summary.LinesProcessedPerSecond 72510
Summary.TotalBytesProcessed 42 kB
Summary.TotalLinesProcessed 437
Summary.ExecTime 6.026743ms
2020-11-28T21:44:37+02:00 {} q_host=dns-cache01 q_domain=110.250.188.20.zz.countries.nerd.dk q_src_ip=194.0.200.251 q_type=A q_code=NOERROR
2020-11-28T21:44:37+02:00 {} q_host=dns-cache01 q_domain=gmr-smtp-in.l.google.q_src_ip=178.20.158.192 q_type=A q_code=NOERROR
2020-11-28T21:44:37+02:00 {} q_host=dns-cache01 q_domain=119.206.150.45.zz.countries.nerd.dk q_src_ip=194.0 q_type=A q_code=NOERROR
2020-11-28T21:44:37+02:00 {} q_host=dns-cache01 q_domain=buy-sat q_src_ip=193.200. q_type=A q_code=NOERROR
2020-11-28T21:44:37+02:00 {} q_host=dns-cache01 q_domain=api.dropbox q_src_ip=10.5.0 q_type=A q_code=NOERROR
2020-11-28T21:44:37+02:00 {} q_host=dns-cache01 q_domain=ip102.ip-167-114-25. q_src_ip=178.20 q_type=A q_code=NOERROR
2020-11-28T21:44:37+02:00 {} q_host=dns-cache01 q_domain=ip102.ip-167-114-25 q_src_ip=178.20 q_type=A q_code=NOERROR
2020-11-28T21:44:37+02:00 {} q_host=dns-cache01 q_domain=pulse.sum.im q_src_ip=193.200 q_type=A q_code=NOERROR
2020-11-28T21:44:37+02:00 {} q_host=dns-cache01 q_domain=119.206.150.45.zz.countries.nerd.dk q_src_ip=194.0 q_type=A q_code=NOERROR
2020-11-28T21:44:37+02:00 {} q_host=dns-cache01 q_domain=alt1.gmr-smtp-in.l.google q_src_ip=178.20 q_type=A q_code=NOERROR

But if I write something like this:
root@prometheus01:~/compose-dnstap# time logcli query 'topk(5, sum by (q_domain) (count_over_time({q_hostname="dns-cache01",q_message="CLIENT_QUERY",q_type="A"} | logfmt | q_domain =~ ".*" | q_src_ip =~ ".*" [5m])))' --since=50m --limit 2 -q --stats | jq ..metric.q_domain
Ingester.TotalReached 3
Ingester.TotalChunksMatched 242
Ingester.TotalBatches 0
Ingester.TotalLinesSent 0
Ingester.HeadChunkBytes 161 kB
Ingester.HeadChunkLines 1680
Ingester.DecompressedBytes 238 MB
Ingester.DecompressedLines 2025896
Ingester.CompressedBytes 39 MB
Ingester.TotalDuplicates 0
Store.TotalChunksRef 0
Store.TotalChunksDownloaded 0
Store.ChunksDownloadTime 0s
Store.HeadChunkBytes 0 B
Store.HeadChunkLines 0
Store.DecompressedBytes 0 B
Store.DecompressedLines 0
Store.CompressedBytes 0 B
Store.TotalDuplicates 1011298
Summary.BytesProcessedPerSecond 39 MB
Summary.LinesProcessedPerSecond 330676
Summary.TotalBytesProcessed 238 MB
Summary.TotalLinesProcessed 2027576
Summary.ExecTime 6.13160133s
“alex-car.com.ua”
“alt1.gmr-smtp-in.l.google.coma”
“alt2.gmr-smtp-in.l.google.coma”
“demeter.freehost.com.ua”
“gmr-smtp-in.l.google.cos”
“gh.microsoft”
“myandex.r”
“vido.ua”
“za08.in.ua”
“zabbixcom.ua”

real    0m6.308s
user    0m0.150s
sys     0m0.068s

The most interesting part: my command includes --limit 2 and topk(5), but the query result ignored both limits.

How can I optimize the query so I can build a graph of the top queried hosts?

This kind of query can get expensive, as under the hood it is evaluated the same way Prometheus range queries are evaluated.

Basically this entire query:

topk(5, sum by (q_domain) (count_over_time({q_hostname="dns-cache01",q_message="CLIENT_QUERY",q_type="A"} | logfmt | q_domain =~ ".*" | q_src_ip =~ ".*" [5m])))

Will be performed starting at the start time of the query, then again at start + step, then again at start + step + step, and so on, all the way until the end time. This blog post, I think, helps explain this as well.
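
To make that concrete, here is a rough sketch of the equivalent HTTP range query against Loki's /loki/api/v1/query_range endpoint (the hostname, port and timestamps below are placeholders): with a 50-minute window and step=60, the inner count_over_time is evaluated roughly 50 times for every matching series.

 curl -G -s 'http://localhost:3100/loki/api/v1/query_range' \
   --data-urlencode 'query=sum by (q_domain) (count_over_time({q_hostname="dns-cache01",q_message="CLIENT_QUERY",q_type="A"} | logfmt [5m]))' \
   --data-urlencode 'start=2020-11-28T21:00:00Z' \
   --data-urlencode 'end=2020-11-28T21:50:00Z' \
   --data-urlencode 'step=60' | jq '.data.result | length'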

The reason topk returns more results is the nature of how it works: topk is evaluated at each step, so you can end up with more than 5 series, because the top 5 you are looking for are re-evaluated at every step and may be different at each step. (If each step has a different list of top 5, the output will contain far more than 5 series.) Here is another blog post that talks about this; however, the workaround in that post would be a little tedious for Loki, as I'm not sure off the top of my head whether query_result works on a Loki datasource, so you would have to configure Loki as a Prometheus datasource in Grafana (which is possible, since Loki presents a Prometheus-compatible API; you just need to add a /loki suffix to the URL: http://loki-url/loki).
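
As a sketch of why the /loki suffix is needed (the hostname below is a placeholder): Grafana's Prometheus datasource appends paths like /api/v1/query to the datasource URL, and Loki serves its Prometheus-compatible query endpoints under the /loki prefix, so pointing the datasource at http://<loki-url>/loki makes the two line up. You can check the endpoint directly:

 curl -G -s 'http://<loki-url>:3100/loki/api/v1/query' \
   --data-urlencode 'query=topk(5, sum by (q_domain) (count_over_time({q_hostname="dns-cache01",q_message="CLIENT_QUERY",q_type="A"} | logfmt [50m])))' \
   | jq '.data.result[].metric.q_domain'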

--limit is ignored for metric queries; it only applies to log query results, which is why it has no effect here.

So how can you make this faster? You could explore parallelization with the query-frontend and more queriers.

You could also change the step in your query to a bigger value and match the range [5m] to the step, so that Loki performs fewer iterations of that query.
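
For example, a sketch using logcli's --step flag (assuming your logcli version has it): with a 10m step and a matching [10m] range, the 50m window is evaluated only about 5 times.

 time logcli query 'topk(5, sum by (q_domain) (count_over_time({q_hostname="dns-cache01",q_message="CLIENT_QUERY",q_type="A"} | logfmt | q_domain =~ ".*" | q_src_ip =~ ".*" [10m])))' --since=50m --step=10m -q --stats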

Not sure if your desired visualization supports this, but you could consider doing this as an instant query:

 time logcli instant-query 'topk(5, sum by (q_domain) (count_over_time({q_hostname="dns-cache01",q_message="CLIENT_QUERY",q_type="A"} | logfmt | q_domain =~ ".*" | q_src_ip =~ ".*" [50m])))' -q --stats

Note the other difference here: the range was updated from [5m] to [50m]. This runs one single query over the last 50m and gives you a single result, which is useful for display in a table.
