I used LogCLI to run a 1-hour metric query, and the `--stats` output reports that the query processed 5.0GB of uncompressed bytes.
This doesn’t make sense: my cluster as a whole ingests around 300KB per second, so even if the query pulled every single chunk ingested in the 1-hour range I’m querying, it would only need to process 1.08GB, not to mention that I’m filtering the logs by the filename label.
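For reference, the 1.08GB upper bound is just the ingest rate multiplied by the query range; a quick sanity check (assuming 300KB means 300,000 bytes):

```python
# Upper bound on data a 1-hour query could touch if it pulled
# every chunk ingested in that window (numbers from the post above).
ingest_rate = 300_000      # bytes per second, cluster-wide
window = 3600              # seconds in the 1-hour query range
upper_bound = ingest_rate * window
print(upper_bound / 1e9)   # 1.08 (GB)
```

Processing 5.0GB is roughly 4.6x that bound, which is what makes the stats output suspicious.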
Creating Grafana panels from Loki metric queries is unusable without caching; each query takes forever. I think a good place to start tuning is understanding why each query processes so much data.
I’m using boltdb-shipper as the index store and S3 as the chunk store. I suspected there was some problem with how index files are saved, namely that they might be pointing to larger-than-desired sets of data, so I added a compactor instance in an attempt to prevent duplicate chunks from being loaded by queries. This did help: the same query now reports processing 3.6GB of uncompressed bytes.
Any idea how exactly chunks are loaded by queriers? Or general thoughts on how to decrease query times?
We are running Loki on VMs: 2 query frontends, 3 queriers (with 12 CPUs each), 4 ingesters, 2 distributors, and 1 compactor.
I think I am in the same boat for now. As you mentioned, you are running the Loki cluster on VMs; is that a local build, or a K8s/dockerized environment?
We are running distributed Loki directly on the VMs (executing a binary).
Solving the slow query times took a few steps.
We were saving logs in an on-premise object store with an S3 API. However, we were only getting around 30MB/s of download speed (that is simply what our hardware can deliver), which isn’t fast enough to run metric queries quickly, so we switched to storing data in a Cassandra cluster we deployed. Cassandra keeps its data on disk, which gave us much better performance.
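For anyone wanting to try the same switch, the chunk and index store are selected in Loki’s `schema_config` and `storage_config`; a minimal sketch (the hostnames, keyspace, and date are placeholders, not our real values):

```yaml
schema_config:
  configs:
    - from: 2020-06-10           # placeholder start date for this schema
      store: cassandra           # index in Cassandra
      object_store: cassandra    # chunks in Cassandra too
      schema: v11
      index:
        prefix: loki_index_
        period: 168h

storage_config:
  cassandra:
    addresses: cassandra-1.example.com   # placeholder host(s), comma-separated
    port: 9042
    keyspace: loki
    replication_factor: 2
```

This is only a sketch of the shape of the config; consult the Loki storage documentation for the full set of Cassandra options (auth, consistency, etc.).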
This step relates directly to the question I posted. Loki builds up log data in memory inside chunks, and each chunk belongs to a log stream (a specific combination of labels). At query time Loki pulls the entire log stream, which leads to it processing more data than is relevant to the query. The solution was to use regex in Promtail to extract fields at send time, producing more specific log streams (less data processed) and reducing the processing work needed at query time. The reason we didn’t do this from the start is that Loki recommends using static labels as much as possible, since extracting a field like an IP address can build up a lot of log streams in memory, leading to a big index and cluster-wide long query times. In our use case, however, extracting very specific fields on very specific targets didn’t have a negative effect.
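As a concrete sketch of the field-extraction approach (the label name, regex, and file path here are made up for illustration), a Promtail pipeline can promote a parsed field to a label at send time:

```yaml
scrape_configs:
  - job_name: app
    static_configs:
      - targets: [localhost]
        labels:
          job: app
          __path__: /var/log/app/*.log   # hypothetical path
    pipeline_stages:
      # Parse "level=error ..." style lines; the named capture
      # group becomes an extracted field.
      - regex:
          expression: 'level=(?P<level>\w+)'
      # Promote the extracted field to a stream label.
      - labels:
          level:
```

Each distinct value of `level` then becomes part of the label set, so a query filtering on `level="error"` only pulls the matching stream. The trade-off is more streams in memory and a bigger index, which is why this is only safe for low-cardinality fields.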
If you are a heavy dashboard user and are looking to use Loki mostly for that, I would recommend using the ELK stack instead. Loki can’t compete, at least in my experience, with ELK’s speed when it comes to metric queries.
The reason we went with Loki is that we are already heavily invested in Prometheus, and as a result most of our observability dashboards use metrics from Prometheus, while Loki will mostly be used in the Explore tab in Grafana.
Hope this helps.
Thank you so much for the details.
I have a few points of confusion, if you don’t mind checking them out:
- Per the 2nd point in your answer, does that mean the more labels created in the config file, the more chunks will be created in the stores, since there are more label combinations?
- I am running the local build on a VM as well, following https://github.com/grafana/loki/tree/main/production/docker. The querier, ingester, and distributor are running in the same Loki instance on one VM, but how do you configure the number of each of them, like 3 queriers, 4 ingesters (replication factor?), and 2 distributors?
- Well, we are still in the research stage; ELK will be the alternative if Loki doesn’t perform as expected. Thanks for the suggestion.
Regarding the 2nd point, I deployed a new environment with 2 query-frontend VMs as well. One question though: the 2 frontends are running on different VMs, so how do you set the frontend_address in the config file?
This was actually a choke point for us.
Since we aren’t using K8s, we had to deploy another VM that acts as a load balancer (and a second one for failover); we use HAProxy.
We use one load balancer to distribute traffic from Promtail to the distributors and to the query frontends. You can configure a single HAProxy instance to distribute traffic to two backend groups according to the URL that was requested.
While we use round robin as the load-balancing method for our distributors, we distribute traffic to our frontends in an active-passive fashion (only one is active). No part of the documentation I could find supports this, so we might be wrong, but with round robin between the 2 frontends we found that a single query used only half of the available querier cores; the connected-workers metric on each frontend also showed half of the cluster’s querier cores connected to each frontend. Since we have a fairly small deployment, we decided we want every core working on every query, so we went active-passive. I guess the case is different on larger deployments.
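To make the setup concrete, here is a sketch of what such an HAProxy configuration can look like (addresses, names, and the URL matching are illustrative, not our exact config):

```
frontend loki
    bind *:3100
    # Send query traffic to the query frontends; everything else
    # (e.g. Promtail pushes) goes to the distributors.
    acl is_query path_beg /loki/api/v1/query
    use_backend query_frontends if is_query
    default_backend distributors

backend distributors
    balance roundrobin
    server dist1 10.0.0.11:3100 check
    server dist2 10.0.0.12:3100 check

backend query_frontends
    # Active-passive: qf2 only receives traffic if qf1 is down.
    server qf1 10.0.0.21:3100 check
    server qf2 10.0.0.22:3100 check backup
```

The `backup` keyword is what gives the active-passive behavior for the frontends, while the distributor backend stays round robin.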
Thanks for the reply.
I tried load balancing with nginx for the 2 frontends running on 2 VMs. I did not use downstream_url but frontend_worker instead, since there is a discussion in another topic suggesting this approach can bring more parallelism.
However, the configuration is not as simple as expected. As you know, frontend_worker should point to the gRPC port of the frontend, so what about port 3100 for HTTP queries?
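For what it’s worth, my understanding (the address below is a placeholder) is that port 3100 is only where clients such as Grafana send HTTP queries to the frontend; the queriers themselves dial out to the frontend’s gRPC port via `frontend_worker` in their own config:

```yaml
# Querier-side configuration: each worker connects out to the
# query frontend over gRPC and pulls queued queries from it.
frontend_worker:
  frontend_address: loki-frontend.example.com:9095  # gRPC port; placeholder host
  parallelism: 12   # e.g. one worker per querier CPU
```

So the HTTP load balancer only needs to front port 3100 of the frontends; the gRPC worker connections don’t have to pass through it unless you also want to balance those separately.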
I posted this in another topic, https://community.grafana.com/t/how-to-configure-nginx-for-frontend-worker/52306; any suggestions are highly welcome.
Btw, I set up a memberlist for the ingesters and distributors and configured nginx for them as well; that looks fine for now.