Hi Loki gurus,
I have set up a Loki instance together with Grafana and Promtail. For future scalability, I would like to scale out the Loki distributor and ingester (they currently run in a single process). Is there any guide for this scaling procedure? I cannot use the cloud for compliance reasons.
Sure. This is a little more involved. For starters, Loki’s architecture is similar to Cortex’s; that means some operational settings for Cortex also apply to Loki.
That said, Loki’s documentation on scaling needs quite some work.
May I ask what platform you are using to run Loki? There’s a Docker Compose file for development we can work from, and there are ksonnet definitions in the production folder of the repository that could be used with Tanka.
Thank you for the response.
I am running Loki on a Red Hat server as a local build rather than on K8s or Docker (that might be the next stage, I think); the Promtail agent is deployed on other Linux servers for log collection.
It would be much appreciated if you could offer any guidance on how to configure scaling for the Loki components.
Thanks in advance.
Since you are using Kubernetes, I would recommend installing via Tanka. This should give you some replication of the distributor and ingester. It probably helps to look at the config source. Let me know if you are missing any details so that I can improve the docs.
Thanks for the link.
I am afraid I caused some confusion here: what I meant is that we are using a local deployment on a Red Hat server, NOT K8s or Docker.
Thus, I am looking for a guide on how to scale out the distributor and ingester in such an environment.
Thanks and apologies for any misunderstanding.
Ah. In that case I would take a look at how the services are launched with Docker compose: loki/production/docker at main · grafana/loki · GitHub.
Interestingly, it uses a different pattern than the Tanka deployment.
Though it is written for Docker Compose, I think it should work in a local build as well.
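The gist of that setup, as I understand it (a sketch only, not something I have tested on bare metal): all instances share one config file and discover each other through a memberlist gossip ring. Hostnames and ports below are placeholders, not values from the repository:

```yaml
# Shared loki-config.yaml for every node; loki-1/loki-2 are placeholder hostnames.
auth_enabled: false

server:
  http_listen_port: 3100
  grpc_listen_port: 9095

memberlist:
  join_members:           # each instance gossips with the others on port 7946
    - loki-1:7946
    - loki-2:7946

ingester:
  lifecycler:
    ring:
      kvstore:
        store: memberlist # ring state is shared via memberlist, no etcd/Consul needed
      replication_factor: 2
```

Each binary can also be restricted to a single role with the `-target` flag (e.g. `loki -config.file=loki-config.yaml -target=distributor`), which is how the microservices deployments split the distributor, ingester, and querier into separate processes.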
I will check it out and keep you posted.
The configuration is great, except that it runs on a single node. However, I finally spun up a second Loki instance on another server and joined it to the ring with the first one, so it seems I made some progress here.
But when I got up this morning, I noticed the newly joined Loki instance had failed to start up with this error:
Aug 14 10:39:31 centos8 loki-linux-amd64: level=error ts=2021-08-14T02:39:31.745179104Z caller=table.go:147 msg="failed to open existing boltdb file /tmp/loki/boltdb-shipper-cache/index_18852/centos8-2-1628876868504294416-1628895600.gz, continuing without it to let the sync operation catch up" err=timeout
Aug 14 10:39:36 centos8 loki-linux-amd64: level=error ts=2021-08-14T02:39:36.746630285Z caller=table.go:147 msg="failed to open existing boltdb file /tmp/loki/boltdb-shipper-cache/index_18852/centos8-2-1628876868504294416-1628896500.gz, continuing without it to let the sync operation catch up" err=timeout
Aug 14 10:39:41 centos8 loki-linux-amd64: level=error ts=2021-08-14T02:39:41.704069572Z caller=table.go:147 msg="failed to open existing boltdb file /tmp/loki/boltdb-shipper-cache/index_18852/centos8-2-1628876868504294416-1628898300.gz, continuing without it to let the sync operation catch up" err=timeout
Aug 14 10:39:46 centos8 loki-linux-amd64: level=error ts=2021-08-14T02:39:46.666277276Z caller=table.go:147 msg="failed to open existing boltdb file /tmp/loki/boltdb-shipper-cache/index_18852/centos8-2-1628876868504294416-1628899200.gz, continuing without it to let the sync operation catch up" err=timeout
Aug 14 10:39:51 centos8 loki-linux-amd64: level=error ts=2021-08-14T02:39:51.625717928Z caller=table.go:147 msg="failed to open existing boltdb file /tmp/loki/boltdb-shipper-cache/index_18852/centos8-2-1628876868504294416-1628901900.gz, continuing without it to let the sync operation catch up" err=timeout
Aug 14 10:39:56 centos8 loki-linux-amd64: level=error ts=2021-08-14T02:39:56.610722528Z caller=table.go:147 msg="failed to open existing boltdb file /tmp/loki/boltdb-shipper-cache/index_18852/centos8-2-1628876868504294416-1628902800.gz, continuing without it to let the sync operation catch up" err=timeout
Aug 14 10:39:56 centos8 loki-linux-amd64: level=info ts=2021-08-14T02:39:56.624376964Z caller=table.go:440 msg="downloading object from storage with key index_18852/centos8-2-1628876868504294416-1628880738.gz"
Aug 14 10:39:56 centos8 loki-linux-amd64: level=info ts=2021-08-14T02:39:56.627450182Z caller=util.go:55 msg="downloaded file index_18852/centos8-2-1628876868504294416-1628880738.gz"
Aug 14 10:40:01 centos8 loki-linux-amd64: level=error ts=2021-08-14T02:40:01.606912862Z caller=log.go:106 msg="error running loki" err="timeout\nerror creating index client\ngithub.com/cortexproject/cortex/pkg/chunk/storage.NewStore\n\t/src/loki/vendor/github.com/cortexproject/cortex/pkg/chunk/storage/factory.go:179\ngithub.com/grafana/loki/pkg/loki.(*Loki).initStore\n\t/src/loki/pkg/loki/modules.go:319\ngithub.com/cortexproject/cortex/pkg/util/modules.(*Manager).initModule\n\t/src/loki/vendor/github.com/cortexproject/cortex/pkg/util/modules/modules.go:103\ngithub.com/cortexproject/cortex/pkg/util/modules.(*Manager).InitModuleServices\n\t/src/loki/vendor/github.com/cortexproject/cortex/pkg/util/modules/modules.go:75\ngithub.com/grafana/loki/pkg/loki.(*Loki).Run\n\t/src/loki/pkg/loki/loki.go:220\nmain.main\n\t/src/loki/cmd/loki/main.go:132\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:204\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1374\nerror initialising module: store\ngithub.com/cortexproject/cortex/pkg/util/modules.(*Manager).initModule\n\t/src/loki/vendor/github.com/cortexproject/cortex/pkg/util/modules/modules.go:105\ngithub.com/cortexproject/cortex/pkg/util/modules.(*Manager).InitModuleServices\n\t/src/loki/vendor/github.com/cortexproject/cortex/pkg/util/modules/modules.go:75\ngithub.com/grafana/loki/pkg/loki.(*Loki).Run\n\t/src/loki/pkg/loki/loki.go:220\nmain.main\n\t/src/loki/cmd/loki/main.go:132\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:204\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1374"
As I mentioned previously, this is a local configuration rather than Docker or K8s. I used NFS as the store for both index and chunks; the mount point is “/tmp/loki”, and the detailed NFS config is:
[root@centos8-3 loki]# exportfs -v
I noticed a similar issue at “Loki failing with msg="error running loki" err="timeout\nerror creating index client - githubmemory”, but there is no solution yet.
Please check whether there is anything that can be done; thanks in advance.
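For reference, my storage section looks roughly like this (reconstructed from memory, so the exact values may differ):

```yaml
storage_config:
  boltdb_shipper:
    active_index_directory: /tmp/loki/boltdb-shipper-active
    cache_location: /tmp/loki/boltdb-shipper-cache  # where the failing .gz files live
    shared_store: filesystem
  filesystem:
    directory: /tmp/loki/chunks
```

Everything under /tmp/loki sits on the NFS mount. I am starting to wonder whether cache_location and active_index_directory should instead be on node-local disk, with only the filesystem store shared, since both instances opening the same boltdb files over NFS could explain the lock timeouts; that is only my guess, though.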
I’m in a similar boat. I started out with a local, single-node install, then created a few query frontends in K8s to try to spread the load. Even with this, the queriers hit the single node during queries and max out 12 CPUs. Small query windows of up to 6 hours return fairly quickly; anything much larger starts taking 30+ seconds to return, or times out.
Since I’m fairly early in my build, I think it would be beneficial to roll out a properly distributed setup. Is there a good way to expand a single-node local install, or should I scrap it completely?
Good to see more people are in this discussion.
I followed the guide mentioned in Karsten’s reply; for now, the distributors are on two different servers and the ingesters are configured with 2 replicas (two Loki instances running as a cluster). However, I have not set up the query frontend yet, since this is only a dev/test environment; maybe I will try that later with more log files.
Is there any way I could test the performance? Please guide me if you have any suggestions.
Btw, I am using NFS shared storage for both index and chunks; maybe I will try Cassandra later.
I didn’t mean that you should use Docker Compose, but that you check out how the Loki instances are launched there.
We need to do a better job of documenting scaling. I’m fairly new to Loki as well. Let me ask the team why the Tanka and Docker Compose setups are different and get back to you.
Thanks for this.
Just FYI: once I followed the layout from the Docker Compose setup in my distributed environment, it actually worked.
So I took the distributed Helm chart and deployed it with a few tweaks (1 compactor, 1 distributor, 2 gateways, 3 ingesters, 3 queriers, 3 query frontends, 1 table manager, 1 memcached_chunks, 1 memcached_frontend, 1 memcached_index_queries). Once queries span more than a few hours, the response time slows way down, especially when I have more than a few graphs on a dashboard. I’m using unified S3 storage with the boltdb shipper, and the containers are on-prem, so I assume it’s because the chunks need to be pulled down from S3. I also have persistent storage for caching.
The BoltDB shipper cache TTL is 7 days.
I also have a query-range results cache with a validity period of 24 hours and a max size of 1024 items.
I’m sure there’s a lot of tweaking that can be done.
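Concretely, the cache-related parts of my config look roughly like this (trimmed; assuming the in-memory fifocache is what backs the 1024-item / 24-hour numbers above):

```yaml
query_range:
  cache_results: true
  results_cache:
    cache:
      enable_fifocache: true
      fifocache:
        max_size_items: 1024   # max cached query results
        validity: 24h          # results cache validity period

storage_config:
  boltdb_shipper:
    cache_ttl: 168h            # keep downloaded index files for 7 days
```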
Thanks for sharing. Your configuration seems far more complicated than mine; I use local NFS storage for the index and chunks, so latency is not an issue for now.
I am not very familiar with Helm configuration; could you tell me more about it? For those components (1 distributor, 3 ingesters, 3 queriers, 3 query frontends), is each one running in a separate pod? Since I am using a local build, the distributor, ingester, and querier all run inside a single Loki process.
So I started with the Loki distributed Helm chart for Kubernetes (helm-charts/charts/loki-distributed at main · grafana/helm-charts · GitHub). We have tons of clusters. This chart allows a distributed install, where each component runs in its own container and you can run multiple pods of each as needed. That lets you scale out any component in seconds, depending on need.
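The tweaks I mentioned boil down to a values override roughly like this (key names as I recall them from the loki-distributed chart; they may differ between chart versions):

```yaml
# Hypothetical values.yaml override for the loki-distributed chart
compactor:
  enabled: true
distributor:
  replicas: 1
gateway:
  replicas: 2
ingester:
  replicas: 3
querier:
  replicas: 3
queryFrontend:
  replicas: 3
tableManager:
  enabled: true
```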
It turns out that the example config files are essentially the bare minimum. You have to follow the configuration documentation and add quite a lot to make things like memcached work. I have now enabled it, at least for the frontend, but it’s timing out when connecting to the memcached instance.
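The frontend cache wiring I added looks roughly like this (the host is a placeholder for my in-cluster memcached service), in case anyone can spot why the connection times out:

```yaml
query_range:
  cache_results: true
  results_cache:
    cache:
      memcached_client:
        host: memcached-frontend.loki.svc.cluster.local  # placeholder DNS name
        service: memcache    # port name of the memcached service
        timeout: 500ms
```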
For what it’s worth I started an issue to document scaling Loki.
I’ve also talked to the team to clear up the difference between the Tanka and Docker Compose setups. The latter was a contribution by another user. We suggest scaling out Loki the way it’s done with Tanka. That said, the Tanka setup is not easy to grok, i.e. to replicate with other tools, which is why I hope to find some time soon to write proper documentation.
Thanks, Karsten, for the update; I am looking forward to that. Btw, is there any chance of a doc about scaling for non-Tanka/non-Docker users?
That’s my intention, or at least to document each component and how it’s run. I hope I can allocate some time to do so.
Thanks Karsten, looking forward to that.
Thanks Max for the details.
Unfortunately, we cannot use K8s or Docker for now, only a pure local build, so I will have to work out the infrastructure myself.