Configure gossip on scaled Grafana Loki on different VMs running Docker

Hi

I am running Loki as a single binary on 2 VMs with the exact same configuration, with a load balancer on top to balance incoming data. Receiving data is not a problem at all, but each VM caches its data on its own disk, and when you explore in Grafana you get different results. Logs 1, 3, 5, 7, 9 are stored on loki-0 and logs 2, 4, 6, 8, 10 are stored on loki-1, and that is also what you see when you query: you don't get synced logs, only the logs from one cache.

Yes, we will store the data on Azure Blob Storage, but unfortunately that takes time to sync. (I don't know why it takes so long, more than 45 minutes to an hour to flush. Can this be configured somewhere?)

I have tried to configure gossip by adding:

memberlist:
  join_members:
    - loki:7946
common:
  replication_factor: 1
  ring:
    kvstore:
      store: memberlist
  storage:
    azure:
      account_name: 
      account_key: 
      container_name: loki

I can see on my /ring page that there is 1 instance representing loki-0. If I refresh, I see another instance which represents loki-1, which feels weird. I thought I should see both instances here, but it seems like both Loki instances created their own ring. Maybe I am totally wrong?

Anyway, is there a way forward, and are there more guides on configuring two Grafana Loki VMs? I have only seen this working with docker-compose, so I would appreciate guidance for plain VMs. In my case I have many VMs running Docker, which maybe is part of the problem?

Hope someone can point me in the right direction!

BR
/Dimitrios

  1. Your ring membership needs to include all members. You either include two DNS records there (one for each instance), or you use one DNS record that resolves to both addresses (see the sketch after this list). If you expect to add more instances in the future, you might want to look into auto discovery.

  2. Loki flushing is controlled by several factors, such as chunk_block_size, chunk_idle_period, and max_chunk_age. See the Grafana Loki configuration parameters documentation.
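
A minimal sketch of what that memberlist section could look like; the hostnames are placeholders and need to match your own DNS setup:

memberlist:
  bind_port: 7946            # default gossip port
  join_members:
    # either one entry per instance:
    # - loki-0.example.internal:7946
    # - loki-1.example.internal:7946
    # or a single DNS record that resolves to both VMs:
    - loki.example.internal:7946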

I have a DNS record with 2 values, representing in this case the 2 Loki instances. We may have more values later depending on how we want to scale, but all Loki instances shall be in the same DNS record.

The DNS record is called "loki", so I have tried the above as well as loki.exampleurl.com:7946 and http://loki.exampleurl.com:7946, but nothing seems to work. Gossip doesn't seem to be able to communicate across the VMs. Port 7946 is exposed when docker run is executed, so I don't know whether the VM blocks the traffic or the DNS is the problem.

Any thoughts? Everything is running on Azure.

I also tried to hardcode the IP addresses and added several records to the list. That did not work either, but maybe that is totally the wrong thing to do.
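
One thing that can trip this up with Docker on separate VMs is the address each instance advertises to its gossip peers: by default it may advertise the container-internal IP, which the other VM cannot reach. A hedged sketch of the relevant memberlist settings (the advertise address is a placeholder for the VM's own routable IP, and whether this is needed depends on your Docker network mode):

memberlist:
  bind_port: 7946
  # advertise the VM's routable address instead of the container-internal one (placeholder IP)
  advertise_addr: 10.0.0.4
  advertise_port: 7946
  join_members:
    - loki.exampleurl.com:7946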

You mentioned the gossip port can't communicate across the two VMs; perhaps start your investigation there. Check firewall rules, listening ports, etc.

I found the issue with the port and solved it. Now the port is open and the connection is OK.

I ran the same configuration as above, using 1 record with many values.
It looks like they try to merge their rings, but they first end up as pending and then unhealthy.
Is there any CLI or are there log files for the hash rings? I have only found some minor logs in /var/log/syslog, but not enough to understand the problem.

Any idea how to proceed?
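
Not aware of a dedicated CLI, but the ring state is visible on each instance's /ring page, and you can raise Loki's own log verbosity so the memberlist activity shows up in the container logs rather than only in syslog. A small sketch of that setting (assuming the standard server block):

server:
  # debug makes memberlist join/ping activity visible in Loki's own log output
  log_level: debug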

If you are sure all the connections are working and configurations are correct, perhaps try restarting all containers / pods.

Already tried restarting them.

It seems like the Loki instances successfully join the ring, but after a while they become unhealthy.

Logs:
caller=memberlist_logger.go:74 level=debug msg="Initiating push/pull sync with: 10.xx.y.zzz:7946"
caller=memberlist_logger.go:74 level=debug msg="Initiating push/pull sync with: 10.xx.z.yyy:7946"

caller=memberlist_logger.go:74 level=debug msg="Failed UDP ping: 424c9546451-54f027e3 (timeout reached)"
caller=memberlist_logger.go:74 level=debug msg="Stream connection from=127.0.0.1:47282"
caller=memberlist_logger.go:74 level=warn msg="Got ping for unexpected node 424c9546451-54f027e3 from=127.0.0.1:47282"
caller=memberlist_logger.go:74 level=error msg="Failed fallback TCP ping: EOF"
caller=memberlist_logger.go:74 level=info msg="Suspect 424c9546451-54f027e3 has failed, no acks received"

Any ideas where the problem could be?

msg="Failed UDP ping: 424c9546451-54f027e3 and msg="Got ping for unexpected node jump out at me. I’d double check all writer and readers have connectivity to one another, and whatever you configure for memberlist can be discovered to all nodes.

I think the issue is that memberlist is using UDP multicast for health checks, which is not allowed within my network. As I understand it, memberlist uses TCP for streaming data and UDP for health checks. Is there a way to configure memberlist to only use TCP?

Perhaps someone more knowledgeable can comment on this, but I am fairly certain it falls back to TCP if the UDP ping doesn't work. I know for certain we don't do UDP within our network either.

I’d recommend you to double check your network connectivity and your service discovery.

I have also tried to configure more frequent flushes by lowering the parameters you suggested:

chunk_block_size, chunk_idle_period, and max_chunk_age.

But is there no force flag to always flush every minute or so, or even faster?
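
I'm not aware of a simple force-flush-every-minute flag, but for reference these parameters live under the ingester block; a hedged sketch with lowered values (the numbers are arbitrary examples, and as the reply below notes, very aggressive flushing is usually not what you want):

ingester:
  chunk_idle_period: 5m       # flush a chunk that has received no new logs for this long
  max_chunk_age: 30m          # flush a chunk once it reaches this age, even if still receiving logs
  chunk_target_size: 1572864  # targeted compressed chunk size in bytes before a chunk is cut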

Another question: is there a read-from-remote-storage configuration that I need to set, so that every query always queries both the cache and remote storage? I see that all data is flushed to remote storage, but only the Loki instance that flushed most recently sees all the data, since the other instances have already flushed and read from remote storage.
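
On the read path, the querier block has a setting that controls how far back queries also go to the ingesters' in-memory chunks versus only the object store; a hedged sketch (assuming a recent Loki 2.x single-binary setup, and the value shown is just an example):

querier:
  # queries for data newer than this window are also sent to the ingesters,
  # so chunks that have not been flushed yet are still searchable
  query_ingesters_within: 3h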

  1. You generally want fewer, larger chunks, so flushing quickly (like every minute) doesn't make sense. This is especially true when using object storage as the backend.

  2. All your ingesters should see all data, including each other's cached and in-memory chunks. If some of your ingesters can't see some data, then they aren't communicating with each other.

We are using Azure, where UDP multicast is not supported. I still have not been able to switch to TCP for the health checks. Any ideas from somebody else on how to configure memberlist to do all health checks over TCP?

I switched from using memberlist to Consul, which works perfectly (TCP only).
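
For anyone hitting the same wall, a minimal sketch of what that switch looks like, pointing the common ring at a Consul agent instead of memberlist (the Consul address is a placeholder and assumes Consul is already running and reachable from every Loki VM):

common:
  replication_factor: 1
  ring:
    kvstore:
      store: consul
      consul:
        # Consul agent/server reachable from all Loki instances (placeholder address)
        host: consul.example.internal:8500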