Log lines delayed by about 20 minutes

Hello everybody!

I have Loki set up with 6 instances (3x target=read, 3x target=write) in a Kubernetes cluster. Fluent Bit sends service logs to Loki, and the data is stored in a MinIO cluster. Grafana, Loki and MinIO all run in the same k8s cluster; logs are shipped from that cluster and from one additional cluster.

My current problem is that those logs appear in the Grafana UI with a 20-minute delay. It’s not a problem with timestamps. I also don’t see any error messages from Fluent Bit or the Loki ingesters.

It’s simply that when I query logs in the Grafana UI, only logs that are about 20 minutes old (or older) are shown.
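To put a number on a delay like this, one option is to ask Loki directly instead of going through Grafana. The following is a minimal sketch, assuming Loki's `/loki/api/v1/query_range` HTTP endpoint is reachable; the base URL, the LogQL selector and the helper names are hypothetical placeholders, not taken from this setup.

```python
import json
import time
import urllib.parse
import urllib.request
from typing import Optional


def newest_entry_lag_seconds(response: dict, now_ns: int) -> Optional[float]:
    """Return the age in seconds of the newest entry in a query_range response.

    Returns None when the response contains no log entries at all.
    """
    newest_ns = None
    for stream in response.get("data", {}).get("result", []):
        # Each value is a pair: [timestamp in nanoseconds as a string, log line].
        for ts_ns, _line in stream.get("values", []):
            ts = int(ts_ns)
            if newest_ns is None or ts > newest_ns:
                newest_ns = ts
    if newest_ns is None:
        return None
    return (now_ns - newest_ns) / 1e9


def fetch_recent(base_url: str, logql: str, lookback_s: int = 3600) -> dict:
    """Query the last `lookback_s` seconds, newest entries first."""
    now_ns = time.time_ns()
    params = urllib.parse.urlencode({
        "query": logql,
        "start": now_ns - lookback_s * 1_000_000_000,
        "end": now_ns,
        "limit": 100,
        "direction": "backward",
    })
    url = f"{base_url}/loki/api/v1/query_range?{params}"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


# Hypothetical usage (URL and selector are assumptions):
# resp = fetch_recent("http://loki-read:3100", '{namespace="default"}')
# print(newest_entry_lag_seconds(resp, time.time_ns()))
```

A result that stays near 1200 seconds while the shippers report no errors would match the symptom described here.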

My guess is that the membership ring isn’t working properly and Grafana only shows logs that have already been persisted. But I can’t find a way to debug this.

Hints or tips would be much appreciated!

regards
sebastian

  1. What do you see if you curl the /ring endpoint on both a reader and a writer?

  2. Try to get a shell in one of the reader containers and see if you can reach one of the writers, and vice versa.

  3. Check your configuration for query_ingesters_within.
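The first check can also be scripted. The sketch below fetches the /ring status page and reports which expected instance names appear in it. It is only an illustration: the base URL and instance names are assumptions, and it uses a plain substring search rather than parsing the page, so treat the result as a hint.

```python
import urllib.request
from typing import Dict, List


def fetch_ring_page(base_url: str) -> str:
    """Download the ring status page from a Loki instance."""
    with urllib.request.urlopen(f"{base_url}/ring", timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def members_listed(page: str, expected: List[str]) -> Dict[str, bool]:
    """Report, per expected instance name, whether it occurs in the page.

    Substring match only: "loki-write-1" would also match "loki-write-10".
    """
    return {name: name in page for name in expected}


# Hypothetical usage (service name and port are assumptions):
# page = fetch_ring_page("http://loki-write:3100")
# print(members_listed(page, ["loki-write-0", "loki-write-1", "loki-write-2"]))
```

Running it once against a reader and once against a writer shows whether both see the same set of members.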

Hey Tony,
sorry for the delay and thanks for your support.

I do see all 6 instances (3x read plus 3x write):

Well, I can successfully ping read and write instances, which means DNS and service config should be fine.

I can’t actually wget loki-write/ring, which might be a hint. I’ll check on this and come back with additional information (I had no tool other than wget at hand within the loki-read container).

I’m using the default value of query_ingesters_within (3h), and I don’t think that option is the problem, since it only stops queries for data older than this value from being sent to the ingesters. That part works fine: I can see “old” log lines, but I fail to see the most recent ones. I do understand your question, though, as a very low value like 1m would explain the behaviour I’m experiencing.
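As a rough illustration of what query_ingesters_within controls, here is a simplified model, not Loki's actual code: a query is sent to the ingesters only if its end time falls inside the lookback window, and a value of 0 means every query is sent to the ingesters. The function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def should_query_ingesters(
    query_end: datetime,
    now: datetime,
    query_ingesters_within: timedelta = timedelta(hours=3),
) -> bool:
    """Simplified model of the query_ingesters_within cutoff."""
    if query_ingesters_within == timedelta(0):
        # 0 disables the cutoff: ingesters are always queried.
        return True
    # Ingesters are only consulted when the query reaches into the window.
    return query_end >= now - query_ingesters_within


now = datetime(2023, 1, 1, 12, 0, tzinfo=timezone.utc)
print(should_query_ingesters(now, now))                       # query ending now
print(should_query_ingesters(now - timedelta(hours=4), now))  # query ending 4h ago
```

With the default of 3h, a query for "now" is always eligible for the ingesters, so missing recent lines point at the querier failing to reach them rather than at this setting.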

One additional piece of information that might interfere here: although the instances are named “loki-read-X”, they are running with target=all, because I can’t connect my Grafana to the read instances when I start them with target=read (I opened an issue here). It does work when I start them with target=all.

I’ll dig into 2) and come back later.

regards
sebastian

One more question: Could it be that I don’t see the most recent logs because of some kind of backpressure?

I mean, maybe the write path isn’t able to keep up with ingesting the log lines.

Okay, I ran another test:

I went back to a monolithic deployment, which works just fine. No delay whatsoever! What did I do?

  • I used the same StatefulSet, but with target=all and replicas: 1
  • switched the ring: config to inmemory
  • bound the write-path ingress to this single instance (i.e. I still have 2 distinct ingressroutes)

I ran more tests:

  • I can scale to 3 (and possibly beyond that) target=all instances and everything’s fine
  • As soon as I introduce a separate target=write instance, things go wonky

From this I gather that (as someone already mentioned elsewhere) read instances and write instances are somehow not communicating correctly with each other. Or am I mistaken?

Does that help the initiated Grafana folks?

So, your problem comes from mixing the scalable deployment with the monolithic deployment. When you query /ring on a writer you should see three members only.

This also hinted at the problem. Because you can see old logs but not new logs, it implies that the communication from reader to writer, which is governed by the query_ingesters_within configuration, is not working. This is also why it worked in monolithic mode: you weren’t running into the communication issue anymore.

I would recommend that you re-run the scalable deployment and see if you can figure out why you can’t connect from reader to writer (ping isn’t good enough, use telnet).
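If telnet isn’t available in the container, a plain TCP connect does the same job. The sketch below assumes it runs from a pod that has Python; the hostname loki-write and the ports are assumptions (3100 for HTTP, 9095 for gRPC and 7946 for memberlist are commonly used values, so check your own configuration).

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS resolution failures.
        return False


# Hypothetical usage from a reader pod (host and ports are assumptions):
# for port in (3100, 9095, 7946):
#     print(port, can_connect("loki-write", port))
```

Unlike ping, this exercises the actual port, so a service that resolves in DNS but doesn’t expose the gRPC port shows up as a failure.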

I also read the issue you posted. I suspect it’s most likely a configuration issue: we used to connect directly from Grafana to the querier without going through the frontend, and I don’t remember having that problem. I will have to test the helm chart, however, since we don’t actually use it for our deployment.

By this you mean that if I call /ring I should only see read instances XOR write instances? I was under the assumption that there is only one ring, where all services communicate and synchronize on short-term things. If I understand you correctly, you are saying there are actually 2 rings, one for the read instances and one for the write instances?

Interesting. I’m also not using the helm charts, but I used the “result” of the helm chart creation command to model the deployment. Okay, I’ll backpedal and double-check everything now.

Thanks a lot
sebastian

Actually no. The reader has no ring membership at all (as far as I know). When you query /ring, whether you are doing that against reader or writer, all you should get is the writers. The distribution of read traffic is facilitated by Query Frontends.

The reason you saw three “readers” in your screenshot was because they were actually configured to be writers as well (because of -target=all).
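The point above can be written down as a tiny model. This is only a sketch of the explanation given in this thread, not Loki’s own logic: it assumes that instances started with target=write or target=all register in the ring, and that target=read instances do not. The instance names are the ones used in this thread.

```python
from typing import Dict, List

# Targets that, per the explanation above, act as writers and join the ring.
WRITE_PATH_TARGETS = {"write", "all"}


def expected_ring_members(instances: Dict[str, str]) -> List[str]:
    """Return the instance names expected to show up on /ring, sorted."""
    return sorted(
        name for name, target in instances.items()
        if target in WRITE_PATH_TARGETS
    )


# The setup described earlier: "read" pods actually started with target=all.
mixed = {f"loki-read-{i}": "all" for i in range(3)}
mixed.update({f"loki-write-{i}": "write" for i in range(3)})
print(len(expected_ring_members(mixed)))  # 6

# The intended setup: readers started with target=read.
intended = {f"loki-read-{i}": "read" for i in range(3)}
intended.update({f"loki-write-{i}": "write" for i in range(3)})
print(expected_ring_members(intended))
```

Seeing six members on /ring is therefore consistent with the readers running as target=all.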

Hey all!

I finally got everything working. The really good hint was to sort out my k8s services so that the write and read instances work as expected.

So I marked that reply as a solution.

I have a working deployment and will figure out all the rest now.

@tonyswumac many thanks for your support!

cheers
Sebastian
