Loki Querier not able to query Ingester?

Hello everybody,

I am currently evaluating Loki as a Replacement for Graylog on Kubernetes but have run into a problem.
When I search Loki for Logs, I can only see the Logs after a delay of up to 2 hours.

In Graylog I don't have this delay.
It is also not a UTC-Issue; the Logs really take that long.
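
To put a number on the delay, one option is to ask Loki for the last hour of a stream and compare the newest returned timestamp with the current time. The following is only a sketch: it assumes the standard `/loki/api/v1/query_range` HTTP endpoint is reachable, and `base_url` is a placeholder for whatever address the gateway or query-frontend has in your cluster.

```python
import json
import time
import urllib.parse
import urllib.request


def newest_entry_age_seconds(response, now_ns):
    """Return the age in seconds of the newest log line in a query_range response.

    Returns None when the response contains no log lines at all.
    """
    timestamps = [
        int(ts)  # Loki returns nanosecond timestamps as strings
        for stream in response["data"]["result"]
        for ts, _line in stream["values"]
    ]
    if not timestamps:
        return None
    return (now_ns - max(timestamps)) / 1e9


def fetch_last_hour(base_url, query):
    """Query the last hour of logs. base_url is a placeholder (assumption)."""
    now_ns = time.time_ns()
    params = urllib.parse.urlencode({
        "query": query,
        "start": now_ns - 3600 * 10**9,
        "end": now_ns,
        "limit": 100,
        "direction": "backward",
    })
    url = f"{base_url}/loki/api/v1/query_range?{params}"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp), now_ns


# Example call (not executed here; the URL is hypothetical):
# response, now_ns = fetch_last_hour("http://loki.example:3100", '{bosh_deployment="bosh"}')
# print(newest_entry_age_seconds(response, now_ns))
```

A result of `None` means no line arrived in the last hour at all, which matches the `returned_lines=0` case described in this post.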

Analysis:
In the Logs I can see that the Query is correctly passed to the Querier by the Query-Frontend and that the number of returned Lines matches what I get in Grafana

  loki level=info ts=2021-05-17T14:27:21.271252651Z caller=metrics.go:91 org_id=fake traceID=7e82bca7d79e8136 latency=fast query="{bosh_deployment=\"bosh\"}" query_type=limited range_type=range length=1h0m1s step=2s duration=16.114392ms status=200 limit=1000 returned_lines=0 throughput=0B total_bytes=0B
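
The interesting fields in that line are `query`, `length`, `status` and `returned_lines`. When there are many such lines, a small parser makes them easier to compare. This is a minimal sketch that leans on `shlex` to handle the quoted `query` value; it is not a complete logfmt implementation.

```python
import shlex


def parse_logfmt(line):
    """Split a logfmt-style line into a dict; tokens without '=' are skipped."""
    fields = {}
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields


# The querier log line from this post, copied verbatim.
line = (
    'loki level=info ts=2021-05-17T14:27:21.271252651Z caller=metrics.go:91 '
    'org_id=fake traceID=7e82bca7d79e8136 latency=fast '
    'query="{bosh_deployment=\\"bosh\\"}" query_type=limited range_type=range '
    'length=1h0m1s step=2s duration=16.114392ms status=200 limit=1000 '
    'returned_lines=0 throughput=0B total_bytes=0B'
)

fields = parse_logfmt(line)
print(fields["query"], fields["length"], fields["status"], fields["returned_lines"])
```

Here the query succeeds (`status=200`) over a one-hour range but returns zero lines, so the querier answers; it simply finds no data for that window.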

In Cassandra, I can see that the Tables for Index and Chunks have been created

  cassandra@cqlsh:loki> describe tables;
  loki_index_2679  loki_chunk_2680  loki_index_2680  loki_chunk_2679

According to the table_manager, this should be right

loki level=info ts=2021-05-17T14:50:27.048330202Z caller=table_manager.go:324 msg="synching tables" expected_tables=4
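
The table numbers can be cross-checked against the `period: 168h` setting in the schema_config. The sketch below assumes the table manager names periodic tables as prefix plus the Unix time divided by the period in seconds; that rule is my understanding of the naming scheme and is not stated anywhere in this thread.

```python
from datetime import datetime, timezone

PERIOD_SECONDS = 168 * 3600  # index/chunks period from schema_config (168h)


def table_name(prefix, when):
    """Assumed naming rule: prefix + floor(unix_seconds / period_seconds)."""
    return f"{prefix}{int(when.timestamp()) // PERIOD_SECONDS}"


# Timestamp of the table_manager log line above.
now = datetime(2021, 5, 17, 14, 50, 27, tzinfo=timezone.utc)
print(table_name("loki_index_", now))
print(table_name("loki_chunk_", now))
```

Under that assumption the active tables at the time of the log line are `loki_index_2680` and `loki_chunk_2680`, which agrees with the `describe tables` output; the `2679` pair is the previous week.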

My Guess would be that the Querier is not able to query the Ingester.
I tested the connections using nc, and everything seems to be fine

loki@logsearch-loki-distributed-querier-0:/$ nc -zv logsearch-loki-distributed-ingester 3100
Connection to logsearch-loki-distributed-ingester 3100 port [tcp/*] succeeded!
loki@logsearch-loki-distributed-querier-0:/$ nc -zv logsearch-loki-distributed-ingester 9095
Connection to logsearch-loki-distributed-ingester 9095 port [tcp/*] succeeded!

I also checked the ring-page, all 3 Ingesters are marked as ACTIVE.

The Setup:
We are using FluentD as a Log-Aggregator which currently sends the Logs to Loki and Graylog using the Copy-Directive.
Loki is deployed based on the Loki-Distributed Helm-Chart ( helm-charts/charts/loki-distributed at main · grafana/helm-charts · GitHub )
Currently we have:
1 x Distributor
1 x Gateway
3 x Ingester
3 x Querier
1 x Query-Frontend
1 x Table-Manager

As Index and Chunk-Storage, Apache Cassandra is used.
The chart is installed using Helm 3, with logsearch as the release name.

The config:

  auth_enabled: false

  server:
    http_listen_port: 3100

  distributor:
    ring:
      kvstore:
        store: memberlist

  memberlist:
    join_members:
      - logsearch-loki-distributed-memberlist

  ingester:
    lifecycler:
      ring:
        kvstore:
          store: memberlist
        replication_factor: 1
    chunk_idle_period: 30m
    chunk_block_size: 262144
    chunk_encoding: snappy
    chunk_retain_period: 1m
    max_transfer_retries: 0

  limits_config:
    enforce_metric_name: false
    reject_old_samples: true
    reject_old_samples_max_age: 168h
    max_cache_freshness_per_query: 10m

  schema_config:
    configs:
    - from: 2020-12-14
      store: cassandra
      object_store: cassandra
      schema: v11
      index:
        prefix: loki_index_
        period: 168h
      chunks:
        prefix: loki_chunk_
        period: 168h

  storage_config:
    filesystem:
      directory: /var/loki/chunks
    cassandra:
      addresses: logsearch-cassandra-headless
      auth: true
      username: foo
      password: bar
      keyspace: loki
      replication_factor: 2
      timeout: 10s
      connect_timeout: 6s

  table_manager:
    retention_deletes_enabled: true
    retention_period: 1w

  querier:
    query_ingesters_within: 2h

  query_range:
    align_queries_with_step: true
    max_retries: 5
    split_queries_by_interval: 15m
    parallelise_shardable_queries: true
    cache_results: true
    results_cache:
      cache:
        enable_fifocache: true
        fifocache:
          max_size_items: 1024
          validity: 24h

  compactor:
    shared_store: filesystem

  frontend:
    downstream_url: http://logsearch-loki-distributed-querier:3100
    log_queries_longer_than: 5s
    compress_responses: true  

So far tested:
Change the grpc-worker in the querier to http. => no success
Scale up the ingester/querier => no success
Remove Graylog from FluentD (Just to be sure) => no success

My main Problem is that I can't see any Problems in the Logs (there are no errors, not even warnings), and I am not really familiar with Loki (yet),
therefore I don't know where to start troubleshooting.

Has anyone encountered this problem and knows how to solve it?

Thanks and best regards
Christoph

Hello Everybody,

We were able to fix the Issue.
The problem was in FluentD, not Loki. We found Error-Logs in FluentD of the following kind

error_class=Net::OpenTimeout error="execution expired"

which indicated a network-problem.
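
For anyone hitting the same symptom: Ruby raises `Net::OpenTimeout` when a connection cannot be opened within the configured open timeout, so the failure happens before any log data is sent. A rough way to reproduce this from the FluentD host is to time a plain TCP connect to the Loki endpoint. This is only a sketch; the host and port in the example call are placeholders, not values from our setup.

```python
import socket
import time


def time_tcp_connect(host, port, timeout=5.0):
    """Return the seconds needed to open a TCP connection, or None on failure or timeout."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:  # covers timeouts, refused connections and DNS errors
        return None


# Example call (not executed here; host and port are hypothetical):
# print(time_tcp_connect("loki-gateway.example", 80))
```

Running the check several times in a row helps, since an intermittent network problem will only show up as an occasional `None` or an unusually slow connect.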

Thanks and best regards
Christoph