Hello everybody,
I am currently evaluating Loki as a replacement for Graylog on Kubernetes, but I have run into a problem.
When I search Loki for logs, I can only see them after a delay of up to 2 hours.
In Graylog I don't have this delay.
It is also not a UTC issue; the logs really do take that long to show up.
Analysis:
In the logs I can see that the query is correctly passed from the query-frontend to the querier, and that the returned lines match what I get in Grafana:
loki level=info ts=2021-05-17T14:27:21.271252651Z caller=metrics.go:91 org_id=fake traceID=7e82bca7d79e8136 latency=fast query="{bosh_deployment=\"bosh\"}" query_type=limited range_type=range length=1h0m1s step=2s duration=16.114392ms status=200 limit=1000 returned_lines=0 throughput=0B total_bytes=0B
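For reference, this is roughly how I hit the querier's HTTP API directly to take Grafana and the query-frontend out of the picture (just a sketch; the time range is only a placeholder):
curl -G -s "http://logsearch-loki-distributed-querier:3100/loki/api/v1/query_range" \
  --data-urlencode 'query={bosh_deployment="bosh"}' \
  --data-urlencode 'start=2021-05-17T13:30:00Z' \
  --data-urlencode 'end=2021-05-17T14:30:00Z' \
  --data-urlencode 'limit=100'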
In Cassandra, I can see that the tables for the index and chunks have been created:
cassandra@cqlsh:loki> describe tables;
loki_index_2679 loki_chunk_2680 loki_index_2680 loki_chunk_2679
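To spot-check whether recent data is actually landing in Cassandra, the current tables can be sampled from cqlsh (sketch; the suffixes are the ones for the current weekly period):
cassandra@cqlsh:loki> SELECT * FROM loki_index_2680 LIMIT 5;
cassandra@cqlsh:loki> SELECT * FROM loki_chunk_2680 LIMIT 5;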
According to the table-manager, this looks right:
loki level=info ts=2021-05-17T14:50:27.048330202Z caller=table_manager.go:324 msg="synching tables" expected_tables=4
My guess would be that the querier is not able to query the ingesters.
I tested the connections using nc; everything seems fine:
loki@logsearch-loki-distributed-querier-0:/$ nc -zv logsearch-loki-distributed-ingester 3100
Connection to logsearch-loki-distributed-ingester 3100 port [tcp/*] succeeded!
loki@logsearch-loki-distributed-querier-0:/$ nc -zv logsearch-loki-distributed-ingester 9095
Connection to logsearch-loki-distributed-ingester 9095 port [tcp/*] succeeded!
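The same can be done against the ingester's HTTP endpoints that every Loki component exposes (sketch of the kind of check I mean; the metric name is the standard in-memory chunk gauge):
loki@logsearch-loki-distributed-querier-0:/$ curl -s http://logsearch-loki-distributed-ingester:3100/ready
loki@logsearch-loki-distributed-querier-0:/$ curl -s http://logsearch-loki-distributed-ingester:3100/metrics | grep loki_ingester_memory_chunks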
I also checked the ring page; all 3 ingesters are marked as ACTIVE.
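(I looked at the ring via a port-forward to the distributor, roughly like this; the service name is the one I assume the chart generates for my release:)
kubectl port-forward svc/logsearch-loki-distributed-distributor 3100:3100
curl -s http://localhost:3100/ring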
The Setup:
We are using Fluentd as a log aggregator; it currently sends the logs to both Loki and Graylog using the copy directive.
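Stripped down, the output section looks roughly like this (hostnames and plugin options reduced to the essentials, not our exact config):
<match **>
  @type copy
  <store>
    @type loki                     # fluent-plugin-grafana-loki
    url "http://logsearch-loki-distributed-gateway"
    extra_labels {"bosh_deployment":"bosh"}
  </store>
  <store>
    @type gelf                     # existing Graylog output, unchanged
    host graylog.example.com
    port 12201
  </store>
</match>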
Loki is deployed based on the loki-distributed Helm chart (helm-charts/charts/loki-distributed at main · grafana/helm-charts · GitHub).
Currently we have:
1 x Distributor
1 x Gateway
3 x Ingester
3 x Querier
1 x Query-Frontend
1 x Table-Manager
Apache Cassandra is used as the index and chunk storage.
The chart is installed with Helm 3, using logsearch as the release name.
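Roughly, the install looks like this (namespace and values file name are just placeholders):
helm repo add grafana https://grafana.github.io/helm-charts
helm upgrade --install logsearch grafana/loki-distributed -n logging -f loki-values.yaml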
The config:
auth_enabled: false
server:
  http_listen_port: 3100
distributor:
  ring:
    kvstore:
      store: memberlist
memberlist:
  join_members:
    - logsearch-loki-distributed-memberlist
ingester:
  lifecycler:
    ring:
      kvstore:
        store: memberlist
      replication_factor: 1
  chunk_idle_period: 30m
  chunk_block_size: 262144
  chunk_encoding: snappy
  chunk_retain_period: 1m
  max_transfer_retries: 0
limits_config:
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 168h
  max_cache_freshness_per_query: 10m
schema_config:
  configs:
    - from: 2020-12-14
      store: cassandra
      object_store: cassandra
      schema: v11
      index:
        prefix: loki_index_
        period: 168h
      chunks:
        prefix: loki_chunk_
        period: 168h
storage_config:
  filesystem:
    directory: /var/loki/chunks
  cassandra:
    addresses: logsearch-cassandra-headless
    auth: true
    username: foo
    password: bar
    keyspace: loki
    replication_factor: 2
    timeout: 10s
    connect_timeout: 6s
table_manager:
  retention_deletes_enabled: true
  retention_period: 1w
querier:
  query_ingesters_within: 2h
query_range:
  align_queries_with_step: true
  max_retries: 5
  split_queries_by_interval: 15m
  parallelise_shardable_queries: true
  cache_results: true
  results_cache:
    cache:
      enable_fifocache: true
      fifocache:
        max_size_items: 1024
        validity: 24h
compactor:
  shared_store: filesystem
frontend:
  downstream_url: http://logsearch-loki-distributed-querier:3100
  log_queries_longer_than: 5s
  compress_responses: true
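To double-check that the chart really renders this config into the pods, the effective config can be dumped from any component over its HTTP port (sketch):
kubectl port-forward svc/logsearch-loki-distributed-querier 3100:3100
curl -s http://localhost:3100/config | grep query_ingesters_within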
Tested so far:
Changed the querier's gRPC frontend worker to the HTTP downstream_url => no success
Scaled up the ingesters/queriers (see the sketch after this list) => no success
Removed Graylog from Fluentd (just to be sure) => no success
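The scaling was done through the chart values, roughly like this (the value keys are what I assume the chart uses; the replica counts were just test values):
helm upgrade logsearch grafana/loki-distributed -f loki-values.yaml \
  --set ingester.replicas=5 \
  --set querier.replicas=5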
My main problem is that I can't see any problems in the logs (there are no errors, not even warnings), and I am not really familiar with Loki (yet), so I don't know where to start troubleshooting.
Has anyone run into this problem and found a solution?
Thanks and best regards
Christoph