Loki Querier Load-balancing Issues

Hello again (thanks for all your help):

I spent several days working on this and found that a single DNS name pointing to multiple IPs does not work, but if I point each read node at another read node’s IP for frontend_address or scheduler_address, it connects fine. I have now tried having them all connect to one another, and I’m still seeing one node favored over the others when the cluster is running at baseline.
–Example–
frontend_worker:
  scheduler(or frontend)_address: 10.51.157.9:9443
  grpc_client_config:
    max_send_msg_size: 104857600.0
  parallelism: 10

Interesting. Does it seem better, or no? How much of a discrepancy are we talking about? Also make sure your queries are actually sent to the query frontend, not the querier, although if you are running simple scalable mode this is probably not a problem, since they would be running in the same container.

As far as I know, with a query frontend, queries are split on the query frontend (QF), and the queriers then pull work from the QF at the address configured under frontend_worker. Because it’s a pull mechanism, whatever discrepancy you see wouldn’t be caused by any sort of load balancing.

So after some additional testing, it appears to be functioning well: large queries are significantly faster. I think we are good. Thanks so much for all your help!

Hi Again:

I was hoping you could still help me. I’ve moved to new infrastructure and have now put all the queriers behind a shared DNS name, but I’m getting the following errors:

Mar 21 18:03:03 az1-irs-o11y-prod-loki-read-01 loki[2280155]: level=error ts=2024-03-21T18:03:03.253198493Z caller=frontend_processor.go:69 msg="error processing requests" address=10.51.156.34:9443 err="rpc error: code = Canceled desc = context canceled"
Mar 21 18:03:03 az1-irs-o11y-prod-loki-read-01 loki[2280155]: level=error ts=2024-03-21T18:03:03.556305536Z caller=frontend_processor.go:145 msg="error processing requests" err=EOF
Mar 21 18:03:03 az1-irs-o11y-prod-loki-read-01 loki[2280155]: level=error ts=2024-03-21T18:03:03.607206314Z caller=frontend_processor.go:145 msg="error processing requests" err=EOF
Mar 21 18:03:03 az1-irs-o11y-prod-loki-read-01 loki[2280155]: level=error ts=2024-03-21T18:03:03.786873074Z caller=frontend_processor.go:145 msg="error processing requests" err=EOF
Mar 21 18:03:03 az1-irs-o11y-prod-loki-read-01 loki[2280155]: level=error ts=2024-03-21T18:03:03.790565744Z caller=frontend_processor.go:145 msg="error processing requests" err=EOF

What do you mean by putting the queriers behind a shared DNS name? Please share your configuration.

frontend:
  compress_responses: true
  log_queries_longer_than: 15s
frontend_worker:
  frontend_address: frontend.loki.it.ufl.edu:9443
  grpc_client_config:
    max_send_msg_size: 104857600.0
  parallelism: 12

Here frontend.loki.it.ufl.edu resolves to the querier IPs in a round-robin fashion. When the service starts, it logs that it has connected to the query frontend and that it is ready, but then it spits out the above errors.

  1. You are still running simple scalable mode, yes?

  2. Is 9443 your gRPC port?

That is correct: still running simple scalable mode, and 9443 is the gRPC port.

Not sure. Can you share your entire Loki configuration, please? You might also try enabling debug logging and reducing the number of readers to 1 for troubleshooting purposes, to see if anything obvious pops up.

Loki configuration

loki_version: "2.9.4"

loki_auth_url: loki.it.ufl.edu
loki_cert: "{{ lookup('hashi_vault', 'secret/data/services/certs/irs/lb.it.ufl.edu:fullchain' ) }}"
loki_key: "{{ lookup('hashi_vault', 'secret/data/services/certs/irs/lb.it.ufl.edu:key' ) }}"

loki_http_port: 8443
loki_grpc_port: 9443

loki_systemd_environment: >-
  GOMAXPROCS={{ ansible_processor_vcpus | default(ansible_processor_count) }}
  GOGC=20
loki_system_user: loki
loki_system_group: loki
loki_config_dir: /etc/loki
loki_storage_dir: /loki

loki_auth_enabled: true

_max_tenant_throughput_mb: 40
# Recommended burst value is 1.5x the max throughput value
_max_tenant_throughput_burst_mb: 60

_max_query_timeout: 600

loki_config:
  common:
    replication_factor: 3
    ring:
      kvstore:
        store: memberlist
      heartbeat_timeout: 10m
    storage:
      s3:
        bucketnames: loki
        endpoint: object-prod.ceph.apps.it.ufl.edu
        region: default
        access_key_id: "{{ lookup('hashi_vault', '{{ vault.ceph_ansible_secrets_path }}/{{ deployment.environment }}/object_users/loki:access_key' ) }}"
        secret_access_key: "{{ lookup('hashi_vault', '{{ vault.ceph_ansible_secrets_path }}/{{ deployment.environment }}/object_users/loki:secret_key' ) }}"
        insecure: false
        s3forcepathstyle: true
        http_config:
          insecure_skip_verify: true

  server:
    log_level: debug
    http_listen_port: "{{ loki_http_port }}"
    http_tls_config:
      cert_file: "{{ loki_config_dir }}/ssl/cert.crt"
      key_file: "{{ loki_config_dir }}/ssl/cert.key"
    http_server_read_timeout: "{{ _max_query_timeout + 10 }}s"
    http_server_write_timeout: "{{ _max_query_timeout + 10 }}s"

    grpc_listen_port: "{{ loki_grpc_port }}"
    grpc_server_max_recv_msg_size: 104857600
    grpc_server_max_send_msg_size: 104857600
    grpc_server_max_concurrent_streams: 1500

  ingester:
    chunk_idle_period: 1h
    max_chunk_age: 2h
    flush_check_period: 10s
    wal:
      replay_memory_ceiling: "{{ (ansible_memtotal_mb * 0.75) | int }}MB"

  querier:
    max_concurrent: 6000
    multi_tenant_queries_enabled: true

  memberlist:
    abort_if_cluster_join_fails: false
    bind_port: 7946
    join_members: "{{ groups['loki'] }}"
    max_join_backoff: 1m
    max_join_retries: 10
    min_join_backoff: 1s
    # auto attempt to rejoin cluster if disconnected, helps prevent split brain
    rejoin_interval: 1m

  schema_config:
    configs:
      - from: "2023-03-15"
        store: tsdb
        object_store: s3
        schema: v12
        index:
          prefix: index_tsdb_
          period: 24h

  storage_config:
    hedging:
      at: "250ms"
      max_per_second: 20
      up_to: 3
    tsdb_shipper:
      active_index_directory: "{{ loki_storage_dir }}/tsdb-shipper-active"
      cache_location: "{{ loki_storage_dir }}/tsdb-shipper-cache"
      shared_store: s3

  frontend:
    log_queries_longer_than: 15s
    compress_responses: true

  frontend_worker:
    frontend_address: frontend.loki.it.ufl.edu:9443
    grpc_client_config:
      max_send_msg_size: 1.048576e+08
    parallelism: 6

  query_range:
    align_queries_with_step: true
    max_retries: 5
    cache_results: true
    results_cache:
      cache:
        embedded_cache:
          enabled: true
          max_size_mb: 2048
          ttl: 1h

  query_scheduler:
    max_outstanding_requests_per_tenant: 42768

  limits_config:
    enforce_metric_name: false
    # Throughput for a tenant/user/org-id per node
    ingestion_rate_mb: "{{ _max_tenant_throughput_mb }}"
    ingestion_burst_size_mb: "{{ _max_tenant_throughput_burst_mb }}"
    per_stream_rate_limit: "{{ _max_tenant_throughput_mb }}MB"
    per_stream_rate_limit_burst: "{{ _max_tenant_throughput_burst_mb }}MB"
    max_entries_limit_per_query: 100000
    max_global_streams_per_user: 20000
    retention_period: 2w
    query_timeout: 3m
    max_cache_freshness_per_query: "10m"
    # parallelize queries in 15min intervals
    split_queries_by_interval: 15m
    # limit how far back we will accept logs
    reject_old_samples: true
    # Increase maximum from default 500 to prevent the error maximum of series (500) reached for a single query
    max_query_series: 10000
    max_query_parallelism: 6

  compactor:
    working_directory: "{{ loki_storage_dir }}/compactor"
    shared_store: s3
    compaction_interval: 1m
    retention_enabled: true

I’ve enabled debug-level logging and will paste what I see.

Honestly, your configuration looks pretty good. I would recommend removing the TLS configuration and seeing if it works without TLS (for troubleshooting purposes). If it does, then you know where the problem is, and it’ll make troubleshooting easier.

Excuse my ignorance, but where amongst the config should I disable TLS?

I would probably start by commenting out the configurations that specify a cert file location or a cert.
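
As a rough sketch of that step, the HTTP TLS settings in the server block could be commented out like this. It assumes the server block posted above is the only place Loki itself is handed a certificate; no grpc_tls_config appears in the posted configuration. The comments are added for illustration; every key and value is taken from the posted config.

```yaml
loki_config:
  server:
    log_level: debug
    http_listen_port: "{{ loki_http_port }}"
    # Temporarily disabled for troubleshooting; restore once the test is done.
    # http_tls_config:
    #   cert_file: "{{ loki_config_dir }}/ssl/cert.crt"
    #   key_file: "{{ loki_config_dir }}/ssl/cert.key"
    http_server_read_timeout: "{{ _max_query_timeout + 10 }}s"
    http_server_write_timeout: "{{ _max_query_timeout + 10 }}s"

    # gRPC settings left unchanged.
    grpc_listen_port: "{{ loki_grpc_port }}"
```

With http_tls_config disabled, the HTTP port serves plain HTTP, so anything that calls it over https would need to be pointed at http for the duration of the test.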