Loki Querier Load-balancing Issues

We are seeing a persistent issue where one read node is favored in our simple scalable deployment. If we restart the cluster, another node takes its place. The other nodes are used, but traffic significantly favors a single read node. The write path is unaffected.

Configuration is as follows:
Deployment mode=simple scalable
Loki version=2.9.4
Read nodes=3
Write nodes=3

Both the read and write paths are behind an Nginx reverse proxy for load balancing.

Please help us to determine how to get our read path properly balanced…

Loki Configuration:

loki_version: "2.9.4"

loki_http_port: 8443
loki_grpc_port: 9443
loki_system_user: loki
loki_system_group: loki
loki_config_dir: /etc/loki
loki_storage_dir: /loki

loki_auth_enabled: true
_max_tenant_throughput_mb: 40
_max_tenant_throughput_burst_mb: 60
_max_query_timeout: 600

loki_config:
  common:
    replication_factor: 3
    ring:
      kvstore:
        store: memberlist
      heartbeat_timeout: 10m
    storage:
      s3:
        bucketnames: loki
        endpoint: object-test.ceph
        region: default
        access_key_id: OMITTED
        secret_access_key: OMITTED
        insecure: false
        s3forcepathstyle: true
        http_config:
          insecure_skip_verify: true

  server:
    log_level: info
    http_listen_port: "8443"
    http_tls_config:
      cert_file: "{{ loki_config_dir }}/ssl/cert.crt"
      key_file: "{{ loki_config_dir }}/ssl/cert.key"
    http_server_read_timeout: "{{ _max_query_timeout + 10 }}s"
    http_server_write_timeout: "{{ _max_query_timeout + 10 }}s"

    grpc_listen_port: "{{ loki_grpc_port }}"
    grpc_server_max_recv_msg_size: 104857600
    grpc_server_max_send_msg_size: 104857600
    grpc_server_max_concurrent_streams: 1000

  ingester:
    chunk_idle_period: 1h
    max_chunk_age: 2h
    flush_check_period: 10s
    wal:
      replay_memory_ceiling: "{{ (ansible_memtotal_mb * 0.75) | int }}MB"

  querier:
    multi_tenant_queries_enabled: true
    max_concurrent: 3

  memberlist:
    abort_if_cluster_join_fails: false
    bind_port: 7946
    join_members: "{{ groups['loki'] }}"
    max_join_backoff: 1m
    max_join_retries: 10
    min_join_backoff: 1s
    rejoin_interval: 1m

  schema_config:
    configs:
      - from: 2020-05-15
        store: boltdb-shipper
        object_store: s3
        schema: v11
        index:
          prefix: index_
          period: 24h

      - from: "2023-01-30"
        store: tsdb
        object_store: s3
        schema: v12
        index:
          prefix: index_tsdb_
          period: 24h

  storage_config:
    hedging:
      at: "250ms"
      max_per_second: 20
      up_to: 3

    boltdb_shipper:
      active_index_directory: "storage/boltdb-shipper-active"
      cache_location: "storage/boltdb-shipper-cache"
      cache_ttl: 24h # Can be increased for faster performance over longer query periods; uses more disk space
      shared_store: s3

    tsdb_shipper:
      active_index_directory: "storage/tsdb-shipper-active"
      cache_location: "storage/tsdb-shipper-cache"
      shared_store: s3

  frontend:
    log_queries_longer_than: 15s
    compress_responses: true

  query_range:
    align_queries_with_step: true
    max_retries: 5
    cache_results: true
    results_cache:
      cache:
        embedded_cache:
          enabled: true
          max_size_mb: 2048
          ttl: 1h

  query_scheduler:
    use_scheduler_ring: true
    scheduler_ring:
      kvstore:
        store: memberlist

  limits_config:
    enforce_metric_name: false
    ingestion_rate_mb: "{{ _max_tenant_throughput_mb }}"
    ingestion_burst_size_mb: "{{ _max_tenant_throughput_burst_mb }}"
    per_stream_rate_limit: "{{ _max_tenant_throughput_mb }}MB"
    per_stream_rate_limit_burst: "{{ _max_tenant_throughput_burst_mb }}MB"
    max_entries_limit_per_query: 100000

    max_global_streams_per_user: 20000
    retention_period: 2w
    query_timeout: "{{ _max_query_timeout }}s"
    max_cache_freshness_per_query: "10m"
    split_queries_by_interval: 15m
    reject_old_samples: true
    max_query_series: 10000
    max_query_parallelism: 32

  compactor:
    working_directory: "storage/compactor"
    shared_store: s3
    compaction_interval: 1m
    retention_enabled: true

I don't think the Loki read path does any internal routing. I'd suggest reviewing your Nginx configuration and perhaps trying a different load-balancing method, such as least connections.
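For illustration only, here's a minimal sketch of what a least-connections read-path upstream might look like in Nginx; the upstream name and hostnames are placeholders, and the TLS/proxy details of your existing server block are omitted:

# Hypothetical read-path upstream: least_conn routes each new request to the
# backend with the fewest active connections, which tends to balance better
# than round-robin when query durations vary widely.
upstream loki_read {
    least_conn;
    server loki-read-01.internal.example:8443;
    server loki-read-02.internal.example:8443;
    server loki-read-03.internal.example:8443;
}

# ...and in the existing read-path server block:
#     proxy_pass https://loki_read;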

That said, a couple of things may be worth looking into:

  1. You don't seem to have the query frontend configured even though you have query splitting enabled (neither downstream_url nor a worker address is configured), unless you are passing it in as an environment variable.

  2. How are you determining that one node is being favored? Number of active connections? Request counts per backend recorded by Nginx? Resource consumption?

Thanks for your response.

  1. Would removing this configuration help? We had not fully configured the query frontend and were unsure how to do so in a simple scalable deployment that isn't on Kubernetes.

  2. We are running Grafana Agent on the VMs and scraping Prometheus metrics. We can see which ingester is taking the brunt of the logs.

Are you referring to the query splitting? That's your decision. In my opinion, if you have a reasonably sizable log volume and at least sometimes query intervals longer than a couple of hours, then you should enable query splitting.

It's pretty easy to configure the query frontend (see Query frontend example | Grafana Loki documentation). I would recommend the pull method (by configuring frontend_worker.frontend_address), and since you are using simple scalable mode, the frontend_address would be the service name URL for your read path.
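As a rough sketch of the pull configuration (the hostname below is just a placeholder; it should be a name that resolves to your read instances, and the port must be the gRPC port, not the HTTP one):

frontend_worker:
  # Placeholder address; replace with your internal read-path name + gRPC port.
  frontend_address: loki-read.internal.example:9095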

Your original question was about the querier, or am I mistaken?

Regardless, uneven utilization across ingesters is generally expected. There are ways to deal with it (I haven't tried them myself); see Automatic stream sharding | Grafana Loki documentation.
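If you do want to experiment with automatic stream sharding, the documentation describes enabling it under limits_config, roughly like the sketch below (I haven't run this myself, so verify the exact option names and defaults against the docs for your Loki version):

limits_config:
  shard_streams:
    enabled: true
    # Streams pushing more than this rate get split into sub-streams
    # so their load spreads across more ingesters.
    desired_rate: 3MB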

Sorry, that was my mistake; I meant querier, not ingester. I will give the query frontend configuration a shot. Thanks!

frontend:
  log_queries_longer_than: 15s
  compress_responses: true

frontend_worker:
  frontend_address: read.test.loki.it.ufl.edu
  grpc_client_config:
    max_send_msg_size: 1.048576e+08
  parallelism: 10

query_range:
  align_queries_with_step: true
  max_retries: 5
  cache_results: true
  results_cache:
    cache:
      embedded_cache:
        enabled: true
        max_size_mb: 2048
        ttl: 1h

query_scheduler:
  # Needed to avoid getting "429 too many requests" errors when querying;
  # recommended value from TSDB | Grafana Loki documentation
  max_outstanding_requests_per_tenant: 32768
  use_scheduler_ring: true
  scheduler_ring:
    kvstore:
      store: memberlist

I’m using this configuration right now and unfortunately still seeing one querier being used significantly more. Any adjustments that could be made?

Double-check the logs from your querier and make sure it's actually working.
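For example, on each read node (assuming the systemd unit is simply called loki), something like this should surface any frontend or scheduler connection errors:

journalctl -u loki --since "1 hour ago" | grep -Ei "error|scheduler|frontend"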

The frontend_address should be the internal service discovery name with the gRPC port, meaning this should not go through the load balancer. Here is what our frontend_worker configuration looks like (loki-read.services.internal is our service discovery record for the read containers):

frontend_worker:
  frontend_address: loki-read.services.internal:9095
  grpc_client_config:
    max_recv_msg_size: 1.048576e+08
    max_send_msg_size: 1.048576e+08
  parallelism: 2

Here are the logs from the start of the service (our gRPC port is 9443)… I do not see any internal service name:

Feb 7 21:09:47 az1-irs-o11y-test-loki-read-01 systemd[1]: Starting Loki is a horizontally-scalable, highly-available, multi-tenant log aggregation system inspired by Prometheus…
Feb 7 21:09:47 az1-irs-o11y-test-loki-read-01 loki[742104]: level=info ts=2024-02-07T21:09:47.447206121Z caller=main.go:73 msg=“config is valid”
Feb 7 21:09:47 az1-irs-o11y-test-loki-read-01 systemd[1]: Started Loki is a horizontally-scalable, highly-available, multi-tenant log aggregation system inspired by Prometheus.
Feb 7 21:09:47 az1-irs-o11y-test-loki-read-01 loki[742113]: level=info ts=2024-02-07T21:09:47.491125781Z caller=main.go:108 msg=“Starting Loki” version=“(version=2.9.4, branch=HEAD, revision=f599ebc535)”
Feb 7 21:09:47 az1-irs-o11y-test-loki-read-01 loki[742113]: level=info ts=2024-02-07T21:09:47.49282129Z caller=server.go:322 http=[::]:8443 grpc=[::]:9443 msg=“server listening on addresses”
Feb 7 21:09:47 az1-irs-o11y-test-loki-read-01 loki[742113]: level=info ts=2024-02-07T21:09:47.493348247Z caller=memberlist_client.go:434 msg=“Using memberlist cluster label and node name” cluster_label= node=az1-irs-o11y-test-loki-read-01.server.ufl.edu-924cde6b
Feb 7 21:09:47 az1-irs-o11y-test-loki-read-01 loki[742113]: level=info ts=2024-02-07T21:09:47.494083497Z caller=memberlist_client.go:540 msg=“memberlist fast-join starting” nodes_found=6 to_join=4
Feb 7 21:09:47 az1-irs-o11y-test-loki-read-01 loki[742113]: level=warn ts=2024-02-07T21:09:47.494793315Z caller=experimental.go:20 msg=“experimental feature in use” feature=“In-memory (FIFO) cache - frontend.embedded-cache”
Feb 7 21:09:47 az1-irs-o11y-test-loki-read-01 loki[742113]: level=warn ts=2024-02-07T21:09:47.495647016Z caller=cache.go:127 msg=“fifocache config is deprecated. use embedded-cache instead”
Feb 7 21:09:47 az1-irs-o11y-test-loki-read-01 loki[742113]: level=warn ts=2024-02-07T21:09:47.495658239Z caller=experimental.go:20 msg=“experimental feature in use” feature=“In-memory (FIFO) cache - chunksembedded-cache”
Feb 7 21:09:47 az1-irs-o11y-test-loki-read-01 loki[742113]: level=info ts=2024-02-07T21:09:47.496871417Z caller=table_manager.go:271 index-store=boltdb-shipper-2020-05-15 msg=“query readiness setup completed” duration=1.318µs distinct_users_len=0 distinct_users=
Feb 7 21:09:47 az1-irs-o11y-test-loki-read-01 loki[742113]: level=info ts=2024-02-07T21:09:47.496909399Z caller=shipper.go:165 index-store=boltdb-shipper-2020-05-15 msg=“starting index shipper in RO mode”
Feb 7 21:09:47 az1-irs-o11y-test-loki-read-01 loki[742113]: level=info ts=2024-02-07T21:09:47.49691706Z caller=shipper_index_client.go:76 index-store=boltdb-shipper-2020-05-15 msg=“starting boltdb shipper in RO mode”
Feb 7 21:09:47 az1-irs-o11y-test-loki-read-01 loki[742113]: level=info ts=2024-02-07T21:09:47.497285613Z caller=table_manager.go:271 index-store=tsdb-2023-01-30 msg=“query readiness setup completed” duration=651ns distinct_users_len=0 distinct_users=
Feb 7 21:09:47 az1-irs-o11y-test-loki-read-01 loki[742113]: level=info ts=2024-02-07T21:09:47.497295087Z caller=shipper.go:165 index-store=tsdb-2023-01-30 msg=“starting index shipper in RO mode”
Feb 7 21:09:47 az1-irs-o11y-test-loki-read-01 loki[742113]: level=info ts=2024-02-07T21:09:47.49856141Z caller=mapper.go:47 msg=“cleaning up mapped rules directory” path=/rules
Feb 7 21:09:47 az1-irs-o11y-test-loki-read-01 loki[742113]: level=info ts=2024-02-07T21:09:47.499802647Z caller=worker.go:112 msg=“Starting querier worker using query-scheduler and scheduler ring for addresses”
Feb 7 21:09:47 az1-irs-o11y-test-loki-read-01 loki[742113]: level=info ts=2024-02-07T21:09:47.502809408Z caller=module_service.go:82 msg=initialising module=cache-generation-loader
Feb 7 21:09:47 az1-irs-o11y-test-loki-read-01 loki[742113]: level=info ts=2024-02-07T21:09:47.502802431Z caller=module_service.go:82 msg=initialising module=server
Feb 7 21:09:47 az1-irs-o11y-test-loki-read-01 loki[742113]: level=info ts=2024-02-07T21:09:47.502860803Z caller=module_service.go:82 msg=initialising module=query-frontend-tripperware
Feb 7 21:09:47 az1-irs-o11y-test-loki-read-01 loki[742113]: level=info ts=2024-02-07T21:09:47.502874263Z caller=module_service.go:82 msg=initialising module=memberlist-kv
Feb 7 21:09:47 az1-irs-o11y-test-loki-read-01 loki[742113]: level=info ts=2024-02-07T21:09:47.502883521Z caller=module_service.go:82 msg=initialising module=index-gateway-ring
Feb 7 21:09:47 az1-irs-o11y-test-loki-read-01 loki[742113]: level=info ts=2024-02-07T21:09:47.502883991Z caller=module_service.go:82 msg=initialising module=analytics
Feb 7 21:09:47 az1-irs-o11y-test-loki-read-01 loki[742113]: level=info ts=2024-02-07T21:09:47.50295014Z caller=module_service.go:82 msg=initialising module=ring
Feb 7 21:09:47 az1-irs-o11y-test-loki-read-01 loki[742113]: level=info ts=2024-02-07T21:09:47.50296528Z caller=module_service.go:82 msg=initialising module=query-scheduler-ring
Feb 7 21:09:47 az1-irs-o11y-test-loki-read-01 loki[742113]: level=info ts=2024-02-07T21:09:47.503256809Z caller=module_service.go:82 msg=initialising module=compactor
Feb 7 21:09:47 az1-irs-o11y-test-loki-read-01 loki[742113]: level=info ts=2024-02-07T21:09:47.503651651Z caller=memberlist_client.go:560 msg=“memberlist fast-join finished” joined_nodes=4 elapsed_time=9.570666ms
Feb 7 21:09:47 az1-irs-o11y-test-loki-read-01 loki[742113]: level=info ts=2024-02-07T21:09:47.503668947Z caller=memberlist_client.go:573 msg=“joining memberlist cluster” join_members=az1-irs-o11y-test-loki-write-01.server.ufl.edu,az1-irs-o11y-test-loki-write-02.server.ufl.edu,az2-irs-o11y-test-loki-write-01.server.ufl.edu,az1-irs-o11y-test-loki-read-01.server.ufl.edu,az1-irs-o11y-test-loki-read-02.server.ufl.edu,az2-irs-o11y-test-loki-read-01.server.ufl.edu
Feb 7 21:09:47 az1-irs-o11y-test-loki-read-01 loki[742113]: level=info ts=2024-02-07T21:09:47.503799864Z caller=basic_lifecycler.go:297 msg=“instance not found in the ring” instance=az1-irs-o11y-test-loki-read-01.server.ufl.edu ring=compactor
Feb 7 21:09:47 az1-irs-o11y-test-loki-read-01 loki[742113]: level=info ts=2024-02-07T21:09:47.503816733Z caller=basic_lifecycler_delegates.go:63 msg=“not loading tokens from file, tokens file path is empty”
Feb 7 21:09:47 az1-irs-o11y-test-loki-read-01 loki[742113]: level=info ts=2024-02-07T21:09:47.504067015Z caller=basic_lifecycler.go:297 msg=“instance not found in the ring” instance=az1-irs-o11y-test-loki-read-01.server.ufl.edu ring=scheduler
Feb 7 21:09:47 az1-irs-o11y-test-loki-read-01 loki[742113]: level=info ts=2024-02-07T21:09:47.504089442Z caller=basic_lifecycler_delegates.go:63 msg=“not loading tokens from file, tokens file path is empty”
Feb 7 21:09:47 az1-irs-o11y-test-loki-read-01 loki[742113]: level=info ts=2024-02-07T21:09:47.504097165Z caller=compactor.go:395 msg=“waiting until compactor is JOINING in the ring”
Feb 7 21:09:47 az1-irs-o11y-test-loki-read-01 loki[742113]: level=info ts=2024-02-07T21:09:47.504391543Z caller=module_service.go:82 msg=initialising module=ingester-querier
Feb 7 21:09:47 az1-irs-o11y-test-loki-read-01 loki[742113]: level=info ts=2024-02-07T21:09:47.504421073Z caller=ringmanager.go:201 msg=“waiting until scheduler is JOINING in the ring”
Feb 7 21:09:47 az1-irs-o11y-test-loki-read-01 loki[742113]: level=info ts=2024-02-07T21:09:47.504430489Z caller=ringmanager.go:205 msg=“scheduler is JOINING in the ring”
Feb 7 21:09:47 az1-irs-o11y-test-loki-read-01 loki[742113]: level=info ts=2024-02-07T21:09:47.504453871Z caller=basic_lifecycler.go:297 msg=“instance not found in the ring” instance=az1-irs-o11y-test-loki-read-01.server.ufl.edu ring=index-gateway
Feb 7 21:09:47 az1-irs-o11y-test-loki-read-01 loki[742113]: level=info ts=2024-02-07T21:09:47.504464865Z caller=basic_lifecycler_delegates.go:63 msg=“not loading tokens from file, tokens file path is empty”
Feb 7 21:09:47 az1-irs-o11y-test-loki-read-01 loki[742113]: level=info ts=2024-02-07T21:09:47.504603657Z caller=ringmanager.go:184 msg=“waiting until index gateway is JOINING in the ring”
Feb 7 21:09:47 az1-irs-o11y-test-loki-read-01 loki[742113]: level=info ts=2024-02-07T21:09:47.517281453Z caller=memberlist_client.go:592 msg=“joining memberlist cluster succeeded” reached_nodes=6 elapsed_time=13.612804ms
Feb 7 21:09:47 az1-irs-o11y-test-loki-read-01 loki[742113]: level=info ts=2024-02-07T21:09:47.622933779Z caller=ringmanager.go:188 msg=“index gateway is JOINING in the ring”
Feb 7 21:09:47 az1-irs-o11y-test-loki-read-01 loki[742113]: level=info ts=2024-02-07T21:09:47.677168869Z caller=compactor.go:399 msg=“compactor is JOINING in the ring”
Feb 7 21:09:48 az1-irs-o11y-test-loki-read-01 loki[742113]: level=info ts=2024-02-07T21:09:48.504817277Z caller=ringmanager.go:214 msg=“waiting until scheduler is ACTIVE in the ring”
Feb 7 21:09:48 az1-irs-o11y-test-loki-read-01 loki[742113]: level=info ts=2024-02-07T21:09:48.623092318Z caller=ringmanager.go:197 msg=“waiting until index gateway is ACTIVE in the ring”
Feb 7 21:09:48 az1-irs-o11y-test-loki-read-01 loki[742113]: level=info ts=2024-02-07T21:09:48.629164242Z caller=ringmanager.go:218 msg=“scheduler is ACTIVE in the ring”
Feb 7 21:09:48 az1-irs-o11y-test-loki-read-01 loki[742113]: level=info ts=2024-02-07T21:09:48.629210167Z caller=module_service.go:82 msg=initialising module=query-frontend
Feb 7 21:09:48 az1-irs-o11y-test-loki-read-01 loki[742113]: level=info ts=2024-02-07T21:09:48.629219337Z caller=module_service.go:82 msg=initialising module=query-scheduler
Feb 7 21:09:48 az1-irs-o11y-test-loki-read-01 loki[742113]: level=info ts=2024-02-07T21:09:48.677288124Z caller=compactor.go:409 msg=“waiting until compactor is ACTIVE in the ring”
Feb 7 21:09:48 az1-irs-o11y-test-loki-read-01 loki[742113]: level=info ts=2024-02-07T21:09:48.784443073Z caller=compactor.go:413 msg=“compactor is ACTIVE in the ring”
Feb 7 21:09:48 az1-irs-o11y-test-loki-read-01 loki[742113]: level=info ts=2024-02-07T21:09:48.803650571Z caller=ringmanager.go:201 msg=“index gateway is ACTIVE in the ring”
Feb 7 21:09:48 az1-irs-o11y-test-loki-read-01 loki[742113]: level=info ts=2024-02-07T21:09:48.803742624Z caller=module_service.go:82 msg=initialising module=store
Feb 7 21:09:48 az1-irs-o11y-test-loki-read-01 loki[742113]: level=info ts=2024-02-07T21:09:48.803762545Z caller=module_service.go:82 msg=initialising module=ruler
Feb 7 21:09:48 az1-irs-o11y-test-loki-read-01 loki[742113]: level=info ts=2024-02-07T21:09:48.803760735Z caller=module_service.go:82 msg=initialising module=querier
Feb 7 21:09:48 az1-irs-o11y-test-loki-read-01 loki[742113]: level=info ts=2024-02-07T21:09:48.803777538Z caller=ruler.go:528 msg=“ruler up and running”
Feb 7 21:09:48 az1-irs-o11y-test-loki-read-01 loki[742113]: level=info ts=2024-02-07T21:09:48.803823074Z caller=module_service.go:82 msg=initialising module=index-gateway
Feb 7 21:09:48 az1-irs-o11y-test-loki-read-01 loki[742113]: level=info ts=2024-02-07T21:09:48.803884158Z caller=loki.go:505 msg=“Loki started”
Feb 7 21:09:51 az1-irs-o11y-test-loki-read-01 loki[742113]: level=info ts=2024-02-07T21:09:51.804465886Z caller=worker.go:209 msg=“adding connection” addr=10.51.32.18:9443
Feb 7 21:09:51 az1-irs-o11y-test-loki-read-01 loki[742113]: level=info ts=2024-02-07T21:09:51.804636665Z caller=worker.go:209 msg=“adding connection” addr=10.51.157.9:9443
Feb 7 21:09:51 az1-irs-o11y-test-loki-read-01 loki[742113]: level=warn ts=2024-02-07T21:09:51.804714223Z caller=worker.go:254 msg=“max concurrency is not evenly divisible across targets, adding an extra connection” addr=10.51.32.18:9443
Feb 7 21:09:51 az1-irs-o11y-test-loki-read-01 loki[742113]: level=warn ts=2024-02-07T21:09:51.804795569Z caller=scheduler_processor.go:98 msg=“error contacting scheduler” err=“rpc error: code = Canceled desc = context canceled” addr=10.51.32.18:9443
Feb 7 21:09:58 az1-irs-o11y-test-loki-read-01 loki[742113]: level=info ts=2024-02-07T21:09:58.630124341Z caller=frontend_scheduler_worker.go:107 msg=“adding connection to scheduler” addr=10.51.32.18:9443
Feb 7 21:09:58 az1-irs-o11y-test-loki-read-01 loki[742113]: level=info ts=2024-02-07T21:09:58.630625474Z caller=frontend_scheduler_worker.go:107 msg=“adding connection to scheduler” addr=10.51.157.9:9443

Hi, I was just checking back in to see if you could help me identify what my frontend_address should be.

Thanks!

frontend_address should be the service discovery name for your read containers if you're using simple scalable mode, or for the query frontend if you're using microservices mode.

Remember that the querier needs to be able to connect to the query frontend on the gRPC port, so it needs to be internal rather than going through the frontend load balancer.

OK, do you mind providing a little additional advice on that? Here is how the data flows in my environment:

Log data (Grafana Agent) → Nginx reverse proxy for tenant auth (single DNS name) → read path (single DNS name pointing to the Nginx reverse-proxy load balancer) → multiple VMs running the Loki process.

I am thinking it would be the Loki read-path DNS name with the gRPC port, but I could be wrong.

It would be similar to how your memberlist ring is configured. It needs to sit before your Nginx reverse proxy. If you were using Kubernetes, it would be the service discovery name.

I think the confusing part here is that I’m not using Kubernetes, so if it’s before the reverse proxy it would just be the DNS name that takes it to the reverse proxy.

In your memberlist configuration you have:

join_members: "{{ groups['loki'] }}"

What does the groups['loki'] value expand to? Are you running simple scalable mode?

So that's an Ansible group that expands to all of the Loki members (read and write), which are VMs running the Loki binary… Once requests pass through the reverse proxy, they are given either a read or a write header depending on the path and sent to the group of read or write nodes.

Then that's probably what you want, but only the read part. Try this:

  1. Create a DNS A record with multiple values that point to your read instances only.

  2. Configure your frontend_address with that DNS record (with port 9095; see my example above).

See if that works. If it does, you can then worry about automating the DNS record creation based on the number of read instances provisioned.
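Once the record exists, it's worth sanity-checking it from the read nodes before pointing Loki at it (the record name below is a placeholder; use whichever gRPC port you configured):

# Should return all of the read-node IPs
dig +short loki-read.internal.example

# Should connect from every read node on the gRPC port
nc -vz loki-read.internal.example 9095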

Hello again:

So I've configured the DNS record to point to the three read node IPs.
We are using a custom port for gRPC.

frontend_worker:
  frontend_address: frontend.test.loki.it.ufl.edu:9443
  grpc_client_config:
    max_send_msg_size: 1.048576e+08
  parallelism: 10

The following errors pop up:
Feb 19 09:47:58 az1-irs-o11y-test-loki-read-01.server.ufl.edu loki[1651031]: level=error ts=2024-02-19T14:47:58.224711178Z caller=frontend_processor.go:63 msg=“error contacting frontend” address=10.51.32.20:9443 err=“rpc error: code = Canceled desc = context canceled”
Feb 19 09:47:58 az1-irs-o11y-test-loki-read-01.server.ufl.edu loki[1651031]: level=error ts=2024-02-19T14:47:58.224763614Z caller=frontend_processor.go:63 msg=“error contacting frontend” address=10.51.32.20:9443 err=“rpc error: code = Canceled desc = context canceled”

I noticed in netstat that they seem to be listening on IPv6 but not IPv4, which is odd since we aren't using IPv6.

Looks like your DNS record resolves to 10.51.32.20 (or at least one of them). Are you able to telnet to that IP on port 9443 from your reader instances?

root@az1-irs-o11y-test-loki-read-01:/etc/loki# telnet 10.51.32.18 9443
Trying 10.51.32.18…
Connected to 10.51.32.18.
Escape character is ‘^]’.
@▒xterm-256color

Looks like I'm able to telnet over that port successfully.

Interesting. I'd double-check that the query frontend is indeed running on that instance. You can hit the /config endpoint to verify the configuration present on a particular instance. Also check the query-frontend logs for errors, and enable debug logging if needed.
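For example, something along these lines on each read node (adjust the port and TLS flags to your setup) will show whether the frontend_worker settings made it into the running configuration:

# -k because the instances use internal/self-signed certificates in this setup
curl -sk https://localhost:8443/config | grep -A 3 frontend_worker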