Grafana Loki write target errors on ratestore

Hi, I am trying to deploy Loki in simple scalable deployment (SSD) mode on AWS Elastic Container Service (ECS). Currently I have three autoscaling groups for the read, write, and backend targets, all behind an NGINX server that reverse-proxies the push/query requests.

I have adapted the configuration found here, which deploys Loki SSD using docker-compose (so all containers run on the same network). All targets ship with the same config below, with S3 as the storage backend and Consul as the KV store for discovery and the hash rings.

Low-volume writes seem mostly OK, but when I load-tested the system with slightly bigger write-only loads, the write targets logged the following errors.

level=error ts=2024-08-01T00:00:00.0000000Z caller=client.go:243 msg="error getting path" key=loki/ring err="Get \"<consul_server_hostname:8500>/v1/kv/loki/ring?index=286535604&stale=&wait=10000ms\": context canceled"

level=error ts=2024-08-01T00:00:00.0000000Z caller=ratestore.go:303 msg="unable to get stream rates from ingester" ingester=10.0.0.0:9095 err="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 10.0.0.0:9095: connect: connection refused\""

These two errors pop up a lot when heavy writes occur, while the read and backend targets don’t seem to log any errors at all. On the Consul server I can see that values are being read/written correctly by the Loki services. As for the “unable to get stream rates from ingester” error, I am not sure why it happens: it comes from a write target, which contains both an ingester and a distributor, so why would it be trying to connect to an ingester?

target: "${GF_LOKI_TARGET}"
auth_enabled: false

server:
  http_listen_address: 0.0.0.0
  grpc_listen_address: 0.0.0.0
  http_listen_port: 3100
  grpc_listen_port: 9095
  log_level: debug

common:
  path_prefix: /loki
  ring:
    kvstore:
      store: consul
      prefix: loki/
      consul:
        host: <consul_server_hostname:8500>
  compactor_address: "${GF_LOKI_BACKEND_HTTP}"
  compactor_grpc_address: "${GF_LOKI_BACKEND_GRPC}"

storage_config:
  index_cache_validity: 5m
  tsdb_shipper:
    active_index_directory: /loki/index
    cache_location: /loki/index_cache
  aws:
    s3: "s3://${GF_LOKI_STORAGE_REGION}/${GF_LOKI_STORAGE_NAME}"
    s3forcepathstyle: true

ingester:
  chunk_retain_period: 6m # should be higher than index_cache_validity
  chunk_idle_period: 10m
  max_chunk_age: 10m
  chunk_encoding: snappy
  flush_op_timeout: 1m

schema_config:
  configs:
    - from: 2024-06-03
      store: tsdb
      object_store: s3
      schema: v13
      index:
        prefix: index_
        period: 24h

limits_config:
  reject_old_samples: true
  reject_old_samples_max_age: 90m
  ingestion_rate_mb: 50
  ingestion_burst_size_mb: 50
  discover_service_name: []
  # parallelize queries in 15min intervals
  split_queries_by_interval: 15m

query_range:
  # make queries more cache-able by aligning them with their step intervals
  align_queries_with_step: true
  cache_results: true

frontend:
  log_queries_longer_than: 5s

query_scheduler:
  max_outstanding_requests_per_tenant: 1024

querier:
  query_ingesters_within: 2h

compactor:
  retention_enabled: false
  working_directory: /tmp/loki/compactor

After some research, I realized the “context canceled” error may be due to requests being cancelled upstream, so maybe it’s not an indication of something wrong (given that the Consul KV store is being read/written correctly)?

However, I am still seeing a lot of “unable to get stream rates from ingester” errors from the target=write services. Can someone help?

  1. Your first error may be an indication that your Consul instance/cluster is overwhelmed.
  2. In a Loki cluster it’s important for all components to be able to communicate with each other, especially from read targets to write targets. So you’d want to make sure that your containers have their own IPs (awsvpc mode instead of bridge mode), and manually try to hit one container from another to see what happens; see the config sketch after this list.
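For point 2, if you stay on host or bridge networking, one thing worth double-checking is which address each write target advertises in the ring, because that is the address its peers will dial on the gRPC port (9095). If I remember the common config correctly, you can pin it explicitly; a minimal sketch (the env var name is just a placeholder I made up, and please verify the field names against the docs for your Loki version):

common:
  # Address/interface this instance advertises in the ring. Peers dial this
  # address on the gRPC port, so it must be reachable from the other targets.
  instance_addr: "${GF_LOKI_ADVERTISE_ADDR}"   # placeholder, not from your config
  instance_interface_names:
    - eth0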

Thanks for the reply! I am using host networking mode for the containers, so each target should be able to use the host IP and its own ports to communicate. I can see on the /ring endpoints that instances are able to send heartbeats over gRPC. It’s only when I load-test the Loki SSD with massive writes in a short period of time that those “unable to get stream rates from ingester” errors occur. I think this problem is specific to write targets querying stream rates from themselves or from peer write targets (since the ingester and the distributor/ratestore are bundled inside the write target).
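For reference, this is the polling I mean: as far as I understand, the distributor in each write target periodically asks every ingester in the ring for per-stream rates, which is why a write target ends up dialing ingesters (including its peers) on 9095. If I’m reading the docs right, this can be tuned under distributor.rate_store; a rough sketch with illustrative values only (please double-check the field names for your Loki version):

distributor:
  rate_store:
    # how often the distributor polls ingesters for stream rates
    stream_rate_update_interval: 1s
    # per-poll timeout; bumping it may help if ingesters answer slowly under heavy writes
    ingester_request_timeout: 1s
    # cap on concurrent rate requests fanned out to ingesters
    max_request_parallelism: 200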

If you are hitting errors while load testing, try increasing the gRPC server message size limits a bit and see if that helps:

server:
  # 100MB
  grpc_server_max_recv_msg_size: 104857600
  grpc_server_max_send_msg_size: 104857600
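If bumping the server-side limits alone doesn’t help, the client side of the ingester connection has its own message size limits as well; I believe they live under ingester_client.grpc_client_config (please verify against the docs for your Loki version):

ingester_client:
  grpc_client_config:
    # 100MB, matching the server-side limits above
    max_recv_msg_size: 104857600
    max_send_msg_size: 104857600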