Hi, I am trying to deploy Loki in Simple Scalable Deployment (SSD) mode on AWS Elastic Container Service (ECS). Currently I have three autoscaling groups for the read, write, and backend targets, all of them hidden behind an NGINX server that reverse-proxies the push/query requests.
I have adapted the configuration found here, which deploys Loki SSD using docker-compose (so all containers are on the same network). All targets ship with the same config below, using S3 as the storage backend and Consul as the KV store for service discovery and the hash rings.
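For context, all client traffic goes through NGINX; a smoke-test push and query through the proxy look roughly like this (the hostname is a placeholder, the API paths are Loki's standard ones):

# push one line through the proxy to the write targets
curl -s -H "Content-Type: application/json" -X POST \
  "http://<nginx_host>/loki/api/v1/push" \
  --data-raw '{"streams":[{"stream":{"job":"smoke-test"},"values":[["'"$(date +%s%N)"'","hello"]]}]}'

# query it back through the proxy from the read targets
curl -s -G "http://<nginx_host>/loki/api/v1/query_range" \
  --data-urlencode 'query={job="smoke-test"}'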
Low-volume writes seem mostly OK, but when I load-tested the system with slightly bigger write-only loads, the write targets logged the following errors.
level=error ts=2024-08-01T00:00:00.0000000Z caller=client.go:243 msg="error getting path" key=loki/ring err="Get \"<consul_server_hostname:8500>/v1/kv/loki/ring?index=286535604&stale=&wait=10000ms\": context canceled"
level=error ts=2024-08-01T00:00:00.0000000Z caller=ratestore.go:303 msg="unable to get stream rates from ingester" ingester=10.0.0.0:9095 err="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 10.0.0.0:9095: connect: connection refused\""
These two errors pop up a lot when heavy writes occur; the read and backend targets, however, don't seem to log any errors at all. On the Consul server I can see that values are being read and written correctly by the Loki services. As for the "unable to get stream rates from ingester" error, I am not sure why it happens: it comes from a write target, which runs both an ingester and a distributor, so why would it be trying to connect to an ingester over the network at all?
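In case it helps, the ring can be inspected from both sides like this (hostnames are placeholders; the Consul key is the one from the error above):

# ingester ring as Loki itself sees it (served on the HTTP port of a write target)
curl -s http://<write_target>:3100/ring

# raw ring entry straight from Consul's KV store
curl -s "http://<consul_server_hostname>:8500/v1/kv/loki/ring?raw"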
target: "${GF_LOKI_TARGET}"
auth_enabled: false
server:
http_listen_address: 0.0.0.0
grpc_listen_address: 0.0.0.0
http_listen_port: 3100
grpc_listen_port: 9095
log_level: debug
common:
path_prefix: /loki
ring:
kvstore:
store: consul
prefix: loki/
consul:
host: <consul_server_hostname:8500>
compactor_address: "${GF_LOKI_BACKEND_HTTP}"
compactor_grpc_address: "${GF_LOKI_BACKEND_GRPC}"
storage_config:
index_cache_validity: 5m
tsdb_shipper:
active_index_directory: /loki/index
cache_location: /loki/index_cache
aws:
s3: "s3://${GF_LOKI_STORAGE_REGION}/${GF_LOKI_STORAGE_NAME}"
s3forcepathstyle: true
ingester:
chunk_retain_period: 6m # should be higher than index_cache_validity
chunk_idle_period: 10m
max_chunk_age: 10m
chunk_encoding: snappy
flush_op_timeout: 1m
schema_config:
configs:
- from: 2024-06-03
store: tsdb
object_store: s3
schema: v13
index:
prefix: index_
period: 24h
limits_config:
reject_old_samples: true
reject_old_samples_max_age: 90m
ingestion_rate_mb: 50
ingestion_burst_size_mb: 50
discover_service_name: []
# parallelize queries in 15min intervals
split_queries_by_interval: 15m
query_range:
# make queries more cache-able by aligning them with their step intervals
align_queries_with_step: true
cache_results: true
frontend:
log_queries_longer_than: 5s
query_scheduler:
max_outstanding_requests_per_tenant: 1024
querier:
query_ingesters_within: 2h
compactor:
retention_enabled: false
working_directory: /tmp/loki/compactor