I am new to Loki, so thank you in advance for any help.
I am deploying a benchmarking cluster for Loki and cannot get throughput to scale in proportion to cluster size. In other words, I am unable to saturate the CPUs or reach the throughput I would expect from the hardware.
The benchmark cluster runs in microservices mode, with ‘m’ distributor nodes (8 cores and 4 GB of RAM each) and ‘n’ ingester nodes (8 cores and 32 GB of RAM each).
The configuration I have employed is outlined below:
# The ingesters and distributors use this same config, except for the `target` field.
target: <ingester/distributor>
auth_enabled: false
path_prefix: /data/loki
replication_factor: 1
# Backend storage to use for the ring.
# Supported values are: consul, etcd, inmemory, memberlist, multi. (default "consul")
store: etcd
endpoints: ["http://etcd.endpoint:2379"]
grpc_server_max_recv_msg_size: 41943040
grpc_server_max_send_msg_size: 41943040
http_listen_port: 3100
grpc_listen_port: 9095
log_level: info
remote_timeout: 60s
join_after: 10s
observe_period: 5s
enabled: true
dir: /data/loki/wal
flush_on_shutdown: true
max_chunk_age: 1h
chunk_retain_period: 30s
chunk_encoding: snappy
# Each chunk holds 10 blocks (chunk_target_size / chunk_block_size).
chunk_target_size: 10485760
chunk_block_size: 1048576
autoforget_unhealthy: true
concurrent_flushes: 64
- from: 2020-10-24
store: boltdb-shipper
object_store: aws
schema: v11
prefix: index_
period: 24h
ingestion_rate_strategy: local
ingestion_rate_mb: 1024
ingestion_burst_size_mb: 2048
per_stream_rate_limit: 1024MB
per_stream_rate_limit_burst: 2048MB
max_global_streams_per_user: 0
shared_store: aws
shared_store_key_prefix: index/
active_index_directory: /data/loki/boltdb-shipper-active
cache_location: /data/loki/boltdb-shipper-cache
bucketnames: <bucketnames>
endpoint: <endpoint>
access_key_id: <ak>
secret_access_key: <sk>
insecure: true
s3forcepathstyle: true
retention_deletes_enabled: true
retention_period: 48h
On the producer side, I am using a cluster of Java applications that send logs to Loki as HTTP JSON requests and report metrics on them (a timer around each send). To avoid hot spots, each application sends its logs with a single label whose value is chosen at random from 0 to 32.
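To make the producer behaviour concrete, here is a minimal Python sketch of the request body each client builds. The real producers are Java applications; this is only an illustration of the payload shape for Loki's `/loki/api/v1/push` endpoint. The label name `bucket` and the helper name `build_push_payload` are made up for this example, and I am assuming the range 0 to 32 is inclusive.

```python
import json
import random
import time


def build_push_payload(lines, label_name="bucket", buckets=33, rng=random):
    """Build a JSON-serializable body for POST /loki/api/v1/push.

    label_name and buckets are illustrative assumptions: one label whose
    value is drawn at random from 0..32, so each request lands in one of
    33 possible streams.
    """
    label_value = str(rng.randrange(buckets))  # "0" .. "32"
    now_ns = time.time_ns()
    # Each entry is [<unix epoch in nanoseconds, as a string>, <log line>].
    values = [[str(now_ns + i), line] for i, line in enumerate(lines)]
    return {"streams": [{"stream": {label_name: label_value}, "values": values}]}


payload = build_push_payload(["first log line", "second log line"])
body = json.dumps(payload)  # what would be sent with Content-Type: application/json
```

With only 33 distinct label values there are at most 33 active streams across the whole cluster, however many ingesters are running.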
The problem is as follows:
Initially, with ‘m’ set to 4 and ‘n’ set to 4, I was able to achieve an average send duration of around 150ms per request. The CPU and bandwidth utilization of the Loki cluster remained well within acceptable limits.
As I gradually increased the number of producer clients, the duration of a single send request rose accordingly, which is expected.
However, when I doubled the number of nodes in the Loki cluster, the overall throughput of the cluster did not scale proportionally. Notably, the CPU of each distributor node was only about half utilized after scaling (which suggests that the load is being balanced across all nodes).
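One sanity check I can think of, sketched in Python under assumptions not measured above: if each producer waits for a response before sending its next request, Little's law caps throughput at (requests in flight) / (latency), regardless of how many Loki nodes there are. The function names and all numbers except the 150ms latency are hypothetical placeholders, not results from my cluster.

```python
def max_closed_loop_throughput(clients, in_flight_per_client, latency_s):
    """Upper bound on requests/s when every client waits for its responses.

    Little's law: throughput = concurrency / latency.
    """
    return clients * in_flight_per_client / latency_s


def scaling_efficiency(throughput_before, throughput_after, node_factor):
    """Fraction of ideal linear scaling achieved (1.0 means perfectly linear)."""
    return (throughput_after / throughput_before) / node_factor


# Hypothetical: 40 synchronous clients, one request in flight each, 150 ms per request.
cap = max_closed_loop_throughput(clients=40, in_flight_per_client=1, latency_s=0.150)

# Hypothetical: throughput went from 1000 to 1200 req/s after doubling the nodes.
efficiency = scaling_efficiency(1000, 1200, node_factor=2)
```

If the measured throughput sits close to `cap`, the producers rather than Loki would be the limit, and adding Loki nodes would only help to the extent that it lowers per-request latency.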
I realize it is hard to give precise guidance without access to the cluster itself, but are there any common issues worth investigating in this situation? Any insights would be greatly appreciated.