I'm new to Loki, so thanks in advance for your help.
I'm deploying a benchmarking cluster for Loki, and in short, throughput does not scale proportionally as I add nodes: I can't drive CPU utilization to its limit or reach the throughput I'd expect.
I'm running the microservices deployment mode, with `m` distributor nodes (8 cores, 4GB RAM each) and `n` ingester nodes (8 cores, 32GB RAM each).
Here is the configuration I'm using:
```yaml
# Both the ingester and the distributor use this identical config, except for the `target` field.
target: <ingester/distributor>
auth_enabled: false
http_prefix:

common:
  path_prefix: /data/loki
  replication_factor: 1
  ring:
    kvstore:
      # Backend storage to use for the ring.
      # Supported values are: consul, etcd, inmemory, memberlist, multi. (default "consul")
      store: etcd
      etcd:
        endpoints: ["http://etcd.endpoint:2379"]

server:
  http_listen_address: 0.0.0.0
  grpc_listen_address: 0.0.0.0
  grpc_server_max_recv_msg_size: 41943040
  grpc_server_max_send_msg_size: 41943040
  http_listen_port: 3100
  grpc_listen_port: 9095
  log_level: info

ingester_client:
  remote_timeout: 60s

ingester:
  lifecycler:
    join_after: 10s
    observe_period: 5s
  wal:
    enabled: true
    dir: /data/loki/wal
    flush_on_shutdown: true
  max_chunk_age: 1h
  chunk_retain_period: 30s
  chunk_encoding: snappy
  # Sets each chunk to contain 10 blocks.
  chunk_target_size: 10485760
  chunk_block_size: 1048576
  autoforget_unhealthy: true
  concurrent_flushes: 64

schema_config:
  configs:
    - from: 2020-10-24
      store: boltdb-shipper
      object_store: aws
      schema: v11
      index:
        prefix: index_
        period: 24h

limits_config:
  ingestion_rate_strategy: local
  ingestion_rate_mb: 1024
  ingestion_burst_size_mb: 2048
  per_stream_rate_limit: 1024MB
  per_stream_rate_limit_burst: 2048MB
  max_global_streams_per_user: 0

storage_config:
  boltdb_shipper:
    shared_store: aws
    shared_store_key_prefix: index/
    active_index_directory: /data/loki/boltdb-shipper-active
    cache_location: /data/loki/boltdb-shipper-cache
  aws:
    bucketnames: <bucketnames>
    endpoint: <endpoint>
    access_key_id: <ak>
    secret_access_key: <sk>
    insecure: true
    s3forcepathstyle: true

table_manager:
  retention_deletes_enabled: true
  retention_period: 48h
```
On the producer side, a cluster of Java applications pushes logs to Loki over HTTP JSON, with timer metrics recording the duration of each push request. To avoid hot spots, each application attaches only a single label whose value is a random integer in the range 0~32, roughly as in the sketch below.
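For reference, each producer request is essentially equivalent to this minimal sketch of the standard Loki push API call (the class name, the `shard` label name, and the distributor URL are placeholders I made up; the real applications also wrap the call in a timer metric):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.concurrent.ThreadLocalRandom;

public class LokiPushExample {
    private static final HttpClient CLIENT = HttpClient.newHttpClient();

    // Hypothetical distributor address; the benchmark points this at a
    // load balancer or one of the `m` distributor nodes.
    private static final String PUSH_URL = "http://distributor:3100/loki/api/v1/push";

    public static void push(String line) throws Exception {
        // Single label whose value is a random integer in [0, 32], so the
        // benchmark produces at most 33 streams and no single stream is hot.
        int shard = ThreadLocalRandom.current().nextInt(33);
        long tsNanos = System.currentTimeMillis() * 1_000_000L; // Loki expects nanoseconds

        // Standard Loki push API payload (assumes `line` needs no JSON escaping).
        String body = String.format(
            "{\"streams\":[{\"stream\":{\"shard\":\"%d\"},"
          + "\"values\":[[\"%d\",\"%s\"]]}]}",
            shard, tsNanos, line);

        HttpRequest request = HttpRequest.newBuilder(URI.create(PUSH_URL))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(body))
            .build();

        HttpResponse<String> response =
            CLIENT.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 204) { // Loki returns 204 No Content on success
            throw new IllegalStateException(
                "push failed: " + response.statusCode() + " " + response.body());
        }
    }
}
```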
Initially, with `m` = 4 and `n` = 4, the average send duration was around 150ms per request, and the Loki cluster's CPU and bandwidth utilization stayed well below their limits.
As I gradually add more producer clients, the duration of a single send request grows accordingly, which is expected behavior.
However, when I doubled the number of nodes in the Loki cluster to scale it up, the cluster's overall throughput did not scale proportionally. Notably, after scaling, the CPU of the distributor nodes was only about half utilized (which suggests the load is balanced across all nodes, just not saturating them).
I realize it's hard to give precise guidance without access to the specific cluster, but are there any common issues worth investigating in this situation? Any insights would be greatly appreciated.