Scalability and Optimal Throughput in Loki Benchmarking

I am new to Loki, so thank you in advance for your help.

I am deploying a benchmarking cluster for Loki, and I am having trouble getting throughput to scale evenly: I cannot drive CPU utilization to its maximum or reach the throughput I would expect.

To give an overview of the benchmark cluster setup:

I am running the microservices deployment model, with ‘m’ distributor nodes (each with 8 cores and 4 GB of RAM) and ‘n’ ingester nodes (each with 8 cores and 32 GB of RAM).

The configuration I have employed is outlined below:

# Both the ingester and the distributor use this identical config, except for the `target` field.
target: <ingester/distributor>
auth_enabled: false

http_prefix:
common:
  path_prefix: /data/loki
  replication_factor: 1
  ring:
    kvstore:
      # Backend storage to use for the ring.
      # Supported values are: consul, etcd, inmemory, memberlist, multi. (default "consul")
      store: etcd
      etcd:
        endpoints: ["http://etcd.endpoint:2379"]
server:
  http_listen_address: 0.0.0.0
  grpc_listen_address: 0.0.0.0
  grpc_server_max_recv_msg_size: 41943040
  grpc_server_max_send_msg_size: 41943040
  http_listen_port: 3100
  grpc_listen_port: 9095
  log_level: info
ingester_client:
  remote_timeout: 60s
ingester:
  lifecycler:
    join_after: 10s
    observe_period: 5s
  wal:
    enabled: true
    dir: /data/loki/wal
    flush_on_shutdown: true
  max_chunk_age: 1h
  chunk_retain_period: 30s
  chunk_encoding: snappy
  # sets each chunk to contain 10 blocks (10 MiB chunk target, 1 MiB blocks)
  chunk_target_size: 10485760
  chunk_block_size: 1048576
  autoforget_unhealthy: true
  concurrent_flushes: 64
schema_config:
  configs:
    - from: 2020-10-24
      store: boltdb-shipper
      object_store: aws
      schema: v11
      index:
        prefix: index_
        period: 24h

limits_config:
  ingestion_rate_strategy: local
  ingestion_rate_mb: 1024
  ingestion_burst_size_mb: 2048
  per_stream_rate_limit: 1024MB
  per_stream_rate_limit_burst: 2048MB
  max_global_streams_per_user: 0

storage_config:
  boltdb_shipper:
    shared_store: aws
    shared_store_key_prefix: index/
    active_index_directory: /data/loki/boltdb-shipper-active
    cache_location: /data/loki/boltdb-shipper-cache
  aws:
    bucketnames: <bucketnames>
    endpoint: <endpoint>
    access_key_id: <ak>
    secret_access_key: <sk>
    insecure: true
    s3forcepathstyle: true

table_manager:
  retention_deletes_enabled: true
  retention_period: 48h

On the producer side, I am using a cluster of Java applications with metric reporting (a timer metric around each HTTP JSON push request) to send logs to Loki. To avoid hot spots, each application sends its logs with only a single random label value (0~32).
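For reference, here is a simplified sketch of what each producer does per request; the endpoint, label name (bench_label), and log line are illustrative placeholders rather than the exact production code:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.concurrent.ThreadLocalRandom;

public class LokiBenchProducer {
    // Illustrative endpoint; the real producers go through a load balancer
    // sitting in front of the 'm' distributor nodes.
    private static final String PUSH_URL = "http://loki-distributor:3100/loki/api/v1/push";
    private static final HttpClient CLIENT = HttpClient.newHttpClient();

    public static void main(String[] args) throws Exception {
        // Single random label value in [0, 32] so streams spread across ingesters
        // instead of hot-spotting on one stream.
        int labelValue = ThreadLocalRandom.current().nextInt(0, 33);
        long timestampNs = System.currentTimeMillis() * 1_000_000L;

        // Loki JSON push payload: one stream, one log line.
        String body = String.format(
            "{\"streams\":[{\"stream\":{\"bench_label\":\"%d\"},"
                + "\"values\":[[\"%d\",\"benchmark log line\"]]}]}",
            labelValue, timestampNs);

        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create(PUSH_URL))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(body))
            .build();

        // The send duration reported by the timer metric is measured around this call.
        long start = System.nanoTime();
        HttpResponse<String> response =
            CLIENT.send(request, HttpResponse.BodyHandlers.ofString());
        long elapsedMs = (System.nanoTime() - start) / 1_000_000L;

        System.out.printf("status=%d send_duration_ms=%d%n",
            response.statusCode(), elapsedMs);
    }
}

The send durations quoted below are the averages of this timer around the push call.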

The issue at hand:

Initially, with ‘m’ set to 4 and ‘n’ set to 4, I achieved an average send duration of around 150 ms per request. The CPU and bandwidth utilization of the Loki cluster stayed well within acceptable limits.

As I incrementally increase the number of producer clients, the duration of a single send request grows in step, which is expected.

However, when I doubled the number of nodes in the Loki cluster to scale it out, the overall throughput of the cluster did not scale accordingly. Notably, the CPU load of the distributor nodes was only about half utilized after scaling (which suggests the load was balanced across all nodes).

I recognize that giving precise guidance without access to the specific cluster is difficult, but are there any common issues worth investigating in this situation? Your insights would be greatly appreciated.