Scalability and Optimal Throughput in Loki Benchmarking

I'm new to Loki, so thanks in advance for your help.

I am deploying a benchmarking cluster for Loki, and the problem I'm running into is that throughput does not scale linearly with cluster size: I can't drive CPU utilization to its maximum or reach the throughput I expect.

To give an overview of the benchmark cluster setup:

I am running the microservices deployment model, with ‘m’ distributor nodes (8 cores and 4 GB of RAM each) and ‘n’ ingester nodes (8 cores and 32 GB of RAM each).

The configuration I am using is shown below:

# Both the ingester and the distributor use this same config, except for the `target` field.
target: <ingester/distributor>
auth_enabled: false

http_prefix:
common:
  path_prefix: /data/loki
  replication_factor: 1
  ring:
    kvstore:
      # Backend storage to use for the ring.
      # Supported values are: consul, etcd, inmemory, memberlist, multi. (default "consul")
      store: etcd
      etcd:
        endpoints: ["http://etcd.endpoint:2379"]
server:
  http_listen_address: 0.0.0.0
  grpc_listen_address: 0.0.0.0
  grpc_server_max_recv_msg_size: 41943040
  grpc_server_max_send_msg_size: 41943040
  http_listen_port: 3100
  grpc_listen_port: 9095
  log_level: info
ingester_client:
  remote_timeout: 60s
ingester:
  lifecycler:
    join_after: 10s
    observe_period: 5s
  wal:
    enabled: true
    dir: /data/loki/wal
    flush_on_shutdown: true
  max_chunk_age: 1h
  chunk_retain_period: 30s
  chunk_encoding: snappy
  # sets each chunk to contain 10 blocks (10 MiB target / 1 MiB block)
  chunk_target_size: 10485760
  chunk_block_size: 1048576
  autoforget_unhealthy: true
  concurrent_flushes: 64
schema_config:
  configs:
    - from: 2020-10-24
      store: boltdb-shipper
      object_store: aws
      schema: v11
      index:
        prefix: index_
        period: 24h

limits_config:
  ingestion_rate_strategy: local
  ingestion_rate_mb: 1024
  ingestion_burst_size_mb: 2048
  per_stream_rate_limit: 1024MB
  per_stream_rate_limit_burst: 2048MB
  max_global_streams_per_user: 0

storage_config:
  boltdb_shipper:
    shared_store: aws
    shared_store_key_prefix: index/
    active_index_directory: /data/loki/boltdb-shipper-active
    cache_location: /data/loki/boltdb-shipper-cache
  aws:
    bucketnames: <bucketnames>
    endpoint: <endpoint>
    access_key_id: <ak>
    secret_access_key: <sk>
    insecure: true
    s3forcepathstyle: true

table_manager:
  retention_deletes_enabled: true
  retention_period: 48h

On the producer side, I am using a cluster of Java applications with metric reporting (a timer metric around each HTTP JSON send request) to push logs to Loki. To avoid the hot-spot issue, each application attaches only a single random label (value 0~32) to its logs, roughly as in the sketch below.
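
For reference, the send path is approximately the following minimal sketch. The distributor URL, the label name `shard`, and the hand-built JSON payload are placeholders of mine, and the manual timing stands in for the real timer metric; the sketch only illustrates pushing one entry with a single random label through Loki's JSON push endpoint (`/loki/api/v1/push`).

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.concurrent.ThreadLocalRandom;

public class LokiPushSketch {
    // Hypothetical distributor address; in the benchmark this would be one of the ‘m’ distributor nodes.
    private static final String PUSH_URL = "http://distributor.example:3100/loki/api/v1/push";
    private static final HttpClient CLIENT = HttpClient.newHttpClient();

    public static void main(String[] args) throws Exception {
        long start = System.nanoTime();
        int status = pushLine("hello from the benchmark producer");
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        // The real applications record this duration in a timer metric instead of printing it.
        System.out.println("status=" + status + " sendDurationMs=" + elapsedMs);
    }

    static int pushLine(String line) throws Exception {
        // Single random label value in 0..32, so streams are spread instead of hot-spotting on one.
        int shard = ThreadLocalRandom.current().nextInt(33);
        long tsNanos = System.currentTimeMillis() * 1_000_000L;

        // Loki JSON push payload: one stream with one (timestamp-in-nanoseconds, log line) entry.
        String body = "{\"streams\":[{\"stream\":{\"shard\":\"" + shard + "\"},"
                + "\"values\":[[\"" + tsNanos + "\",\"" + line + "\"]]}]}";

        HttpRequest request = HttpRequest.newBuilder(URI.create(PUSH_URL))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = CLIENT.send(request, HttpResponse.BodyHandlers.ofString());
        return response.statusCode(); // Loki returns 204 on a successful push.
    }
}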

Here is the issue I am seeing:

Initially, with ‘m’ set to 4 and ‘n’ set to 4, the average send duration was around 150 ms per request, and the CPU and bandwidth utilization of the Loki cluster stayed well within acceptable limits.

As I incrementally increase the number of producer clients, the duration of a single send request grows accordingly, which is expected behavior.

However, when I doubled the number of nodes in the Loki cluster to scale it up, the overall throughput of the cluster did not double with it. Notably, the distributor nodes were only at about half CPU utilization after scaling (which suggests the load is being balanced across all nodes).

I realize it is hard to give precise guidance without the full cluster details, but are there any common issues worth investigating in this situation? Any insights would be greatly appreciated.
