How to improve Loki Performance

Hello Loki Team!

I would like some help tweaking the performance of our Loki setup. I’ve read blog posts about blazing speeds but can’t seem to achieve this. The read path is what I am trying to optimize.

The current setup is Loki 2.0 in microservices mode, deployed to AWS using S3 for chunks and DynamoDB for the index. We are using memberlist for the ring. ElastiCache Redis has been added as the cache.

We are running 2 query frontends, 3 distributors, 3 ingesters, 6 queriers and 1 table manager container. These all run in ECS Fargate, which may be part of the problem. chunk_retain_period is very low because Fargate does not have persistent storage. My thought is that the queriers have to pull everything from S3, but that would be the case for any query reaching back further than your chunk_retain_period. The benchmark in the blog post mentioned above runs over a 1 hour period; perhaps those chunks had not been flushed yet and were served from ingesters on persistent storage rather than from S3?

Anyway, please see my benchmarks below, and config further down.

Some benchmarking… (only queriers were scaled out)

6 Queriers over 48h of logs just straight up timed out.

12 Queriers…
time ./logcli query '{Environment="production",Application="test-app",Deployment="production",Service="test-service",LogGroup="test-service-logs"} |~ "(?i)pppppppppppp"' --since=48h --stats

Ingester.TotalReached 12
Ingester.TotalChunksMatched 72
Ingester.TotalBatches 0
Ingester.TotalLinesSent 0
Ingester.HeadChunkBytes 923 kB
Ingester.HeadChunkLines 755
Ingester.DecompressedBytes 85 MB
Ingester.DecompressedLines 63807
Ingester.CompressedBytes 14 MB
Ingester.TotalDuplicates 0
Store.TotalChunksRef 8144
Store.TotalChunksDownloaded 8144
Store.ChunksDownloadTime 8m38.330467263s
Store.HeadChunkBytes 0 B
Store.HeadChunkLines 0
Store.DecompressedBytes 59 GB
Store.DecompressedLines 44850257
Store.CompressedBytes 9.6 GB
Store.TotalDuplicates 0
Summary.BytesProcessedPerSecond 375 MB
Summary.LinesProcessedPerSecond 285792
Summary.TotalBytesProcessed 59 GB
Summary.TotalLinesProcessed 44914819
Summary.ExecTime 2m37.158656521s

real 2m37.267s
user 0m0.119s
sys 0m0.016s

24 Queriers…
Ingester.TotalReached 12
Ingester.TotalChunksMatched 56
Ingester.TotalBatches 0
Ingester.TotalLinesSent 0
Ingester.HeadChunkBytes 1.0 MB
Ingester.HeadChunkLines 841
Ingester.DecompressedBytes 84 MB
Ingester.DecompressedLines 63012
Ingester.CompressedBytes 14 MB
Ingester.TotalDuplicates 0
Store.TotalChunksRef 8122
Store.TotalChunksDownloaded 8122
Store.ChunksDownloadTime 5m31.764491701s
Store.HeadChunkBytes 0 B
Store.HeadChunkLines 0
Store.DecompressedBytes 59 GB
Store.DecompressedLines 44704040
Store.CompressedBytes 9.6 GB
Store.TotalDuplicates 0
Summary.BytesProcessedPerSecond 500 MB
Summary.LinesProcessedPerSecond 380884
Summary.TotalBytesProcessed 59 GB
Summary.TotalLinesProcessed 44767893
Summary.ExecTime 1m57.536630781s

real 1m57.730s
user 0m0.104s
sys 0m0.009s

There are some gains here, but I am still not seeing the performance showcased in the blog. I have the /metrics endpoint being ingested if you would like to see something specific.
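To put a number on "some gains", here is a rough sketch that recomputes throughput from the two runs above. The byte and time figures are copied from the --stats output; the per-querier split simply assumes the work is spread evenly across queriers, which may not be true in practice.

```python
# Figures copied from the logcli --stats output for the two runs.
# TotalBytesProcessed is reported rounded (59 GB), so results are approximate.
runs = {
    12: {"bytes_gb": 59, "exec_s": 2 * 60 + 37.158656521},
    24: {"bytes_gb": 59, "exec_s": 1 * 60 + 57.536630781},
}


def throughput_mb_s(bytes_gb, exec_s):
    """Overall throughput in MB/s (1 GB taken as 1000 MB)."""
    return bytes_gb * 1000 / exec_s


def summarize(runs):
    """Return total and per-querier throughput keyed by querier count."""
    summary = {}
    for queriers, run in sorted(runs.items()):
        total = throughput_mb_s(run["bytes_gb"], run["exec_s"])
        summary[queriers] = {
            "total_mb_s": total,
            # Assumption: work is spread evenly across all queriers.
            "per_querier_mb_s": total / queriers,
        }
    return summary


summary = summarize(runs)
speedup = runs[12]["exec_s"] / runs[24]["exec_s"]

for queriers, s in summary.items():
    print(f"{queriers} queriers: {s['total_mb_s']:.0f} MB/s total, "
          f"{s['per_querier_mb_s']:.1f} MB/s per querier")
print(f"Speedup from doubling queriers: {speedup:.2f}x")
```

By this estimate, doubling the queriers from 12 to 24 cut the execution time by about a quarter rather than half, and the per-querier rate dropped from roughly 31 MB/s to roughly 21 MB/s, so something other than querier count looks like the limit.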

Config below; LOKI_ prefixed vars are replaced on deploy.

auth_enabled: false

server:
  http_listen_address: 0.0.0.0
  http_listen_port: LOKI_HTTP_LISTEN_PORT
  grpc_listen_address: 0.0.0.0
  grpc_listen_port: LOKI_GRPC_LISTEN_PORT
  http_server_read_timeout: 3m
  http_server_write_timeout: 3m
  log_level: LOKI_LOG_LEVEL

memberlist:
  bind_port: LOKI_HTTP_MEMBERLIST_LISTEN_PORT

  join_members:
    - dns+LOKI_QUERIER_RECORD_NAME.LOKI_SERVICE_DISCOVERY_NAMESPACE:LOKI_HTTP_MEMBERLIST_LISTEN_PORT
    - dns+LOKI_DISTRIBUTOR_RECORD_NAME.LOKI_SERVICE_DISCOVERY_NAMESPACE:LOKI_HTTP_MEMBERLIST_LISTEN_PORT
    - dns+LOKI_INGESTER_RECORD_NAME.LOKI_SERVICE_DISCOVERY_NAMESPACE:LOKI_HTTP_MEMBERLIST_LISTEN_PORT

  max_join_backoff: 1m
  max_join_retries: 10
  min_join_backoff: 1s
  dead_node_reclaim_time: 30s
  gossip_to_dead_nodes_time: 15s
  left_ingesters_timeout: 30s

ingester:
  lifecycler:
    join_after: 60s
    observe_period: 5s
    interface_names:
      - "eth1"

    ring:
      kvstore:
        store: memberlist
      replication_factor: 3

    final_sleep: 0s

  chunk_idle_period: 15m
  max_chunk_age: 1h
  chunk_retain_period: 30s
  max_transfer_retries: 0
  chunk_target_size: 1536000
  chunk_block_size: 262144

querier:
  query_timeout: 2m
  query_ingesters_within: 2h

query_range:
  split_queries_by_interval: 30m
  align_queries_with_step: true
  max_retries: 5
  parallelise_shardable_queries: true
  cache_results: true

  results_cache:
    cache:
      redis:
        endpoint: LOKI_REDIS_ENDPOINT
        timeout: 1s
        db: 0

schema_config:
  configs:
    - from: 2020-09-01
      store: aws
      object_store: aws
      schema: v11
      index:
        prefix: LOKI_LOG_TABLE_PREFIX

        period: 168h

storage_config:
  aws:
    s3: s3://LOKI_AWS_REGION/LOKI_LOG_BUCKET_NAME

    dynamodb:
      dynamodb_url: dynamodb://LOKI_AWS_REGION

  index_cache_validity: 14m
  index_queries_cache_config:
    redis:
      endpoint: LOKI_REDIS_ENDPOINT
      timeout: 1s
      db: 1

limits_config:
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 168h
  ingestion_rate_mb: 16
  ingestion_burst_size_mb: 24
  max_entries_limit_per_query: 50000
  max_query_parallelism: 12

chunk_store_config:
  chunk_cache_config:
    redis:
      endpoint: LOKI_REDIS_ENDPOINT
      timeout: 1s
      db: 2

  write_dedupe_cache_config:
    redis:
      endpoint: LOKI_REDIS_ENDPOINT
      timeout: 1s
      db: 3

  cache_lookups_older_than: 36h
  max_look_back_period: 672h

table_manager:
  index_tables_provisioning:
    enable_ondemand_throughput_mode: true
    enable_inactive_throughput_on_demand_mode: true

  retention_deletes_enabled: true
  retention_period: 672h

frontend:
  log_queries_longer_than: 5s
  downstream_url: https://LOKI_QUERIER_RECORD_NAME.LOKI_PRIVATE_ZONE_DOMAIN_NAME
  compress_responses: true
  max_outstanding_per_tenant: 3600
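For reference, a back-of-the-envelope sketch of how the 48h query interacts with the split and parallelism settings in this config. It assumes, as I understand it, that the query frontend cuts the range into split_queries_by_interval pieces and schedules at most max_query_parallelism of them at once; it ignores sharding (parallelise_shardable_queries), which would increase the number of sub-queries further.

```python
import math
from datetime import timedelta

# Values taken from the config and the benchmark above.
query_range = timedelta(hours=48)         # --since=48h
split_interval = timedelta(minutes=30)    # query_range.split_queries_by_interval
max_query_parallelism = 12                # limits_config.max_query_parallelism
chunks_referenced = 8144                  # Store.TotalChunksRef from the 12-querier run

# Number of sub-queries the frontend produces from splitting alone
# (assumption: sharding is ignored here).
sub_queries = math.ceil(query_range / split_interval)

# Minimum number of scheduling rounds if only max_query_parallelism
# sub-queries are in flight at a time.
waves = math.ceil(sub_queries / max_query_parallelism)

# Average number of chunks each sub-query has to fetch.
chunks_per_sub_query = chunks_referenced / sub_queries

print(f"sub-queries: {sub_queries}")
print(f"scheduling rounds at parallelism {max_query_parallelism}: {waves}")
print(f"average chunks per sub-query: {chunks_per_sub_query:.1f}")
```

If that reading of max_query_parallelism is right, a single 48h query can keep at most 12 sub-queries in flight at once, which might explain why going from 12 to 24 queriers helped less than expected.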