Improving Performance in Loki System for Production Use

Hello Loki Team,

I am currently running a Loki setup and am seeking guidance on improving its performance. Here are the details of my system configuration, setup, and the issue I am experiencing:

System Configuration:

  • Server: 16 GB RAM, 16-core CPU
  • Object store: Dell ECS (S3-compatible), 1 TB bucket
  • Loki Version: 2.8

Service: I am running Loki, Promtail, and Grafana in a Docker Compose setup.

Settings: Promtail is configured to ship logs from Kafka to Loki at a rate of approximately 850,000 logs per minute.
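For context, the Kafka ingestion side of my Promtail config looks roughly like the sketch below (the broker address, topic name, and Loki URL are placeholders, not my real values):

server:
  http_listen_port: 9080
positions:
  filename: /tmp/positions.yaml
clients:
  - url: http://loki:3100/loki/api/v1/push   # placeholder Loki push endpoint
scrape_configs:
  - job_name: kafka
    kafka:
      brokers:
        - kafka:9092      # placeholder broker address
      topics:
        - app-logs        # placeholder topic name
      group_id: promtail
      labels:
        job: kafka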

Issue: When querying the logs via Grafana, I am experiencing performance problems. A query over the last hour takes a significant amount of time, and a query over the last 30 days takes around 5 minutes to return just 1,000 logs, often ending in a timeout error. During the query, CPU usage spikes to 99% across all 16 cores, then drops back down once the query finishes or fails.

The logs come to about 500 MB compressed and 2.7 GB uncompressed. I have come across the term TSDB (Time Series Database) but am uncertain how to incorporate it into my setup, or whether it would require an additional server.
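From the documentation, TSDB appears to be Loki's newer index type rather than a separate database, so I assume it would be added as a new schema period along these lines, though I am not sure (the cutover date and directories are guesses on my part):

schema_config:
  configs:
    - from: 2023-01-01
      store: boltdb-shipper
      object_store: s3
      schema: v11
      index:
        prefix: index_
        period: 24h
    # new period switching the index to TSDB; the date has to be in the
    # future at the time the config is rolled out
    - from: 2023-08-01      # placeholder cutover date
      store: tsdb
      object_store: s3
      schema: v12
      index:
        prefix: tsdb_index_
        period: 24h
storage_config:
  tsdb_shipper:
    active_index_directory: /loki/tsdb-index    # placeholder directories
    cache_location: /loki/tsdb-cache
    shared_store: s3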

Objective: I am seeking advice on optimizing my Loki setup so that it handles heavy querying quickly and efficiently.

Loki Config:

auth_enabled: false
server:
  http_listen_port: 3100
common:
  path_prefix: /etc/loki/
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory
  storage:
    s3:
      endpoint: https://ecs.server.net
      insecure: false
      bucketnames: clickhouse_test_bucket
      access_key_id: clickhouse_test_user
      secret_access_key: secret-key
schema_config:
  configs:
    - from: 2023-01-01
      store: boltdb-shipper
      object_store: s3
      schema: v11
      index:
        prefix: index_
        period: 24h
storage_config:
  boltdb_shipper:
    active_index_directory: /etc/loki/index
    cache_location: /etc/loki/index_cache
    shared_store: s3
  aws:
    endpoint: https://ecs.server.net
    insecure: false
    bucketnames: clickhouse_test_bucket
    access_key_id: clickhouse_test_user
    secret_access_key: secret-key
    s3forcepathstyle: true
compactor:
  working_directory: /etc/loki/
  shared_store: s3
  compaction_interval: 5m
ruler:
  storage:
    s3:
      bucketnames: clickhouse_test_bucket
limits_config:
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 48h
  max_global_streams_per_user: 10000
  max_entries_limit_per_query: 50000
  ingestion_rate_mb: 4190
  ingestion_rate_strategy: global
  max_line_size: 100000
  query_timeout: 5m
frontend:
  log_queries_longer_than: 5m
  max_body_size: 1048576
  query_stats_enabled: false
  max_outstanding_per_tenant: 100
  querier_forget_delay: 0s
  scheduler_address: ""
  scheduler_dns_lookup_period: 10s
  scheduler_worker_concurrency: 5
ingester:
  lifecycler:
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1
  chunk_idle_period: 3m
  chunk_retain_period: 30s
  max_transfer_retries: 0
chunk_store_config:
  max_look_back_period: 0s
querier:
  engine:
    timeout: 5m

I have only been working with this service for three months, so I am relatively new to it, and I would greatly appreciate any suggestions on how to tune the service to handle heavy queries and improve speed. Thank you in advance for your help!

A lot of Loki’s performance gain comes from distribution. If it’s possible, I’d recommend looking into running Loki in simple scalable mode.
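A rough docker-compose sketch of what that could look like (service names and the version tag are illustrative, and you’d also need to switch the ring kvstore from inmemory to memberlist so the instances can join the same ring):

services:
  loki-write:
    image: grafana/loki:2.8.0
    command: -config.file=/etc/loki/config.yaml -target=write
    volumes:
      - ./loki-config.yaml:/etc/loki/config.yaml
  loki-read-1:
    image: grafana/loki:2.8.0
    command: -config.file=/etc/loki/config.yaml -target=read
    volumes:
      - ./loki-config.yaml:/etc/loki/config.yaml
  loki-read-2:
    image: grafana/loki:2.8.0
    command: -config.file=/etc/loki/config.yaml -target=read
    volumes:
      - ./loki-config.yaml:/etc/loki/config.yaml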

If you have to run a single instance, I’d recommend using the local file system instead of S3, since object storage will only slow Loki down without distribution. There was also a thread from someone else who was running Loki in monolithic mode; there might be some useful information in there for you as well: Loki became very slow after upgrading from 2.3 to 2.7.3 (and only works in certain circumstances)
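For the single-instance case, that would be something like this in the common block, in place of the s3 storage settings (directories are illustrative):

common:
  path_prefix: /loki
  storage:
    filesystem:
      chunks_directory: /loki/chunks
      rules_directory: /loki/rules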


@tonyswumac’s suggestion of split_queries_by_interval actually resolved the issue. Thank you very much, Tony!
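For anyone who finds this thread later: in Loki 2.8 this setting lives under limits_config, roughly like the sketch below (the exact values are illustrative and will need tuning for your own workload):

limits_config:
  split_queries_by_interval: 15m   # break long time ranges into 15m sub-queries
  max_query_parallelism: 32        # cap on sub-queries running at once
querier:
  max_concurrent: 8                # concurrent sub-queries per querier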
