Grafana-Loki performance optimization to load high-volume logs on a dashboard

We are using the Grafana-Loki OSS stack (Loki 2.9.11 and Grafana 9.2.2) for log analysis. We created a dashboard to load logs, but we hit performance problems when loading more than 2-3 days of logs onto it.

We recently upgraded Loki from 2.7.5 (boltdb-shipper) to 2.9.11 (TSDB), expecting better performance, but it has not improved much.

Setup: a Loki cluster of 4 servers (each 8 GB RAM, 2 vCPU) and 2 Grafana servers (each 32 GB RAM, 8 vCPU) behind a load balancer, all virtual machines on GCP.

Issue: With the above setup we can load only 3-4 days of logs at most, which is around 1-1.2 GB and 1-1.4 million (10-14 lakh) lines, from a single node with only a time filter on a specific log file. But we have around 40-50 nodes.

Target: In some cases we need to search about a month of logs from all the nodes for a specific unique ID. In another case we need to load at least a week of logs onto the dashboard for a different set of servers that produce a high volume of logs.
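
For illustration, the unique-ID query looks roughly like this in LogQL (the label names `job`/`host` and the ID value are placeholders, not our real labels):

```logql
# Narrow the streams first, then filter lines on the raw text.
# We avoid `| json` / `| logfmt` stages unless parsed fields are needed,
# since line filters are much cheaper than parsers.
{job="app", host=~"node.*"} |= "REQ-8f3a2c"
```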

Stats for reference: loading 3 days of logs on the dashboard took around 60-120 s; about 1 GB of data was processed, with 949,997 total rows, although the reported summary exec time was only 3.64 s. Resource consumption was around 30-40% of RAM and 15-20% of CPU.
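
A rough back-of-the-envelope check on those numbers (treating the reported 1 GB as GiB and taking the midpoint of the 60-120 s range, both assumptions) shows why this feels slow, the cluster is scanning on the order of only ~10 MiB/s for this query:

```python
# Figures taken from the stats above; wall_time_s is the midpoint of 60-120 s.
bytes_processed = 1 * 1024**3   # ~1 GiB processed (assumed GiB)
rows = 949_997                  # total rows returned
wall_time_s = 90                # assumed midpoint of the observed range

avg_line_bytes = bytes_processed / rows            # average size of one log line
throughput_mib_s = bytes_processed / wall_time_s / 1024**2

print(f"~{avg_line_bytes:.0f} B/line, ~{throughput_mib_s:.1f} MiB/s scan rate")
# → ~1130 B/line, ~11.4 MiB/s scan rate
```

At that scan rate, a month of logs from 40-50 nodes would take hours per query, so scaling the query path matters more than any single tunable.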

Configuration

auth_enabled: false

server:
  http_listen_port: 3100
  grpc_listen_port: 9096
  http_tls_config: &tls_server_config
    cert_file: /etc/loki/ssl-cert.pem
    key_file: /etc/loki/ssl-key.pem
  grpc_server_max_recv_msg_size: 104857600 # Set max receive size to 100 MB (default is 4MB)
  grpc_server_max_send_msg_size: 104857600  # Set max send size to 100 MB (default is 4MB)
  log_level: debug  # Revert to 'info' once troubleshooting is complete; debug logging is itself a performance cost

storage_config:
  boltdb_shipper:
    active_index_directory: /loki/boltdb-shipper-active
    cache_location: /loki/boltdb-shipper-cache
    cache_ttl: 24h
    shared_store: gcs
  tsdb_shipper:
    active_index_directory: /var/lib/loki/index
    cache_location: /var/lib/loki/cache
    cache_ttl: 24h
  gcs:
    bucket_name: loki-bucket

common:
  path_prefix: /var/lib/loki # Update this accordingly, data will be stored here.
  replication_factor: 3
  ring:
    kvstore:
      store: memberlist

query_range:
  results_cache:
    cache:
      embedded_cache:
        enabled: true
        max_size_mb: 500  # Increased from 100 MB to 500 MB for better performance
  parallelise_shardable_queries: true

query_scheduler:
  max_outstanding_requests_per_tenant: 8192

schema_config:
  configs:
    - from: 2020-10-24
      store: boltdb-shipper
      object_store: gcs
      schema: v11
      index:
        prefix: index_
        period: 24h  # Consider adjusting this if query performance issues arise
    - from: 2025-01-03
      store: tsdb
      object_store: gcs
      schema: v13
      index:
        prefix: index_
        period: 24h  # Consider adjusting this if query performance issues arise

limits_config:
  max_entries_limit_per_query: 5000000
  ingestion_rate_mb: 2048  # Increased from 1024 MB to 2048 MB to handle higher ingestion rates
  ingestion_burst_size_mb: 4096  # Increased to handle bursts
  split_queries_by_interval: 15m  # Moved from query_range to limits_config
  max_streams_per_user: 10000  # Increase the max streams per user if needed
  max_global_streams_per_user: 100000  # Increase the global max streams per user if needed
  per_stream_rate_limit: 15MB  # Increase the per-stream rate limit to 15MB
  per_stream_rate_limit_burst: 25MB  # Allow bursting up to 25MB
  query_timeout: 300s
  creation_grace_period: 10h
  retention_period: 365d   # 365 days
  max_query_lookback: 365d # 365 days
  tsdb_max_query_parallelism: 512

memberlist:
  join_members:
  # You can use a headless k8s service for all distributor, ingester and querier components.
   - loki11001:7946
   - loki11002:7946
   - loki11003:7946
   - loki11004:7946

ingester:
  lifecycler:
    ring:
      kvstore:
        store: memberlist
      replication_factor: 3
    final_sleep: 0s
  chunk_idle_period: 5m  # Adjusting to handle high-volume ingestion
  max_chunk_age: 2h
  chunk_retain_period: 30s
ingester_client:
  grpc_client_config:
    max_recv_msg_size: 104857600
    max_send_msg_size: 104857600
distributor:
  ring:
    kvstore:
      store: memberlist

frontend:
  compress_responses: true
  log_queries_longer_than: 5s  # Logs queries taking longer than 5 seconds for monitoring
  max_outstanding_per_tenant: 4096

compactor:
  working_directory: /var/lib/loki/compactor
  shared_store: gcs  # Ensure compactor is using the same storage backend
  retention_enabled: true
  delete_request_store: gcs

Is there any better way to reduce the request processing time through configuration tuning?
If not, what's the best infra setup to make it respond faster?

Thanks for your help!

Created a dashboard to load logs but facing performance problems like loading more than 2-3 days logs onto the dashboard.

Loaded 3 days logs on the dashboard. It took around 60-120secs and the total bytes were processed for about 1GB with total number rows 949997

So you are trying to load around a million 1 KB log lines into a Grafana dashboard? Grafana will likely eat all the available RAM and CPU displaying that many logs. Why do you need to load a million logs into a Grafana dashboard? A typical human can investigate at most a few hundred 1 KB logs, and even that takes at least 5 minutes; investigating a million logs would take years.

Probably, you need to look at some aggregate over the million logs instead. For example, the number of logs grouped by some field, so that fewer than 100 values are displayed in Grafana.
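
For example (hypothetical label names), an aggregation like this returns at most one series per host, no matter how many raw lines matched:

```logql
# Count matching lines per host in 1h buckets; the dashboard then renders
# a handful of series instead of a million raw log lines.
sum by (host) (
  count_over_time({job="app"} |= "REQ-8f3a2c" [1h])
)
```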

If you are trying to save a million logs from Loki into a file for further processing by some external tool, then that is quite a frequent use case. Loki limits the number of returned logs to 100 by default according to these docs (see the default value for the `limit` arg). If you set the limit to a million, Loki will likely fail to process the query because of lack of RAM and/or CPU :frowning: So Loki isn't well suited for exporting large volumes of logs matching some filter. I'd recommend trying VictoriaLogs for this task: it supports exporting an arbitrary number of matching logs in a single query, and it doesn't need a lot of RAM/CPU for this. See these docs for details.

Appreciate the responses…

I have dug deeper into the Loki configuration options and optimized the config accordingly. The result is below:

auth_enabled: false

compactor:
  working_directory: /var/lib/loki/compactor
  shared_store: gcs  # Ensure compactor is using the same storage backend
  retention_enabled: true
  delete_request_store: gcs
  retention_delete_delay: 24h

common:
  path_prefix: /var/lib/loki # Update this accordingly, data will be stored here.
  replication_factor: 3
  ring:
    kvstore:
      store: memberlist

distributor:
  ring:
    kvstore:
      store: memberlist

frontend:
  compress_responses: true  # Enable HTTP response compression (works with HTTPS)
  log_queries_longer_than: 5s  # Log queries taking longer than 5 seconds
  grpc_client_config:
    max_recv_msg_size: 268435456
    max_send_msg_size: 268435456
  scheduler_worker_concurrency: 20
  max_outstanding_per_tenant: 8192

frontend_worker:
  grpc_client_config:
    max_send_msg_size: 268435456 # 256MB
    max_recv_msg_size: 268435456 # 256MB

#index_gateway:
#  mode: ring
#  ring:
#    kvstore:
#      store: memberlist

ingester:
  chunk_idle_period: 5m  # Adjusting to handle high-volume ingestion
  chunk_retain_period: 30s
  lifecycler:
    ring:
      kvstore:
        store: memberlist
      replication_factor: 3
  wal:
    flush_on_shutdown: true

ingester_client:
  grpc_client_config:
    max_recv_msg_size: 268435456
    max_send_msg_size: 268435456

limits_config:
  creation_grace_period: 10h
  ingestion_rate_mb: 2048  # Increased from 1024 MB to 2048 MB to handle higher ingestion rates
  ingestion_burst_size_mb: 4096  # Increased to handle bursts
  max_entries_limit_per_query: 5000000
  max_global_streams_per_user: 100000  # Increase the global max streams per user if needed
  max_query_lookback: 365d # 365 days
  max_query_parallelism: 512
  max_streams_per_user: 0  # 0 disables the per-user stream limit
  per_stream_rate_limit: 25MB  # Increase the per-stream rate limit to 25MB
  per_stream_rate_limit_burst: 50MB  # Allow bursting up to 50MB
  retention_period: 365d   # 365 days
  query_timeout: 300s
  split_queries_by_interval: 15m  # Moved from query_range to limits_config
  tsdb_max_bytes_per_shard: 1024MB
  tsdb_max_query_parallelism: 2048

memberlist:
  join_members:
   - loki01:7946
   - loki02:7946
   - loki03:7946
   - loki04:7946

querier:
  max_concurrent: 15

query_range:
  cache_results: true
  results_cache:
    cache:
      embedded_cache:
        enabled: true
        max_size_mb: 500
  parallelise_shardable_queries: true

query_scheduler:
  max_outstanding_requests_per_tenant: 16384
  max_queue_hierarchy_levels: 0

schema_config:
  configs:
    - from: 2020-10-24
      store: boltdb-shipper
      object_store: gcs
      schema: v11
      index:
        prefix: index_
        period: 24h  # Consider adjusting this if query performance issues arise
    - from: 2025-01-03
      store: tsdb
      object_store: gcs
      schema: v13
      index:
        prefix: index_
        period: 24h  # Consider adjusting this if query performance issues arise

server:
  http_listen_port: 3100
  grpc_listen_port: 9096
  http_tls_config: &tls_server_config
    cert_file: /etc/loki/ssl-cert.pem
    key_file: /etc/loki/ssl-key.pem
  grpc_server_max_concurrent_streams: 1000
  grpc_server_max_recv_msg_size: 268435456 # Set max receive size to 256 MB
  grpc_server_max_send_msg_size: 268435456  # Set max send size to 256 MB
  http_server_idle_timeout: 120s
  http_server_read_timeout: 120s
  http_server_write_timeout: 120s

  log_level: debug  # Revert to 'info' once troubleshooting is complete

storage_config:
  boltdb_shipper:
    active_index_directory: /loki/boltdb-shipper-active
    cache_location: /loki/boltdb-shipper-cache
    cache_ttl: 24h
    shared_store: gcs
  tsdb_shipper:
    active_index_directory: /var/lib/loki/index
    cache_location: /var/lib/loki/cache
    cache_ttl: 24h
  gcs:
    bucket_name: loki-bucket
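
One thing I have not enabled yet is a chunk cache, so repeated queries re-download chunks from GCS every time. I am considering adding something like the following (untested; the size is a guess given the 8 GB nodes):

```yaml
chunk_store_config:
  chunk_cache_config:
    embedded_cache:
      enabled: true
      max_size_mb: 1024  # guess; must leave headroom on the 8 GB nodes
```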

I would be happy to receive feedback on this configuration: whether it needs any modifications, or whether new parameters should be added.

Thanks in advance…!