Increase Memory and Prevent Node Crashes in Grafana Loki Cluster

Hello,
I’m encountering a critical issue in my Grafana Loki cluster setup, and I’m seeking advice and assistance from the community to resolve it. Here are the details of my configuration and the problem I’m facing:

Configuration:

I have a Grafana Loki cluster consisting of multiple nodes with varying hardware specifications:

Read Nodes:

  • RAM: 8GB
  • Cores: 4
  • Disk: 150GB
  • Quantity: 5 nodes

Write Nodes:

  • RAM: 8GB
  • Cores: 4
  • Disk: 150GB
  • Quantity: 4 nodes

Promtail Nodes:

  • RAM: 4GB
  • Cores: 4
  • Disk: 200GB
  • Quantity: 2 nodes

Issue:

The issue arises when Promtail transfers logs from 3 Kafka nodes at a rate of approximately 800 GB per second. This high ingestion rate causes problems on the Loki write nodes, primarily:

  1. RAM Usage: Memory usage on one of the write nodes increases significantly.
  2. Node Crashes: As a result of the increased memory usage, that write node eventually crashes.

Loki Configuration:

Here’s a portion of my Loki configuration (common section) to provide context:

auth_enabled: true

server:
  http_listen_address: 0.0.0.0
  grpc_listen_address: 0.0.0.0
  http_listen_port: 3100
  grpc_listen_port: 9095
  log_level: info

common:
  ring:
    kvstore:
      store: memberlist
  path_prefix: /etc/loki/
  storage:
    s3:
      access_key_id: access_key
      bucketnames: loki-main
      endpoint: https://s3.node.net
      insecure: false
      region: null
      s3: null
      s3forcepathstyle: true
      secret_access_key: secret_key
  compactor_address: http://node-read1:3100
  replication_factor: 3

memberlist:
  join_members:
    - node-read1:7946
    - node-read2:7946
    - node-read3:7946
    - node-read4:7946
    - node-read5:7946
    - node-write1:7946
    - node-write2:7946
    - node-write3:7946
    - node-write4:7946
  dead_node_reclaim_time: 30s
  gossip_to_dead_nodes_time: 15s
  left_ingesters_timeout: 30s
  gossip_interval: 2s

storage_config:
  boltdb_shipper:
    active_index_directory: /etc/loki/index
    build_per_tenant_index: true
    cache_location: /etc/loki/index_cache
    shared_store: s3
  tsdb_shipper:
    active_index_directory: /etc/loki/tsdb-index
    cache_location: /etc/loki/tsdb-cache
    shared_store: s3
  aws:
    access_key_id: access_key
    bucketnames: loki-main
    endpoint: https://s3.node.net
    insecure: false
    region: null
    s3: null
    s3forcepathstyle: true
    secret_access_key: secret_key

ingester:
  lifecycler:
    join_after: 10s
    observe_period: 5s
    ring:
      replication_factor: 1
      kvstore:
        store: memberlist
    final_sleep: 0s
  chunk_idle_period: 1m
  wal:
    enabled: true
    dir: /loki/wal
  max_chunk_age: 1m
  chunk_retain_period: 30s
  chunk_encoding: snappy
  chunk_target_size: 1.572864e+06
  chunk_block_size: 262144
  flush_op_timeout: 10s

ruler:
  enable_api: true
  enable_sharding: true  
  wal:
    dir: /loki/ruler-wal
  storage:
    s3:
      bucketnames: loki-data

distributor:
  ring:
    kvstore:
      store: memberlist

schema_config:
  configs:
  - from: 2020-08-01
    store: boltdb-shipper
    object_store: s3
    schema: v11
    index:
      prefix: index_
      period: 24h
  - from: 2023-07-11
    store: tsdb
    object_store: s3
    schema: v12
    index:
      prefix: index_
      period: 24h
    chunks:
      prefix: chunk_
      period: 24h

limits_config:
  max_cache_freshness_per_query: '10m'
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 48h
  ingestion_rate_mb: 10
  ingestion_burst_size_mb: 20
  # parallelize queries in 15min intervals
  split_queries_by_interval: 15m
  volume_enabled: true

chunk_store_config:
  max_look_back_period: 336h

table_manager:
  retention_deletes_enabled: true
  retention_period: 336h

query_range:
  # make queries more cache-able by aligning them with their step intervals
  align_queries_with_step: true
  max_retries: 5
  parallelise_shardable_queries: true
  cache_results: true

frontend:
  log_queries_longer_than: 5s
  compress_responses: true
  max_outstanding_per_tenant: 2048
  grpc_client_config:
    max_send_msg_size: 104857600
  parallelism: 9

query_scheduler:
  max_outstanding_requests_per_tenant: 32768

querier:
  query_ingesters_within: 2h

compactor:
  working_directory: /etc/loki/
  shared_store: s3
  compaction_interval: 5m

Docker Compose:

I’ve used Docker Compose to manage my Loki containers. Below are snippets of my Docker Compose configuration for both read and write nodes:

Read Nodes:

version: "3"

services:
  loki:
    image: grafana/loki:2.9.0
    network_mode: "host"
    volumes:
      - ./loki-config.yaml:/etc/loki/local-config.yaml
    command: -config.file=/etc/loki/local-config.yaml --target=read

Write Nodes:

version: "3"

services:
  loki:
    image: grafana/loki:2.9.0
    network_mode: "host"
    volumes:
      - ./loki-config.yaml:/etc/loki/local-config.yaml
    command: -config.file=/etc/loki/local-config.yaml --target=write

Promtail Configuration:

Here’s a snippet of my Promtail configuration for sending logs to Loki write nodes:

clients:
  - url: http://write-node1:3100/loki/api/v1/push
  - url: http://write-node2:3100/loki/api/v1/push
  - url: http://write-node3:3100/loki/api/v1/push
  - url: http://write-node4:3100/loki/api/v1/push

Request for Assistance:

I’m looking for guidance and recommendations on how to address the high RAM usage and node crashes on my Loki write nodes when dealing with such a high ingestion rate. Any advice on optimizing my configuration, resource allocation, or other best practices to ensure the stability and performance of my Loki cluster would be greatly appreciated.

Thank you for your help!


I increased the write nodes to 16 GB of RAM each, but one node still climbs to high memory usage.

Also, if that one write node goes down, I can't query from Grafana; it returns the error:
too many unhealthy instances in the ring

Are you really doing 800 GB per second? If so, I don't think I'm qualified to give you advice, other than that your instances are likely too few and too small for such a workload.

That said, I might still be able to point out a couple of things:

  1. Upgrade to 2.9.1. Version 2.9.0 has a bug that you'll want to avoid.
  2. You should have a load balancer in front of the writers (and the readers too), and configure Promtail to send to just one URL (the load balancer) instead of to all write nodes (see the first sketch after this list).
  3. Disable the table manager and rely on the compactor alone for retention (see the second sketch after this list).
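For point 2, here is a minimal sketch of what the Promtail side could look like once a load balancer is in place. The hostname loki-write-lb is just a placeholder for whatever address your load balancer answers on; everything else mirrors your existing clients block:

clients:
  # single entry pointing at the load balancer instead of one entry per write node
  # (loki-write-lb is a placeholder for your load balancer address)
  - url: http://loki-write-lb:3100/loki/api/v1/push

As far as I remember, with multiple entries under clients Promtail pushes every log line to each listed URL, so the current four-entry setup also multiplies the volume that actually hits the write path.

For point 3, a sketch of compactor-driven retention with the table_manager block removed, assuming Loki 2.9.x option names. The 336h value simply mirrors your current table manager retention; treat the rest as a starting point rather than a definitive config:

compactor:
  working_directory: /etc/loki/
  shared_store: s3
  compaction_interval: 5m
  # let the compactor handle retention instead of the table manager
  retention_enabled: true
  retention_delete_delay: 2h

limits_config:
  # same 336h you currently have in table_manager.retention_period
  retention_period: 336h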

I'm still seeing high RAM usage on one node, even with a load balancer in front of the writers and readers.

Hello Community,

I managed to get past the earlier issue by letting the affected node crash, waiting about an hour, and then restarting the container; the system is now running smoothly again. However, I've run into a new challenge: I cannot view logs in Grafana that are older than 24 hours. It seems to be a timeout issue, but I'm not sure whether it originates from Grafana, Nginx, or Loki.

Here’s a summary of my configurations:

  • Grafana: The query timeout in Grafana is set to below 300 seconds (5 minutes), which may be what causes the timeout when retrieving older logs.
  • Nginx: My Nginx configuration sets proxy_read_timeout to 30 minutes, which seems appropriate and should not be the cause.
  • Loki: In Loki, I've set log_queries_longer_than to 10 minutes and query_timeout to 30 minutes, which also look suitable (sketched below).
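For reference, here is roughly where those Loki-side timeouts sit in the config. This is sketched rather than copied verbatim from my file, using Loki 2.9.x option names; the values simply mirror the description above:

limits_config:
  # upper bound for a single query, described as 30 minutes above
  query_timeout: 30m

frontend:
  # only controls when a query is logged as slow; it does not cut queries off
  log_queries_longer_than: 10m

Keep in mind that the effective limit is the smallest timeout in the Grafana -> Nginx -> Loki chain, so whichever layer still has the 5-minute setting will be the one that cuts the query off.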

The issue is that I cannot view logs older than 24 hours in Grafana.
