Loki 2.9.2 ingester out of memory even with minimal settings when we send the query-frontend around 10 requests per second

We are running the setup below:
2x query-frontend, 2x querier, 5x ingester, 3x distributor, 2x ruler, 1x compactor
All machines have 10 CPUs and 128 GB RAM.
Loki version is 2.9.2 with the TSDB schema.

Now, I have no problem doing 255 Mbps ingestion (10 Mbps per stream), even more; it handles the load perfectly fine.
However, when I run a combined load test, i.e. sending around 255 Mbps (total) to the distributors and 10-15 query requests per second to the frontend at the same time,
I have noticed that the querier and query-frontend have no resource problems,
but somehow the ingesters get OOM killed.

I initially thought this had to do with one of the following settings:
chunk_idle_period
max_concurrent
query_ingesters_within
max_query_parallelism, tsdb_max_query_parallelism
so I brought them down (see the config attached; the sketch right below shows where each one sits). No improvement.
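For reference, here is where each of those settings sits, with the reduced values I ended up with (same as in the full config further down):

ingester:
  chunk_idle_period: 4m
querier:
  max_concurrent: 4
  query_ingesters_within: 10m
limits_config:
  max_query_parallelism: 8
  tsdb_max_query_parallelism: 40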

So I enabled all metric collection and noticed a sudden increase in loki_logql_querystats_ingester_sent_lines_total when I start querying.
I am not totally sure what this metric means.
Are all queries going to the ingesters even if I set query_ingesters_within to a low value?

Is anything wrong in the config?
Is 10 log queries per second too much for this setup?
If yes, how do we put some restriction on this so the ingesters don't get OOM killed?

All queries are log queries.
Range: now to now-30 days.

Below are the full config (some values obfuscated as XXXX) and the metrics I collected.

server:
  grpc_listen_address: 127.0.0.1
  grpc_listen_port: 49205
  grpc_server_max_recv_msg_size: 75165824
  grpc_server_max_send_msg_size: 75165824
  http_listen_address: 127.0.0.1
  http_listen_port: 49105
  http_server_read_timeout: 5m
  http_server_write_timeout: 5m


schema_config:
  configs:
  - from: 2021-01-01
    index:
      period: 24h
      prefix: index_
    object_store: aws
    schema: v11
    store: boltdb-shipper
  - from: '2024-03-01'
    index:
      period: 24h
      prefix: tsdb_index_
    object_store: aws
    schema: v12
    store: tsdb

storage_config:
  aws:
    bucketnames: XXXX
    endpoint: https://XXXX/
    region: us-east-1
    s3: loki
    s3forcepathstyle: true

  boltdb_shipper:
    active_index_directory: /XXXX/boltdb-shipper-active
    cache_location: /XXXXXX/boltdb-shipper-cache
    cache_ttl: 24h
    shared_store: aws
  tsdb_shipper:
    active_index_directory: XXX/tsdb-index
    cache_location: XXXXX/tsdb-cache
    cache_ttl: 24h

ingester:
  chunk_idle_period: 4m
  chunk_retain_period: 30s
  chunk_target_size: 1572864
  lifecycler:
    address: XXXX
    final_sleep: 0s
    port: XXX
    ring:
      kvstore:
        store: memberlist
      replication_factor: 3
  max_chunk_age: 10m
  max_transfer_retries: 0
  wal:
    dir: /XXXX
    enabled: true
    replay_memory_ceiling: 8GB

distributor:
  {}


querier:
  max_concurrent: 4
  multi_tenant_queries_enabled: true
  query_ingesters_within: 10m
  query_store_only: false


query_range:
  align_queries_with_step: true
  cache_results: true
  max_retries: 1
  results_cache:
    cache:
      embedded_cache:
        enabled: true
        max_size_mb: 1024
        ttl: 1h


frontend:
  compress_responses: false
  log_queries_longer_than: 15s


frontend_worker:
  frontend_address: XXXX:25206
  parallelism: 4

common:
  compactor_address: XXX:29105
  compactor_grpc_address: XXX:29205


compactor:
  retention_enabled: true
  shared_store: aws
  working_directory: XXXX/tsdb-compactor


table_manager:
  chunk_tables_provisioning:
    inactive_read_throughput: 0
    inactive_write_throughput: 0
    provisioned_read_throughput: 0
    provisioned_write_throughput: 0
  index_tables_provisioning:
    inactive_read_throughput: 0
    inactive_write_throughput: 0
    provisioned_read_throughput: 0
    provisioned_write_throughput: 0


limits_config:
  cardinality_limit: 100000
  enforce_metric_name: false
  ingestion_burst_size_mb: 20
  ingestion_rate_mb: 15
  ingestion_rate_strategy: local
  max_chunks_per_query: 1000000
  max_entries_limit_per_query: 5000
  max_global_streams_per_user: 11000
  max_query_length: 30d
  max_query_lookback: 30d
  max_query_parallelism: 8
  max_query_range: 30d
  max_query_series: 500
  per_stream_rate_limit: 10MB
  per_stream_rate_limit_burst: 20MB
  query_timeout: 1m
  reject_old_samples: true
  reject_old_samples_max_age: 168h
  retention_period: 744h
  split_queries_by_interval: 15m0s
  tsdb_max_query_parallelism: 40  # I know this is too low; I was just checking whether it is a contributing factor

Ingester sent lines to the querier / query-frontend during the high-TPS query test:

Ingester memory used, all nodes with 128 GB RAM (with no requests to the query-frontend, normal utilization is around 14 GB from ingestion only):
All ingesters: memory used

Yes, I know chunk utilization is low.
That is because I wanted to flush chunks to the backend sooner,
to see if that would help the ingesters.
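For clarity, these are the chunk settings driving that, taken from the ingester block above; the short idle period and chunk age are deliberate for this test, which is why chunks flush before reaching the target size:

ingester:
  chunk_idle_period: 4m       # flush a chunk after only 4 minutes without new data
  max_chunk_age: 10m          # force-flush chunks after 10 minutes regardless of fill level
  chunk_target_size: 1572864  # ~1.5 MB target, rarely reached with the early flushes above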


Just wanted to add one note:
the 10 requests per second to the frontend are being generated using JMeter with LogQL queries.
Like I mentioned, all queries are almost the same, spanning the range now to now-30 days.

They hit the frontend VIP:

https://<frontend_vip>:25106/loki/api/v1/query_range?query=%7Bid=%xxxx%7Bapplication=%22xxxxi%22%7D&%7b__tenant_id__%3d%22asdasd193206-qa%22%7d+%7c%7e+%22error%7cexception%22&end=1710923141&start=1710318341

So it is filtering for the word "error" with LogQL.

It looks like after splitting, the first (most recent) part of every query goes to the ingesters.
How do I put a limit on that?
How do I limit the number of log lines being requested from the ingesters in this case?
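As far as I can tell there is no single setting that directly caps the lines pulled from the ingesters, so the knobs I am looking at are the ones already in my config above (a sketch with my current values; whether these are actually enforced on the ingester path in this version is exactly what I am unsure about):

querier:
  query_ingesters_within: 10m        # only the slice of a query newer than now-10m should be served by ingesters
limits_config:
  split_queries_by_interval: 15m     # a 30d query becomes 15m subqueries; only the newest split(s)
                                     # overlapping the 10m window should touch the ingesters
  max_entries_limit_per_query: 5000  # upper bound on log lines returned for a query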

Interesting. One would think that with the chunks already in memory it would not take many additional resources to query them. I did notice the flush count spike as well.

How about if you try to keep chunks in the ingesters longer? Let's say we change:

max_chunk_age: 2h
chunk_idle_period: 90m
query_ingesters_within: 3h

Just as a test and see what happens.
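In the config you posted, those would land roughly here (just a sketch of the placement, using the same blocks you already have):

ingester:
  max_chunk_age: 2h
  chunk_idle_period: 90m
querier:
  query_ingesters_within: 3h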


Hi @tonyswumac. I spent many days troubleshooting this and found that the issue shows up only with the new Loki 2.9+ binaries, meaning if I keep the config file the same and just use the 2.8.2 binaries, it works completely fine.

So what causes the ingester to go OOM in 2.9?
Well, from the stress tests we found the following:

A. With the new binaries, the frontend_worker is not properly queuing up requests.
Meaning: if the request rate is too high, frontend_queue_length should increase
and the queriers should gradually pull from the queue (based on the allowed parallelism and max_concurrent),
but that does not seem to be the case with the new version.
I see the queue length hardly going up; it just tries to execute everything at once,
and since most of the queries cover a (now - X) range, they are all trying to fetch data from the ingesters at the same time.

B. Moreover, it seems that max_chunks_per_query is not working in 2.9;
I see there is an open issue for it.

The combined result is that the ingesters get OOM killed when you are ingesting and querying at the same time (ingestion around 250 Mbps and a read rate of about 10 requests per second, generated with JMeter).
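For anyone else hitting this, the settings that are supposed to bound how much query work runs at once are, as far as I understand, roughly these (a sketch; max_outstanding_per_tenant is not in my config above, and the value shown is just an example, so verify the exact name and default against the docs for your version):

frontend:
  max_outstanding_per_tenant: 256  # requests allowed to wait in the frontend queue per tenant
                                   # (assumed setting, not in my config; verify for your version)
frontend_worker:
  parallelism: 4                   # concurrent requests each querier worker processes per frontend
querier:
  max_concurrent: 4                # subqueries a single querier executes at once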

Here is a comparison of two runs:
5.00 - 5.30 using Loki 2.9.4
6.10 - 6.25 using Loki 2.8.2
NOTE: I haven't touched the config at all; the same config I posted above is being used.

See how in the first run the frontend does not queue requests at all;
the queriers try to execute absolutely everything that is sent to the frontend's query_range API.

Memory consumption by the ingesters:

Proof that the load was the same on the ingestion side:


Good to know. I'd say it's probably more related to the max chunk limit not working, then.