We are running the setup below:
2x query frontend, 2x querier, 5x ingester, 3x distributor, 2x ruler, 1x compactor
All machines have 10 CPUs and 128 GB RAM.
Loki version is 2.9.2, TSDB schema.
Ingestion alone is no problem: 255 MB/s total (10 MB/s per stream), even more; it handles that load perfectly fine.
However, when I run a combined load test, i.e. sending around 255 MB/s (total) to the distributors and 10-15 query requests per second to the query frontend at the same time,
I notice that the queriers and the query frontend have no resource problems,
but somehow the ingesters get OOM-killed.
I initially thought this had to do with one of the following settings:
chunk_idle_period
max_concurrent
query_ingesters_within
max_query_parallelism, tsdb_max_query_parallelism
so I brought them down (see the config attached below). No improvement.
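For reference, this is where each of those knobs lives in the config; the values shown are the same ones in the full config attached further below:
ingester:
  chunk_idle_period: 4m          # flush idle chunks sooner
querier:
  max_concurrent: 4              # fewer concurrent subqueries per querier
  query_ingesters_within: 10m    # only ask ingesters for very recent data
limits_config:
  max_query_parallelism: 8
  tsdb_max_query_parallelism: 40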
So I enabled all metric collection and noticed a sudden increase in loki_logql_querystats_ingester_sent_lines_total when I start querying.
I am not totally sure what this metric means.
Are all queries going to the ingesters even if I set query_ingesters_within to a low value?
Is anything wrong in the config?
Are 10 log queries per second too much for the given setup?
If yes, how do we put some restriction on this so the ingesters don't get OOM-killed? (I sketch the kind of limits I have in mind right after the query details below.)
All queries are log queries,
range: now to now-30d.
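To make the restriction question concrete, something like the following is the kind of per-tenant query throttling I have in mind; the values are placeholders and I have not tested whether they actually protect the ingesters:
frontend:
  max_outstanding_per_tenant: 256   # cap queued queries per tenant at the query frontend
limits_config:
  max_queriers_per_tenant: 2        # limit how many queriers can serve a single tenant
  max_chunks_per_query: 100000      # much lower than my current 1000000
  query_timeout: 1m                 # already set today; bounds runaway queries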
Below is the full config (some values obfuscated as XXXX) and the metrics I collected.
server:
  grpc_listen_address: 127.0.0.1
  grpc_listen_port: 49205
  grpc_server_max_recv_msg_size: 75165824
  grpc_server_max_send_msg_size: 75165824
  http_listen_address: 127.0.0.1
  http_listen_port: 49105
  http_server_read_timeout: 5m
  http_server_write_timeout: 5m
schema_config:
  configs:
    - from: 2021-01-01
      index:
        period: 24h
        prefix: index_
      object_store: aws
      schema: v11
      store: boltdb-shipper
    - from: '2024-03-01'
      index:
        period: 24h
        prefix: tsdb_index_
      object_store: aws
      schema: v12
      store: tsdb
storage_config:
  aws:
    bucketnames: XXXX
    endpoint: https://XXXX/
    region: us-east-1
    s3: loki
    s3forcepathstyle: true
  boltdb_shipper:
    active_index_directory: /XXXX/boltdb-shipper-active
    cache_location: /XXXXXX/boltdb-shipper-cache
    cache_ttl: 24h
    shared_store: aws
  tsdb_shipper:
    active_index_directory: XXX/tsdb-index
    cache_location: XXXXX/tsdb-cache
    cache_ttl: 24h
ingester:
  chunk_idle_period: 4m
  chunk_retain_period: 30s
  chunk_target_size: 1572864
  lifecycler:
    address: XXXX
    final_sleep: 0s
    port: XXX
    ring:
      kvstore:
        store: memberlist
      replication_factor: 3
  max_chunk_age: 10m
  max_transfer_retries: 0
  wal:
    dir: /XXXX
    enabled: true
    replay_memory_ceiling: 8GB
distributor: {}
querier:
  max_concurrent: 4
  multi_tenant_queries_enabled: true
  query_ingesters_within: 10m
  query_store_only: false
query_range:
  align_queries_with_step: true
  cache_results: true
  max_retries: 1
  results_cache:
    cache:
      embedded_cache:
        enabled: true
        max_size_mb: 1024
        ttl: 1h
frontend:
  compress_responses: false
  log_queries_longer_than: 15s
frontend_worker:
  frontend_address: XXXX:25206
  parallelism: 4
common:
  compactor_address: XXX:29105
  compactor_grpc_address: XXX:29205
compactor:
  retention_enabled: true
  shared_store: aws
  working_directory: XXXX/tsdb-compactor
table_manager:
  chunk_tables_provisioning:
    inactive_read_throughput: 0
    inactive_write_throughput: 0
    provisioned_read_throughput: 0
    provisioned_write_throughput: 0
  index_tables_provisioning:
    inactive_read_throughput: 0
    inactive_write_throughput: 0
    provisioned_read_throughput: 0
    provisioned_write_throughput: 0
limits_config:
  cardinality_limit: 100000
  enforce_metric_name: false
  ingestion_burst_size_mb: 20
  ingestion_rate_mb: 15
  ingestion_rate_strategy: local
  max_chunks_per_query: 1000000
  max_entries_limit_per_query: 5000
  max_global_streams_per_user: 11000
  max_query_length: 30d
  max_query_lookback: 30d
  max_query_parallelism: 8
  max_query_range: 30d
  max_query_series: 500
  per_stream_rate_limit: 10MB
  per_stream_rate_limit_burst: 20MB
  query_timeout: 1m
  reject_old_samples: true
  reject_old_samples_max_age: 168h
  retention_period: 744h
  split_queries_by_interval: 15m0s
  tsdb_max_query_parallelism: 40  # I know this is too low; I was just checking whether it is a contributing factor
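Separately, this is roughly how I am tracking that ingester_sent_lines metric during the load test (a Prometheus recording-rule sketch; the group and rule names are ones I made up, and I am assuming the metric is a counter given its _total suffix):
groups:
  - name: loki_query_load
    rules:
      - record: loki:ingester_sent_lines:rate1m
        # lines per second that ingesters send to queriers / query frontend
        expr: sum(rate(loki_logql_querystats_ingester_sent_lines_total[1m]))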
Ingester sent lines to querier / query frontend when running the high-TPS query test:
Ingester memory used: all nodes have 128 GB RAM (with no requests to the query frontend, normal utilization is around 14 GB, ingestion only):
All ingesters: memory used:
Yes, I know chunk utilization is low. That is intentional: I wanted to flush chunks to the backend sooner, to see whether that helps the ingesters.
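To make the "flush chunks sooner" point concrete, these are the ingester chunk settings involved; the values match the config above, and the comments reflect my understanding of what each one does:
ingester:
  chunk_idle_period: 4m       # flush a chunk after 4 minutes without new data
  max_chunk_age: 10m          # force-flush any chunk older than 10 minutes
  chunk_target_size: 1572864  # ~1.5 MB target compressed chunk size
  chunk_retain_period: 30s    # keep flushed chunks in memory briefly for queries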