Hello Loki Team!
I would like some help tuning the performance of our Loki setup. I’ve read blog posts about blazing speeds but can’t seem to reproduce them. The read path is what I am trying to optimize.
The current setup is Loki 2.0 in microservices mode, deployed to AWS with S3 for chunks and DynamoDB for the index. We are using memberlist for the ring, and ElastiCache Redis has been added as the cache.
We are running 2 query frontends, 3 distributors, 3 ingesters, 6 queriers and 1 table manager container. These all run in ECS Fargate, which may be part of the problem. chunk_retain_period is very low because Fargate does not have persistent storage. My thought is that the queriers have to pull everything from S3, but that would be the case for any query reaching back further than chunk_retain_period. The benchmark in the above-mentioned blog post covers a 1 hour period; perhaps those chunks had not been flushed yet and were served from ingesters on persistent storage rather than from S3?
Anyway, please see my benchmarks below, and config further down.
Some benchmarking… (only queriers were scaled out)
6 Queriers over 48h of logs just straight up timed out.
12 Queriers…
time ./logcli query '{Environment="production",Application="test-app",Deployment="production",Service="test-service",LogGroup="test-service-logs"} |~ "(?i)pppppppppppp"' --since=48h --stats
Ingester.TotalReached 12
Ingester.TotalChunksMatched 72
Ingester.TotalBatches 0
Ingester.TotalLinesSent 0
Ingester.HeadChunkBytes 923 kB
Ingester.HeadChunkLines 755
Ingester.DecompressedBytes 85 MB
Ingester.DecompressedLines 63807
Ingester.CompressedBytes 14 MB
Ingester.TotalDuplicates 0
Store.TotalChunksRef 8144
Store.TotalChunksDownloaded 8144
Store.ChunksDownloadTime 8m38.330467263s
Store.HeadChunkBytes 0 B
Store.HeadChunkLines 0
Store.DecompressedBytes 59 GB
Store.DecompressedLines 44850257
Store.CompressedBytes 9.6 GB
Store.TotalDuplicates 0
Summary.BytesProcessedPerSecond 375 MB
Summary.LinesProcessedPerSecond 285792
Summary.TotalBytesProcessed 59 GB
Summary.TotalLinesProcessed 44914819
Summary.ExecTime 2m37.158656521s
real 2m37.267s
user 0m0.119s
sys 0m0.016s
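As a rough sanity check on the stats above, the following Python sketch works out how much of the data came from the store rather than the ingesters, and how large the downloaded chunks were on average. The figures are copied from the 12-querier output and chunk_target_size from the config further down; reading "MB" and "GB" as decimal units is an assumption, as are the variable names.

```python
# Figures copied from the 12-querier --stats output (decimal units assumed).
ingester_decompressed = 85e6   # Ingester.DecompressedBytes, 85 MB
store_decompressed = 59e9      # Store.DecompressedBytes, 59 GB
store_compressed = 9.6e9       # Store.CompressedBytes, 9.6 GB
chunks_downloaded = 8144       # Store.TotalChunksDownloaded
chunk_target_size = 1_536_000  # chunk_target_size from the ingester config

# Fraction of all decompressed bytes that had to be fetched from the store.
store_share = store_decompressed / (store_decompressed + ingester_decompressed)

# Average compressed size of a downloaded chunk, and how full it is
# relative to the configured target size.
avg_chunk_bytes = store_compressed / chunks_downloaded
target_fill = avg_chunk_bytes / chunk_target_size

# Ratio of decompressed to compressed bytes for the store chunks.
compression_ratio = store_decompressed / store_compressed

print(f"share of bytes read from the store: {store_share:.2%}")
print(f"average compressed chunk size:      {avg_chunk_bytes / 1e6:.2f} MB")
print(f"fill relative to chunk_target_size: {target_fill:.0%}")
print(f"compression ratio:                  {compression_ratio:.1f}x")
```

Under these assumptions almost all of the bytes come from the store, which fits the suspicion that the queriers pull nearly everything from S3 for a 48h query.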
24 Queriers…
Ingester.TotalReached 12
Ingester.TotalChunksMatched 56
Ingester.TotalBatches 0
Ingester.TotalLinesSent 0
Ingester.HeadChunkBytes 1.0 MB
Ingester.HeadChunkLines 841
Ingester.DecompressedBytes 84 MB
Ingester.DecompressedLines 63012
Ingester.CompressedBytes 14 MB
Ingester.TotalDuplicates 0
Store.TotalChunksRef 8122
Store.TotalChunksDownloaded 8122
Store.ChunksDownloadTime 5m31.764491701s
Store.HeadChunkBytes 0 B
Store.HeadChunkLines 0
Store.DecompressedBytes 59 GB
Store.DecompressedLines 44704040
Store.CompressedBytes 9.6 GB
Store.TotalDuplicates 0
Summary.BytesProcessedPerSecond 500 MB
Summary.LinesProcessedPerSecond 380884
Summary.TotalBytesProcessed 59 GB
Summary.TotalLinesProcessed 44767893
Summary.ExecTime 1m57.536630781s
real 1m57.730s
user 0m0.104s
sys 0m0.009s
There are some gains here, but I am still not seeing the performance showcased in the blog. I am also collecting the /metrics endpoints, so let me know if you would like to see something specific.
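To put a number on those gains, here is a small Python sketch that compares the two runs. It only uses Summary.ExecTime and Summary.TotalBytesProcessed from the outputs above; the helper parse_go_duration is a hypothetical name and handles only the h/m/s units that appear in these stats.

```python
import re


def parse_go_duration(text: str) -> float:
    """Convert a Go-style duration such as '2m37.158656521s' to seconds.

    Only the h, m and s units are handled, which is all these stats use.
    """
    units = {"h": 3600.0, "m": 60.0, "s": 1.0}
    parts = re.findall(r"(\d+(?:\.\d+)?)([hms])", text)
    if not parts:
        raise ValueError(f"unrecognised duration: {text!r}")
    return sum(float(value) * units[unit] for value, unit in parts)


total_bytes = 59e9  # Summary.TotalBytesProcessed, 59 GB (decimal units assumed)
exec_times = {      # querier count -> Summary.ExecTime
    12: parse_go_duration("2m37.158656521s"),
    24: parse_go_duration("1m57.536630781s"),
}

for queriers, seconds in exec_times.items():
    throughput = total_bytes / seconds
    print(f"{queriers} queriers: {throughput / 1e6:.0f} MB/s total, "
          f"{throughput / queriers / 1e6:.1f} MB/s per querier")

# Speedup from doubling the queriers, and how close it is to the ideal 2x.
speedup = exec_times[12] / exec_times[24]
efficiency = speedup / (24 / 12)
print(f"speedup: {speedup:.2f}x, scaling efficiency: {efficiency:.0%}")
```

Doubling the queriers shortens the query by roughly a third rather than by half, and the per-querier throughput drops, so something other than querier count appears to be limiting the read path.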
Config below; LOKI_-prefixed variables are replaced on deploy.
auth_enabled: false
server:
  http_listen_address: 0.0.0.0
  http_listen_port: LOKI_HTTP_LISTEN_PORT
  grpc_listen_address: 0.0.0.0
  grpc_listen_port: LOKI_GRPC_LISTEN_PORT
  http_server_read_timeout: 3m
  http_server_write_timeout: 3m
  log_level: LOKI_LOG_LEVEL
memberlist:
  bind_port: LOKI_HTTP_MEMBERLIST_LISTEN_PORT
  join_members:
    - dns+LOKI_QUERIER_RECORD_NAME.LOKI_SERVICE_DISCOVERY_NAMESPACE:LOKI_HTTP_MEMBERLIST_LISTEN_PORT
    - dns+LOKI_DISTRIBUTOR_RECORD_NAME.LOKI_SERVICE_DISCOVERY_NAMESPACE:LOKI_HTTP_MEMBERLIST_LISTEN_PORT
    - dns+LOKI_INGESTER_RECORD_NAME.LOKI_SERVICE_DISCOVERY_NAMESPACE:LOKI_HTTP_MEMBERLIST_LISTEN_PORT
  max_join_backoff: 1m
  max_join_retries: 10
  min_join_backoff: 1s
  dead_node_reclaim_time: 30s
  gossip_to_dead_nodes_time: 15s
  left_ingesters_timeout: 30s
ingester:
  lifecycler:
    join_after: 60s
    observe_period: 5s
    interface_names:
      - "eth1"
    ring:
      kvstore:
        store: memberlist
      replication_factor: 3
    final_sleep: 0s
  chunk_idle_period: 15m
  max_chunk_age: 1h
  chunk_retain_period: 30s
  max_transfer_retries: 0
  chunk_target_size: 1536000
  chunk_block_size: 262144
querier:
  query_timeout: 2m
  query_ingesters_within: 2h
query_range:
  split_queries_by_interval: 30m
  align_queries_with_step: true
  max_retries: 5
  parallelise_shardable_queries: true
  cache_results: true
  results_cache:
    cache:
      redis:
        endpoint: LOKI_REDIS_ENDPOINT
        timeout: 1s
        db: 0
schema_config:
  configs:
    - from: 2020-09-01
      store: aws
      object_store: aws
      schema: v11
      index:
        prefix: LOKI_LOG_TABLE_PREFIX
        period: 168h
storage_config:
  aws:
    s3: s3://LOKI_AWS_REGION/LOKI_LOG_BUCKET_NAME
    dynamodb:
      dynamodb_url: dynamodb://LOKI_AWS_REGION
  index_cache_validity: 14m
  index_queries_cache_config:
    redis:
      endpoint: LOKI_REDIS_ENDPOINT
      timeout: 1s
      db: 1
limits_config:
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 168h
  ingestion_rate_mb: 16
  ingestion_burst_size_mb: 24
  max_entries_limit_per_query: 50000
  max_query_parallelism: 12
chunk_store_config:
  chunk_cache_config:
    redis:
      endpoint: LOKI_REDIS_ENDPOINT
      timeout: 1s
      db: 2
  write_dedupe_cache_config:
    redis:
      endpoint: LOKI_REDIS_ENDPOINT
      timeout: 1s
      db: 3
  cache_lookups_older_than: 36h
  max_look_back_period: 672h
table_manager:
  index_tables_provisioning:
    enable_ondemand_throughput_mode: true
    enable_inactive_throughput_on_demand_mode: true
  retention_deletes_enabled: true
  retention_period: 672h
frontend:
  log_queries_longer_than: 5s
  downstream_url: https://LOKI_QUERIER_RECORD_NAME.LOKI_PRIVATE_ZONE_DOMAIN_NAME
  compress_responses: true
  max_outstanding_per_tenant: 3600
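For reference, this is a rough Python sketch of how the 48h benchmark query would be divided up under the config above. It treats max_query_parallelism as an upper bound on the sub-queries scheduled concurrently for a single query, which is my reading of the setting and should be taken as an assumption. It also ignores the extra sub-queries that parallelise_shardable_queries may create, and plan_splits is a hypothetical helper name.

```python
import math
from datetime import timedelta


def plan_splits(window, split_interval, max_parallelism):
    """Return (number of sub-queries, number of sequential waves).

    Assumes the window is cut into equal intervals and that at most
    max_parallelism sub-queries run at the same time.
    """
    sub_queries = math.ceil(window / split_interval)
    waves = math.ceil(sub_queries / max_parallelism)
    return sub_queries, waves


window = timedelta(hours=48)                  # --since=48h
split_interval = timedelta(minutes=30)        # split_queries_by_interval
max_parallelism = 12                          # max_query_parallelism
query_ingesters_within = timedelta(hours=2)   # query_ingesters_within

sub_queries, waves = plan_splits(window, split_interval, max_parallelism)

# Part of the window that lies outside query_ingesters_within and so can
# only be answered from the store.
store_only_fraction = (window - query_ingesters_within) / window

print(f"sub-queries: {sub_queries}, waves of {max_parallelism}: {waves}")
print(f"window served from the store only: {store_only_fraction:.1%}")
```

If that reading is right, the query is processed in several sequential waves no matter how many queriers are available, which might explain why going from 12 to 24 queriers gives less than a 2x improvement.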