We are using the OSS versions of Grafana and Loki (Loki 2.9.11 and Grafana 9.2.2) for log analysis. We created a dashboard to load logs, but we are hitting performance problems when loading more than 2-3 days of logs onto the dashboard.
We recently upgraded Loki from 2.7.5 (with boltdb-shipper) to 2.9.11 (with tsdb), expecting better performance, but it has not improved much.
Setup: a Loki cluster of 4 servers (each with 8 GB RAM and 2 vCPUs) and 2 Grafana servers (each with 32 GB RAM and 8 vCPUs) behind a load balancer. All of them are virtual machines on GCP.
Issue: With this setup we can load at most 3-4 days of logs, which is around 1-1.2 GB and 1-1.4 million (10-14 lakh) lines, from a single node with only a time filter on a specific log file. But we have around 40-50 nodes.
Target: In some cases we need to search about a month of logs from all the nodes for a specific unique ID. In another case we need to load at least a week of logs onto the dashboard for a different set of servers with a high log volume.
Stats for reference: Loading 3 days of logs onto the dashboard took around 60-120 seconds; about 1 GB of data was processed and 949,997 rows were returned, although the summary reports an exec time of 3.64 s. Resource consumption was around 30-40% of RAM and 15-20% of CPU.
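To illustrate the kind of queries involved (a minimal sketch; the host, filename and job labels and the ID value are placeholders, not our real label names):

Current single-node query, with 3-4 days selected in Grafana:
  {host="node-01", filename="/var/log/app/app.log"}

Target query, about a month across all 40-50 nodes, filtered by a unique ID:
  {job="app"} |= "UNIQUE-ID-1234"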
Configuration
auth_enabled: false

server:
  http_listen_port: 3100
  grpc_listen_port: 9096
  http_tls_config: &tls_server_config
    cert_file: /etc/loki/ssl-cert.pem
    key_file: /etc/loki/ssl-key.pem
  grpc_server_max_recv_msg_size: 104857600 # Max receive size 100 MB (default is 4 MB)
  grpc_server_max_send_msg_size: 104857600 # Max send size 100 MB (default is 4 MB)
  log_level: debug # Use 'debug' only while troubleshooting; otherwise 'info'

storage_config:
  boltdb_shipper:
    active_index_directory: /loki/boltdb-shipper-active
    cache_location: /loki/boltdb-shipper-cache
    cache_ttl: 24h
    shared_store: gcs
  tsdb_shipper:
    active_index_directory: /var/lib/loki/index
    cache_location: /var/lib/loki/cache
    cache_ttl: 24h
  gcs:
    bucket_name: loki-bucket

common:
  path_prefix: /var/lib/loki # Data is stored here; update accordingly
  replication_factor: 3
  ring:
    kvstore:
      store: memberlist

query_range:
  results_cache:
    cache:
      embedded_cache:
        enabled: true
        max_size_mb: 500 # Increased from 100 MB to 500 MB for better performance
  parallelise_shardable_queries: true

query_scheduler:
  max_outstanding_requests_per_tenant: 8192

schema_config:
  configs:
    - from: 2020-10-24
      store: boltdb-shipper
      object_store: gcs
      schema: v11
      index:
        prefix: index_
        period: 24h # Consider adjusting this if query performance issues arise
    - from: 2025-01-03
      store: tsdb
      object_store: gcs
      schema: v13
      index:
        prefix: index_
        period: 24h # Consider adjusting this if query performance issues arise

limits_config:
  max_entries_limit_per_query: 5000000
  ingestion_rate_mb: 2048 # Increased from 1024 MB to handle higher ingestion rates
  ingestion_burst_size_mb: 4096 # Increased to handle bursts
  split_queries_by_interval: 15m # Moved from query_range to limits_config
  max_streams_per_user: 10000 # Increase the max streams per user if needed
  max_global_streams_per_user: 100000 # Increase the global max streams per user if needed
  per_stream_rate_limit: 15MB # Increase the per-stream rate limit to 15 MB
  per_stream_rate_limit_burst: 25MB # Allow bursting up to 25 MB
  query_timeout: 300s
  creation_grace_period: 10h
  retention_period: 365d # 365 days
  max_query_lookback: 365d # 365 days
  tsdb_max_query_parallelism: 512

memberlist:
  join_members:
    # You can use a headless k8s service for all distributor, ingester and querier components.
    - loki11001:7946
    - loki11002:7946
    - loki11003:7946
    - loki11004:7946

ingester:
  lifecycler:
    ring:
      kvstore:
        store: memberlist
      replication_factor: 3
    final_sleep: 0s
  chunk_idle_period: 5m # Adjusted to handle high-volume ingestion
  max_chunk_age: 2h
  chunk_retain_period: 30s

ingester_client:
  grpc_client_config:
    max_recv_msg_size: 104857600
    max_send_msg_size: 104857600

distributor:
  ring:
    kvstore:
      store: memberlist

frontend:
  compress_responses: true
  log_queries_longer_than: 5s # Log queries taking longer than 5 seconds, for monitoring
  max_outstanding_per_tenant: 4096

compactor:
  working_directory: /var/lib/loki/compactor
  shared_store: gcs # Ensure the compactor uses the same storage backend
  retention_enabled: true
  delete_request_store: gcs
Is there a better way to reduce the query processing time through configuration optimization?
If not, what is the best infra setup to make it respond faster?
We created a dashboard to load logs, but we are hitting performance problems when loading more than 2-3 days of logs onto the dashboard.
Loading 3 days of logs onto the dashboard took around 60-120 seconds; about 1 GB of data was processed and 949,997 rows were returned
So you are trying to load around a million 1 KB log lines into a Grafana dashboard? Grafana will likely eat all the available RAM and CPU just displaying that amount of logs. Why do you need to load a million logs into a Grafana dashboard? A typical human can investigate at most a few hundred 1 KB log lines, and even that takes at least 5 minutes; investigating a million logs would take years.
Probably you need to look at aggregates over that million logs instead. For example, count the logs grouped by some field, so that the total number of results displayed in Grafana stays below 100.
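As a rough sketch (the job and host labels and the ID are placeholders for whatever labels your setup actually has), a LogQL metric query like the following returns one number per host per interval instead of raw log lines:

  sum by (host) (count_over_time({job="app"} |= "UNIQUE-ID-1234" [1h]))

Rendered as a time series or table panel, the result stays small no matter how many log lines match underneath.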
If instead you are trying to save a million logs from Loki into a file for further processing by some external tool, that is a fairly common case. Loki limits the number of returned logs to 100 by default according to these docs (see the default value for the limit arg). If you set the limit to a million, it is likely Loki won't be able to process the query because of a lack of RAM and/or CPU. So Loki isn't well suited to exporting large volumes of logs that match some filter. I'd recommend trying VictoriaLogs for this task: it supports exporting an arbitrary number of matching logs in a single query, and it doesn't need a lot of RAM/CPU for this. See these docs for details.