Hello!
Recently we deployed Loki in our Kubernetes cluster using the loki-distributed chart. Currently the incoming volume of logs is about 10 GB per day - not a big amount, to be fair - and it will only grow in the future. Ingestion seems to work without problems. The main problem we are facing is querying.
At first, when the daily log volume was really small, there were absolutely no problems executing filter queries like this one over a long period such as 24h or even more:
{env="prod",job="omni_services"} |= "some_id"
But now that we have reached about 10 GB of logs per day, the query above only completes for a period of up to 3h-6h. Increasing the interval results in timeouts - and this is the main problem.
There are really a lot of moving parts in Loki and it’s a bit overwhelming to figure out what we should tweak and what would actually have an effect.
The first thing we tried was increasing the dataproxy timeout in grafana.ini - we increased it to 600s (10min):
grafana.ini:
  dataproxy:
    timeout: 600
    logging: false
But even this seems not to be enough for the query to complete.
The next thing we tweaked was Loki’s querier config (we increased engine.timeout to 5m):
querier:
  query_timeout: 1m
  engine:
    timeout: 5m
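As a rough sanity check on these timeouts, the sketch below estimates the scan throughput a 24h query would need in order to finish within each limit. It assumes the worst case, where the stream selector matches the full ~10 GB of daily logs (decimal gigabytes, uncompressed), which may well overstate what our query actually touches. The function name is purely illustrative and is not part of Loki or Grafana.

```python
def required_throughput_mb_s(scanned_gb: float, timeout_s: float) -> float:
    """MB/s needed to scan `scanned_gb` (decimal GB) within `timeout_s` seconds."""
    return scanned_gb * 1000 / timeout_s


# From the post: ~10 GB of logs per day.
# Assumption: the selector matches all of it (worst case for a 24h query).
DAILY_LOGS_GB = 10

# Timeouts currently configured in our setup.
TIMEOUTS_S = {
    "querier query_timeout (1m)": 60,
    "querier engine timeout (5m)": 300,
    "Grafana dataproxy timeout (600s)": 600,
}

for label, timeout_s in TIMEOUTS_S.items():
    rate = required_throughput_mb_s(DAILY_LOGS_GB, timeout_s)
    print(f"{label}: {rate:.1f} MB/s needed for a 24h query")
```

Whichever of these limits actually applies, the shortest one sets the throughput the whole query path has to sustain.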
Currently we have 1 ingester, 1 distributor, 1 index-gateway, 1 query-frontend and 4 queriers.
Each querier uses no more than 200-300m CPU (millicores) - this is probably our main concern: why are the queriers not using more CPU?
We also tried tweaking the split_queries_by_interval setting, but that seems to have had no real impact. Currently we have it set to 30m.
There are no particular errors in the querier logs - only when the request is cancelled by either the nginx or the Grafana timeout.
So the main question is how to properly configure and scale Loki so that it can execute any reasonable query. How can we find the bottleneck of the entire setup? In other words, how can we effectively speed up queries so that we are not hitting any timeouts?
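For context on what we expected from this setting, here is a small sketch of the arithmetic: how many sub-queries a time range turns into for a given split interval, and how many scheduling rounds that implies across our 4 queriers. This is a simplified model that ignores interval alignment and any further sharding. The per-querier concurrency value is an assumption (we have not verified what our queriers actually use), and the helper names are made up for illustration.

```python
import math


def subquery_count(range_minutes: int, split_minutes: int) -> int:
    """Number of sub-queries when a range is cut into split_minutes-sized pieces."""
    return math.ceil(range_minutes / split_minutes)


def scheduling_rounds(subqueries: int, queriers: int, per_querier_concurrency: int) -> int:
    """Rounds needed if every querier runs per_querier_concurrency sub-queries at once."""
    slots = queriers * per_querier_concurrency
    return math.ceil(subqueries / slots)


SPLIT_MINUTES = 30       # split_queries_by_interval: 30m (from our config)
QUERIERS = 4             # from our deployment
ASSUMED_CONCURRENCY = 4  # hypothetical value, not taken from our config

for hours in (3, 6, 24):
    n = subquery_count(hours * 60, SPLIT_MINUTES)
    r = scheduling_rounds(n, QUERIERS, ASSUMED_CONCURRENCY)
    print(f"{hours}h range -> {n} sub-queries, {r} round(s)")
```

Under these assumptions a 24h query becomes 48 sub-queries, so we would have expected the queriers to be noticeably busier than 200-300m CPU each if the split were the limiting factor.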
We are using S3 storage for chunks and indexes.
The complete Loki config is as follows:
auth_enabled: true
server:
  log_level: info
  # Must be set to 3100
  http_listen_port: 3100
  grpc_server_max_recv_msg_size: 8388608 # 8 Mb
  grpc_server_max_send_msg_size: 8388608 # 8 Mb
distributor:
  ring:
    kvstore:
      store: memberlist
memberlist:
  join_members:
    - {{ include "loki.fullname" . }}-memberlist
ingester:
  lifecycler:
    join_after: 0s
    ring:
      kvstore:
        store: memberlist
      replication_factor: 1
  # Disable chunk transfer which is not possible with statefulsets
  # and unnecessary for boltdb-shipper
  max_transfer_retries: 0
  chunk_idle_period: 30m
  chunk_block_size: 262144
  chunk_target_size: 1536000
  chunk_encoding: snappy
  chunk_retain_period: 1m
  max_chunk_age: 1h
  wal:
    dir: /var/loki/wal
limits_config:
  ingestion_rate_mb: 10
  ingestion_burst_size_mb: 20
  max_concurrent_tail_requests: 20
  max_cache_freshness_per_query: 10m
  retention_period: 744h
schema_config:
  configs:
    - from: 2020-09-07
      store: boltdb-shipper
      object_store: aws
      schema: v11
      index:
        prefix: loki_index_
        period: 24h
storage_config:
  aws:
    s3: {{ .Values.storageConfig.aws.s3 }}
    endpoint: {{ .Values.storageConfig.aws.endpoint }}
    access_key_id: {{ .Values.storageConfig.aws.access_key_id }}
    secret_access_key: {{ .Values.storageConfig.aws.secret_access_key }}
    region: {{ .Values.storageConfig.aws.region }}
    s3forcepathstyle: true
  boltdb_shipper:
    active_index_directory: /var/loki/index
    shared_store: s3
    cache_location: /var/loki/cache
    {{- if .Values.indexGateway.enabled }}
    index_gateway_client:
      server_address: dns:///{{ include "loki.indexGatewayFullname" . }}:9095
    {{- end }}
querier:
  query_timeout: 1m
  engine:
    timeout: 5m
query_range:
  # make queries more cache-able by aligning them with their step intervals
  align_queries_with_step: true
  max_retries: 5
  # parallelize queries in 30min intervals
  split_queries_by_interval: 30m
  cache_results: true
  results_cache:
    cache:
      enable_fifocache: true
      fifocache:
        max_size_items: 1024
        validity: 24h
frontend_worker:
  frontend_address: {{ include "loki.queryFrontendFullname" . }}:9095
frontend:
  log_queries_longer_than: 5s
  compress_responses: true
  tail_proxy_url: http://{{ include "loki.querierFullname" . }}:3100
compactor:
  working_directory: /var/loki/compactor
  shared_store: s3
  compaction_interval: 5m
  retention_enabled: true
  compactor_ring:
    kvstore:
      store: memberlist