How to properly scale Loki queries with lots of data

Hello!

We recently deployed Loki in a Kubernetes cluster using the loki-distributed chart. The incoming log volume is currently about 10 GB per day, which is not much to be fair, but it will only grow in the future. Log ingestion seems to work without problems. The main problem we are facing is querying.

At first, when the daily log volume was really small, there were absolutely no problems running filter queries like this one over a long period such as 24h or even more:

{env="prod",job="omni_services"} |= "some_id"

But now that we are approaching 10 GB of logs per day, the query above only completes for ranges of up to 3h-6h. Increasing the range results in timeouts, and this is the main problem.
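For a rough sense of scale, the sketch below estimates the sustained read throughput the queriers would need to finish such a filter query before a timeout. It assumes the worst case where the whole 10 GB/day matches the stream selector and has to be scanned, that 1 GB = 1000 MB, and that a 300 s timeout applies. `required_throughput_mb_s` is a hypothetical helper for illustration, not part of Loki.

```python
def required_throughput_mb_s(daily_gb, window_hours, timeout_s):
    """Return the MB/s that must be scanned to cover the window before the timeout.

    Assumes logs arrive evenly over the day and all of them match the selector.
    """
    scanned_mb = daily_gb * 1000 * window_hours / 24
    return scanned_mb / timeout_s


# Query windows mentioned in the post, with an assumed 300 s (5m) timeout.
for hours in (3, 6, 24):
    rate = required_throughput_mb_s(daily_gb=10, window_hours=hours, timeout_s=300)
    print(f"{hours:>2}h window: about {rate:.1f} MB/s in total across all queriers")
```

If the real numbers are far below what S3 and four queriers should manage, the limit is more likely parallelism or chunk layout than raw scan speed.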

Loki has a lot of moving parts, and it’s a bit overwhelming to work out what we should tweak and what would actually have an effect.

The first thing we tried was increasing the dataproxy timeout in grafana.ini: we raised it to 600s (10 min).

grafana.ini:
  dataproxy:
    timeout: 600
    logging: false

But even this does not seem to be enough for the query to complete.
The next thing we tweaked was Loki’s querier config (we increased engine.timeout to 5m):

querier:
      query_timeout: 1m
      engine:
        timeout: 5m

Currently we have 1 ingester, 1 distributor, 1 index-gateway, 1 query-frontend and 4 queriers.
Each querier uses no more than 200-300m of CPU, and this is probably our main concern: why are the queriers not using more CPU?

We also tried tweaking the split_queries_by_interval setting, but that seemed to have no real impact. It is currently set to 30m.
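To make the effect of the split interval concrete, here is a minimal sketch of how many subqueries a query range breaks into and how many sequential "waves" that means for a given number of queriers. The `slots_per_querier` parameter is a hypothetical stand-in for however many subqueries one querier processes at once. It is an assumption for illustration, not a value taken from our config, and the real count can be off by one when the range is not aligned to the interval.

```python
import math
from datetime import timedelta


def count_subqueries(query_range, split_interval):
    """Number of pieces a query range is cut into by the split interval."""
    return math.ceil(query_range / split_interval)


def count_waves(subqueries, queriers, slots_per_querier):
    """Sequential rounds needed if every querier slot takes one subquery at a time."""
    return math.ceil(subqueries / (queriers * slots_per_querier))


subqueries = count_subqueries(timedelta(hours=24), timedelta(minutes=30))
print(f"24h at 30m split: {subqueries} subqueries")

# 4 queriers as in our setup; slots_per_querier values are hypothetical.
for slots in (1, 4):
    waves = count_waves(subqueries, queriers=4, slots_per_querier=slots)
    print(f"{slots} slot(s) per querier: {waves} wave(s)")
```

If only a few subqueries actually run at the same time, most of them wait in a queue, which would be consistent with queriers sitting at low CPU.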

There are no particular errors in the querier logs, only messages when a request is cancelled by either the nginx or the Grafana timeout.

So, the main question is how to properly configure and scale Loki so that it can execute any reasonable query. How can we find out what the bottleneck of the entire setup is? In other words, how can we effectively speed up queries so that we are not hitting any timeouts?

We are using S3 storage for chunks and indexes.
The complete Loki config is the following:

    auth_enabled: true

    server:
      log_level: info
      # Must be set to 3100
      http_listen_port: 3100
      grpc_server_max_recv_msg_size: 8388608  # 8 MiB
      grpc_server_max_send_msg_size: 8388608  # 8 MiB

    distributor:
      ring:
        kvstore:
          store: memberlist

    memberlist:
      join_members:
        - {{ include "loki.fullname" . }}-memberlist

    ingester:
      lifecycler:
        join_after: 0s
        ring:
          kvstore:
            store: memberlist
          replication_factor: 1
      # Disable chunk transfer which is not possible with statefulsets
      # and unnecessary for boltdb-shipper
      max_transfer_retries: 0

      chunk_idle_period: 30m
      chunk_block_size: 262144
      chunk_target_size: 1536000
      chunk_encoding: snappy
      chunk_retain_period: 1m
      max_chunk_age: 1h
      wal:
        dir: /var/loki/wal

    limits_config:
      ingestion_rate_mb: 10
      ingestion_burst_size_mb: 20
      max_concurrent_tail_requests: 20
      max_cache_freshness_per_query: 10m
      retention_period: 744h

    schema_config:
      configs:
        - from: 2020-09-07
          store: boltdb-shipper
          object_store: aws
          schema: v11
          index:
            prefix: loki_index_
            period: 24h

    storage_config:
      aws:
        s3: {{ .Values.storageConfig.aws.s3 }}
        endpoint: {{ .Values.storageConfig.aws.endpoint }}
        access_key_id: {{ .Values.storageConfig.aws.access_key_id }}
        secret_access_key: {{ .Values.storageConfig.aws.secret_access_key }}
        region: {{ .Values.storageConfig.aws.region }}
        s3forcepathstyle: true
      boltdb_shipper:
        active_index_directory: /var/loki/index
        shared_store: s3
        cache_location: /var/loki/cache
      {{- if .Values.indexGateway.enabled }}
        index_gateway_client:
          server_address: dns:///{{ include "loki.indexGatewayFullname" . }}:9095
      {{- end }}

    querier:
      query_timeout: 1m
      engine:
        timeout: 5m

    query_range:
      # make queries more cache-able by aligning them with their step intervals
      align_queries_with_step: true
      max_retries: 5
      # parallelize queries in 30min intervals
      split_queries_by_interval: 30m
      cache_results: true

      results_cache:
        cache:
          enable_fifocache: true
          fifocache:
            max_size_items: 1024
            validity: 24h

    frontend_worker:
      frontend_address: {{ include "loki.queryFrontendFullname" . }}:9095

    frontend:
      log_queries_longer_than: 5s
      compress_responses: true
      tail_proxy_url: http://{{ include "loki.querierFullname" . }}:3100

    compactor:
      working_directory: /var/loki/compactor
      shared_store: s3
      compaction_interval: 5m
      retention_enabled: true
      compactor_ring:
        kvstore:
          store: memberlist

Hi!

How is your chunk utilization? You can check it with the following queries:

sum by(reason) (rate(loki_ingester_chunks_flushed_total[1h]))
/ ignoring(reason) group_left sum(rate(loki_ingester_chunks_flushed_total[1h]))

sum(rate(loki_ingester_chunk_utilization_sum[1h])) / sum(rate(loki_ingester_chunk_utilization_count[1h]))
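In case it helps with reading the results, here is a small sketch of what those two expressions compute: the first gives the share of chunk flushes per flush reason, the second the mean chunk utilization. The function names and all sample numbers below are made up for illustration, and the reason labels in the sample are assumptions, so check what your ingesters actually report.

```python
def flush_reason_shares(flushes_by_reason):
    """Fraction of all flushes attributed to each reason (first query)."""
    total = sum(flushes_by_reason.values())
    return {reason: count / total for reason, count in flushes_by_reason.items()}


def mean_utilization(utilization_sum, utilization_count):
    """Average chunk utilization from a histogram's sum and count (second query)."""
    return utilization_sum / utilization_count


# Made-up flush rates per reason; the label names are assumed.
sample = {"idle": 60, "max_age": 30, "full": 10}
shares = flush_reason_shares(sample)
for reason, share in shares.items():
    print(f"{reason}: {share:.0%} of flushes")

# Made-up histogram sum and count.
print(f"mean utilization: {mean_utilization(35.0, 100):.0%}")
```

If most flushes happen before chunks fill up and the mean utilization is low, that would point to many small chunks, which means more objects for the queriers to fetch from storage per query.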

Also, what is the average chunk size in storage?