Logging a million lines per minute

We have a particular set of microservices that log huge amounts of text for regulatory reasons. We are using Loki in microservices mode with separate ReplicaSets for ingesters/queriers/gateway/distributor/frontend etc.

I have inherited this deployment and have found that Loki can time out when querying a 0.5 hour time period in Grafana, despite the timeout being set to 1 minute.

Where can I find the best config settings for a use case like this? Are there concrete guidelines anywhere?

Hello,

Have a look(i) (sorry, could not resist…) at Best practices | Grafana Labs.

I think in particular getting chunk_target_size right would be a first step.
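As a rough sketch (the values here are illustrative starting points, not tuned for your workload), that knob sits in the ingester block of the Loki config:

```yaml
ingester:
  # Target ~1.5 MB compressed chunks so queriers fetch fewer, larger objects.
  chunk_target_size: 1572864
  # Let busy streams fill chunks before flushing; too short an idle period
  # produces many tiny chunks that slow queries down.
  chunk_idle_period: 30m
  max_chunk_age: 2h
```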

I have looked at a few other posts as well. The following is from one of them, but I only copied the text into my own Loki optimization doc, so I don't have the link:

increasing the query_range: split_queries_by_interval parameter to 24h decreased total query time to 1-1.3 min for a 30 day range
(previously 4.5-5 min)

This is a query-frontend setting. I'm not sure what a good value for your use case would be, though. Probably not 24h…
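As a sketch of where that setting lives (the 30m value is only a placeholder; you would need to experiment):

```yaml
query_range:
  # Split a long range query into sub-queries of this interval so the
  # queriers can execute them in parallel.
  split_queries_by_interval: 30m
  align_queries_with_step: true

limits_config:
  # Must be high enough that the split sub-queries actually run in parallel.
  max_query_parallelism: 32
```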

Hope that helps.

Thanks for your reply. Yes, I've already tried to implement everything stated in the docs, but we still get inconsistent query timeouts when selecting a 3+ hour time period.

Here is my config.

compactor:
  enabled: true
  resources:
    limits:
      cpu: 500m
      memory: 128Mi
    requests:
      cpu: 500m
      memory: 128Mi
  nodeSelector:
    lifecycle: spot

gateway:
  nodeSelector:
    lifecycle: spot
  replicas: 6
  ingress:
    enabled: true
    ingressClassName: redacted
    annotations:
      # for older clusters
      # kubernetes.io/ingress.class: redacted
      nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
      # this might help with timeouts needs testing
      nginx.ingress.kubernetes.io/proxy-read-timeout: "600"
    hosts:
      - host: redacted
        paths:
          - path: /
            pathType: Prefix
    tls: []
  nginxConfig:
    httpSnippet: |
      proxy_read_timeout 600;
      proxy_connect_timeout 600;
      proxy_send_timeout 600;

loki:
  config: |
    auth_enabled: false
    compactor:
      shared_store: s3

    server:
      log_level: info
      http_listen_port: 3100
      # avoids "grpc: received message larger than max" errors;
      # roughly double the default value
      grpc_server_max_recv_msg_size: 20730922
      grpc_server_max_send_msg_size: 20730922

    distributor:
      ring:
        kvstore:
          store: memberlist

    memberlist:
      join_members:
        - loki-distributed-memberlist

    ingester:
      lifecycler:
        ring:
          kvstore:
            store: memberlist
          replication_factor: 3
      chunk_target_size: 1536000
      chunk_idle_period: 15m
      chunk_block_size: 262144
      chunk_encoding: snappy
      chunk_retain_period: 5m
      max_transfer_retries: 0
      wal:
        enabled: true
        dir: /var/loki/wal/
        replay_memory_ceiling: 10GB


    limits_config:
      ingestion_rate_mb: 1000
      enforce_metric_name: false
      reject_old_samples: false
      reject_old_samples_max_age: 24h
      max_cache_freshness_per_query: 10m
      max_concurrent_tail_requests: 200
      max_query_parallelism: 96
      max_streams_per_user: 40000
      per_stream_rate_limit: 800MB
      cardinality_limit: 300000

    schema_config:
      configs:
        - from: 2020-09-07
          store: boltdb-shipper
          object_store: aws
          schema: v11
          index:
            prefix: loki_v2_index_
            period: 24h

    storage_config:
      index_queries_cache_config:
        enable_fifocache: false
        memcached:
          expiration: 24h
          batch_size: 100
          parallelism: 200
        memcached_client:
          consistent_hash: true
          host: loki-distributed-memcached-index-queries
          service: http
      boltdb_shipper:
        active_index_directory: /var/loki/indexv2
        shared_store: s3
        cache_location: /var/loki/cache
        cache_ttl: 168h
      filesystem:
        directory: /var/loki/chunks
      aws:
        s3: s3://redacted
        bucketnames: redacted
        sse_encryption: true

    chunk_store_config:
      chunk_cache_config:
        enable_fifocache: false
        memcached:
          expiration: 2h
          batch_size: 100
          parallelism: 200
        memcached_client:
          consistent_hash: true
          host: loki-distributed-memcached-chunks
          service: http
      max_look_back_period: 0s

    table_manager:
      retention_deletes_enabled: false
      retention_period: 0s

    query_range:
      align_queries_with_step: true
      max_retries: 5
      split_queries_by_interval: 10m
      parallelise_shardable_queries: true
      cache_results: true
      results_cache:
        cache:
          enable_fifocache: true
          memcached_client:
            consistent_hash: true
            host: loki-distributed-memcached-chunks
            max_idle_conns: 16
            service: http
            timeout: 500ms
            update_interval: 1m

    querier:
      query_ingesters_within: 3h
      query_timeout: 10m
      tail_max_duration: 24h

    frontend_worker:
      frontend_address: loki-distributed-query-frontend:9095
      parallelism: 6  # 6 cores available

    frontend:
      log_queries_longer_than: 30s
      compress_responses: true
      max_outstanding_per_tenant: 1024

distributor:
  replicas: 4
  resources:
    limits:
      cpu: 500m
      memory: 256Mi
    requests:
      cpu: 500m
      memory: 256Mi
  nodeSelector:
    lifecycle: spot

ingester:
  persistence:
    enabled: true
    storageClass: gp2
    size: 30G
  resources:
    limits:
      cpu: 2000m
      memory: 14Gi
    requests:
      cpu: 2000m
      memory: 14Gi
  replicas: 6
  nodeSelector:
    lifecycle: spot

memcachedChunks:
  enabled: true
  replicas: 8
  extraArgs:
    - -m 19000
    - -I 10m
    - -vvv
  resources:
    requests:
      cpu: 1000m
      memory: 20Gi
    limits:
      cpu: 1000m
      memory: 20Gi
  nodeSelector:
    lifecycle: spot

memcachedIndexQueries:
  replicas: 4
  resources:
    requests:
      cpu: 500m
      memory: 2Gi
    limits:
      cpu: 500m
      memory: 2Gi
  enabled: true
  nodeSelector:
    lifecycle: spot
memcachedExporter:
  enabled: true
  resources:
    limits:
      cpu: 100m
      memory: 50Mi
    requests:
      cpu: 100m
      memory: 50Mi

queryFrontend:
  replicas: 3
  resources:
    requests:
      cpu: 200m
      memory: 256Mi
    limits:
      cpu: 200m
      memory: 256Mi
  nodeSelector:
    lifecycle: spot

querier:
  resources:
    requests:
      cpu: 2000m
      memory: 2Gi
    limits:
      cpu: 6000m
      memory: 2Gi
  replicas: 16
  nodeSelector:
    lifecycle: spot

memcachedFrontend:
  nodeSelector:
    lifecycle: spot
  serviceMonitor:
    enabled: true
    labels:
      release: kube-prometheus-stack
    interval: 30s

serviceAccount:
  create: true
  name: "loki-distributed"
  annotations:
    eks.amazonaws.com/role-arn: redacted

Query example:

count_over_time({job="logging/logtest"} [1s]) for a 3-24 hour period.
Something like this will often time out with a 502 when run via logcli, or a 504 when run via Grafana.
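For reference, a minimal logcli reproduction of that query might look like the following (the address is a placeholder, and I am assuming logcli's standard --since flag):

```
# Point logcli at the gateway (placeholder address) and run the metric
# query over the last 3 hours.
export LOKI_ADDR=http://loki-gateway.example.internal:3100
logcli query --since=3h 'count_over_time({job="logging/logtest"}[1s])'
```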