Loki Query Performance

We’re currently using Loki and Fluent Bit to ship logs from a third-party application, which produces roughly 400k log lines per 5 minutes. We’re running loki-distributed on our cluster with 3 shared monitoring nodes (4 CPUs, 32 GB RAM each); here is our current config.

---
loki:
  auth_enabled: false
  schemaConfig:
    configs:
      - from: 2024-10-10
        store: tsdb
        object_store: s3
        schema: v13
        index:
          prefix: loki_index_
          period: 24h
  ingester:
    chunk_encoding: snappy
    chunk_idle_period: 1h
    chunk_target_size: 1536000
    max_chunk_age: 1h
    wal:
      enabled: false

  tracing:
    enabled: true

  querier:
    # Default is 4, if you have enough memory and CPU you can increase, reduce if OOMing
    max_concurrent: 8

  grpc_client:
    max_send_msg_size: 999999999999 
    max_recv_msg_size: 999999999999 

  compactor:
    apply_retention_interval: 24h
    compaction_interval: 5m
    retention_delete_worker_count: 1000
    retention_enabled: true
    retention_delete_delay: 2h
    working_directory: /var/loki/data/compactor
    delete_request_store: aws

  frontend_worker:
    grpc_client_config:
      max_send_msg_size: 999999999999 
      max_recv_msg_size: 999999999999 
  server:
    http_server_read_timeout: 1200s # allow longer time span queries
    http_server_idle_timeout: 1200s
    http_server_write_timeout: 1200s # allow longer time span queries
    grpc_server_max_recv_msg_size: 999999999999 
    grpc_server_max_send_msg_size: 999999999999 
    grpc_server_max_concurrent_streams: 10000000
    grpc_server_max_connection_age: 15m
    grpc_server_max_connection_age_grace: 15m
    grpc_server_max_connection_idle: 10m

  limits_config:
    max_query_series: 1000000
    reject_old_samples: true
    reject_old_samples_max_age: 336h
    retention_period: 720h
    max_query_parallelism: 100
    max_entries_limit_per_query: 1000000
    max_global_streams_per_user: 0
    query_timeout: 30m
    split_queries_by_interval: 30m
    unordered_writes: true
    shard_streams:
      enabled: true

  storage:
    bucketNames:
      chunks: loki-s3-data-storage
      ruler: loki-s3-data-storage
    type: s3
    s3:
      region: eu-west-2
      endpoint: s3.eu-west-2.amazonaws.com
      s3forcepathstyle: false
      insecure: false
      accessKeyId: "${AWS_ACCESS_KEY_ID}"
      secretAccessKey: "${AWS_SECRET_ACCESS_KEY}"
deploymentMode: Distributed

ingester:
  replicas: 3
  tolerations:
    - key: "group"
      operator: "Equal"
      value: "monitoring"
      effect: "PreferNoSchedule"
  nodeSelector:
    group: monitoring
  extraArgs:
    - "-config.expand-env=true"
  extraEnv:
    - name: GRAFANA_LOKI_S3_ENDPOINT
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_endpoint
    - name: AWS_ACCESS_KEY_ID
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_accesskeyid
    - name: AWS_SECRET_ACCESS_KEY
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_secretaccesskey

querier:
  replicas: 3
  maxUnavailable: 2
  tolerations:
    - key: "group"
      operator: "Equal"
      value: "monitoring"
      effect: "PreferNoSchedule"
  nodeSelector:
    group: monitoring
  extraArgs:
    - "-config.expand-env=true"
  extraEnv:
    - name: GRAFANA_LOKI_S3_ENDPOINT
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_endpoint
    - name: AWS_ACCESS_KEY_ID
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_accesskeyid
    - name: AWS_SECRET_ACCESS_KEY
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_secretaccesskey

queryFrontend:
  replicas: 2
  maxUnavailable: 1
  tolerations:
    - key: "group"
      operator: "Equal"
      value: "monitoring"
      effect: "PreferNoSchedule"
  nodeSelector:
    group: monitoring
  extraArgs:
    - "-config.expand-env=true"
  extraEnv:
    - name: GRAFANA_LOKI_S3_ENDPOINT
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_endpoint
    - name: AWS_ACCESS_KEY_ID
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_accesskeyid
    - name: AWS_SECRET_ACCESS_KEY
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_secretaccesskey

queryScheduler:
  replicas: 2
  tolerations:
    - key: "group"
      operator: "Equal"
      value: "monitoring"
      effect: "PreferNoSchedule"
  nodeSelector:
    group: monitoring
  extraArgs:
    - "-config.expand-env=true"
  extraEnv:
    - name: GRAFANA_LOKI_S3_ENDPOINT
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_endpoint
    - name: AWS_ACCESS_KEY_ID
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_accesskeyid
    - name: AWS_SECRET_ACCESS_KEY
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_secretaccesskey

distributor:
  replicas: 3
  maxUnavailable: 2
  tolerations:
    - key: "group"
      operator: "Equal"
      value: "monitoring"
      effect: "PreferNoSchedule"
  nodeSelector:
    group: monitoring
  extraArgs:
    - "-config.expand-env=true"
  extraEnv:
    - name: GRAFANA_LOKI_S3_ENDPOINT
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_endpoint
    - name: AWS_ACCESS_KEY_ID
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_accesskeyid
    - name: AWS_SECRET_ACCESS_KEY
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_secretaccesskey

compactor:
  replicas: 1
  tolerations:
    - key: "group"
      operator: "Equal"
      value: "monitoring"
      effect: "PreferNoSchedule"
  nodeSelector:
    group: monitoring
  extraArgs:
    - "-config.expand-env=true"
  extraEnv:
    - name: GRAFANA_LOKI_S3_ENDPOINT
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_endpoint
    - name: AWS_ACCESS_KEY_ID
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_accesskeyid
    - name: AWS_SECRET_ACCESS_KEY
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_secretaccesskey

indexGateway:
  replicas: 2
  maxUnavailable: 1
  tolerations:
    - key: "group"
      operator: "Equal"
      value: "monitoring"
      effect: "PreferNoSchedule"
  nodeSelector:
    group: monitoring
  extraArgs:
    - "-config.expand-env=true"
  extraEnv:
    - name: GRAFANA_LOKI_S3_ENDPOINT
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_endpoint
    - name: AWS_ACCESS_KEY_ID
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_accesskeyid
    - name: AWS_SECRET_ACCESS_KEY
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_secretaccesskey

ruler:
  replicas: 1
  tolerations:
    - key: "group"
      operator: "Equal"
      value: "monitoring"
      effect: "PreferNoSchedule"
  nodeSelector:
    group: monitoring
  extraArgs:
    - "-config.expand-env=true"
  extraEnv:
    - name: GRAFANA_LOKI_S3_ENDPOINT
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_endpoint
    - name: AWS_ACCESS_KEY_ID
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_accesskeyid
    - name: AWS_SECRET_ACCESS_KEY
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_secretaccesskey

resultsCache:
  tolerations:
    - key: "group"
      operator: "Equal"
      value: "monitoring"
      effect: "PreferNoSchedule"
  nodeSelector:
    group: monitoring

chunksCache:
  batchSize: 10
  parallelism: 10
  tolerations:
    - key: "group"
      operator: "Equal"
      value: "monitoring"
      effect: "PreferNoSchedule"
  nodeSelector:
    group: monitoring

monitoring:
  dashboards:
    enabled: true
  rules:
    enabled: true
  serviceMonitor:
    labels:
      release: kube-prometheus-stack
    namespaceSelector:
      matchNames:
        - monitoring
    enabled: true

Here is the log format, in JSON:

{"Rate_Limited":"null","Rate_Limit_Log":"null","asn":"5645","client_ip":"22.26.13.157","fastly_is_edge":true,"fastly_server":"cache-yyz4550-YYZ","geo_city":"north york","geo_country":"canada","host":"uwu.com","request_method":"GET","request_protocol":"HTTP/2","request_referer":"https://uwu.com/?utm_source=propellerads\u0026utm_campaign=popunders_win_Canada\u0026utm_medium=paid","request_user_agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:132.0) Gecko/20100101 Firefox/132.0","response_body_size":78,"response_reason":"Found","response_state":"HIT-CLUSTER","response_status":302,"timestamp":"2024-11-05T05:07:05+0000","url":"/resized-image?width=23\u0026height=16\u0026game_code=evo_koreanspeakingspeedbaccarat"}

Here is our Fluent Bit config for forwarding the logs to Loki:

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluentbit-uwu-config
  namespace: monitoring
  labels:
    app.kubernetes.io/instance: fluentbit-fastly
    app.kubernetes.io/name: fluentbit-uwu
    app.kubernetes.io/version: 2.7.2
data:
  labels-map.json: |
    {
      "Rate_Limited": "Rate_Limited",
      "asn": "asn",
      "client_ip": "client_ip",
      "geo_country": "geo_country",
      "host": "host",
      "request_method": "request_method",
      "request_user_agent": "request_user_agent",
      "response_state": "response_state",
      "response_status": "response_status",
      "url": "url"
    }
  fluent-bit.conf: |
    [SERVICE]
      HTTP_Server on
      HTTP_Listen 0.0.0.0
      HTTP_PORT 2020
    [INPUT]
      Name http
      Listen 0.0.0.0
      Port 8082
      Tag uwu
    [OUTPUT]
      Name grafana-loki
      Match uwu
      Url http://loki-gateway/api/prom/push
      RemoveKeys source
      Labels {job="fastly-cdn-uwu"}
      BatchWait 1
      BatchSize 524288
      LineFormat json
      LogLevel info
---

At first we tried to add more labels to our logs to make them easier to query, but we got an error like this from Loki, so we removed the additional labels again:

level=warn caller=client.go:379 id=0 component=client host=loki-gateway msg="error sending batch, will retry" status=429 error="server returned HTTP status 429 Too Many Requests (429): Maximum active stream limit exceeded, reduce the number of active streams (reduce labels or reduce label values), or contact your Loki administrator to see if the limit can be increased, user: 'fake'"
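
The error is about stream cardinality: every unique combination of label values creates a new stream, so high-cardinality fields (client_ip, url, request_user_agent and the like) are better extracted at query time with the json parser than promoted to labels. Purely to illustrate what the error message is asking for (a sketch, not our exact config), a labels map restricted to the low-cardinality fields would look something like this:

labels-map.json: |
  {
    "host": "host",
    "request_method": "request_method",
    "response_state": "response_state",
    "response_status": "response_status",
    "geo_country": "geo_country"
  }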

So we tried some LogQL queries to visualize our data. They work well on small data sets, but with our production data (around 400k logs per 5 minutes) we get 504 timeouts and the querier pods get OOM-killed.

## Get top 15 URL
topk(15, sum by (url) (count_over_time({job="$fastly_cdn"} | json [30m])))

## Get top 15 ASN
topk(15, sum by (asn) (count_over_time({job="$fastly_cdn"} | json [$__interval])))

## Get top 15 visitors
topk(15, sum by (client_ip, geo_country) (count_over_time({job="$fastly_cdn"} | json | __error__ = "" [30m])))

## Req per status code
sum by (response_status) (count_over_time({job="$fastly_cdn"} | json | response_status=~"200|401|403|500|501|503|0" | __error__="" [$__interval]))

So, our questions are:

  1. Are there any recommendations for optimizing query performance? We aim to visualize one month of data, roughly 3.45 billion log lines.
  2. Are there any resource/architecture recommendations? Do we perhaps need more nodes?

Query performance for Loki primarily comes from distribution. Some things for you to try:

  1. First, you absolutely need to enable and configure the query frontend correctly so that query splitting works. See the Query frontend example in the Grafana Loki documentation. I recommend the pull method: set frontend_address (the internal address of your query frontend containers, or of your read containers if you run the simple scalable deployment).
  2. I’d recommend a bigger value for split_queries_by_interval. Set it to 1h or 2h.
  3. You’ll need more queriers (or read containers if you run SSD). You’ll have to test this a bit or run some calculations. For example, say you have query splitting working correctly and scale up to 10 read containers: run a performance test with LogQL, query, say, two days of data with a throwaway query, see how much data you can process per second, check whether that fits within your timeout window, and decide whether to scale further.
  4. If you have tons of logs you might consider using multiple S3 buckets. I’d only do this if you actually run into rate limiting errors from S3.
  5. You may also want to consider extending your max chunk age a bit (we use 3h), so you end up with fewer small files written to S3. Remember to adjust query_ingesters_within as well if you change this. A rough config sketch pulling these suggestions together follows this list.
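
A rough sketch of #1, #2 and #5 combined (the frontend address is a placeholder for your environment; query_ingesters_within just needs to stay above max_chunk_age):

loki:
  ingester:
    max_chunk_age: 3h             # fewer, larger chunks written to S3
  querier:
    query_ingesters_within: 4h    # keep this above max_chunk_age
  limits_config:
    split_queries_by_interval: 2h
  frontend_worker:
    # pull mode: queriers dequeue split queries from the query frontend
    frontend_address: <loki_frontend_internal_address>:<loki_grpc_port>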

Hey @tonyswumac ,

Thanks for your suggestions. Here is my current config; I now have 6 nodes (4 vCPUs, 8 GB RAM each).

---
loki:
  auth_enabled: false
  schemaConfig:
    configs:
      - from: 2024-10-10
        store: tsdb
        object_store: s3
        schema: v13
        index:
          prefix: loki_index_
          period: 24h
  ingester:
    chunk_encoding: snappy
    chunk_idle_period: 1h
    chunk_target_size: 1536000
    max_chunk_age: 3h
    wal:
      enabled: false

  tracing:
    enabled: true

  querier:
    # Default is 4, if you have enough memory and CPU you can increase, reduce if OOMing
    max_concurrent: 4

  grpc_client:
    max_send_msg_size: 999999999999 
    max_recv_msg_size: 999999999999 

  compactor:
    apply_retention_interval: 24h
    compaction_interval: 5m
    retention_delete_worker_count: 1000
    retention_enabled: true
    retention_delete_delay: 2h
    working_directory: /var/loki/data/compactor
    delete_request_store: aws

  frontend:
    log_queries_longer_than: 30s
    compress_responses: true

  frontend_worker:
    grpc_client_config:
      max_send_msg_size: 999999999999 
      max_recv_msg_size: 999999999999 
  server:
    http_server_read_timeout: 1200s # allow longer time span queries
    http_server_idle_timeout: 1200s
    http_server_write_timeout: 1200s # allow longer time span queries
    grpc_server_max_recv_msg_size: 999999999999 
    grpc_server_max_send_msg_size: 999999999999 
    grpc_server_max_concurrent_streams: 10000000
    grpc_server_max_connection_age: 15m
    grpc_server_max_connection_age_grace: 15m
    grpc_server_max_connection_idle: 10m

  limits_config:
    max_query_series: 1000000
    reject_old_samples: true
    reject_old_samples_max_age: 336h
    retention_period: 720h
    max_query_parallelism: 100
    max_entries_limit_per_query: 1000000
    max_global_streams_per_user: 0
    query_timeout: 30m
    split_queries_by_interval: 2h
    unordered_writes: true
    shard_streams:
      enabled: true

  storage:
    bucketNames:
      chunks: uwu-prod-s3-loki-data-storage
      ruler: uwu-prod-s3-loki-data-storage
    type: s3
    s3:
      region: eu-west-2
      endpoint: s3.eu-west-2.amazonaws.com
      s3forcepathstyle: false
      insecure: false
      accessKeyId: "${AWS_ACCESS_KEY_ID}"
      secretAccessKey: "${AWS_SECRET_ACCESS_KEY}"

deploymentMode: Distributed

ingester:
  replicas: 6
  tolerations:
    - key: "group"
      operator: "Equal"
      value: "monitoring"
      effect: "PreferNoSchedule"
  nodeSelector:
    group: monitoring
  extraArgs:
    - "-config.expand-env=true"
  extraEnv:
    - name: GRAFANA_LOKI_S3_ENDPOINT
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_endpoint
    - name: AWS_ACCESS_KEY_ID
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_accesskeyid
    - name: AWS_SECRET_ACCESS_KEY
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_secretaccesskey

querier:
  replicas: 6
  maxUnavailable: 2
  tolerations:
    - key: "group"
      operator: "Equal"
      value: "monitoring"
      effect: "PreferNoSchedule"
  nodeSelector:
    group: monitoring
  extraArgs:
    - "-config.expand-env=true"
  extraEnv:
    - name: GRAFANA_LOKI_S3_ENDPOINT
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_endpoint
    - name: AWS_ACCESS_KEY_ID
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_accesskeyid
    - name: AWS_SECRET_ACCESS_KEY
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_secretaccesskey

queryFrontend:
  replicas: 3
  maxUnavailable: 1
  tolerations:
    - key: "group"
      operator: "Equal"
      value: "monitoring"
      effect: "PreferNoSchedule"
  nodeSelector:
    group: monitoring
  extraArgs:
    - "-config.expand-env=true"
  extraEnv:
    - name: GRAFANA_LOKI_S3_ENDPOINT
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_endpoint
    - name: AWS_ACCESS_KEY_ID
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_accesskeyid
    - name: AWS_SECRET_ACCESS_KEY
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_secretaccesskey

queryScheduler:
  replicas: 3
  tolerations:
    - key: "group"
      operator: "Equal"
      value: "monitoring"
      effect: "PreferNoSchedule"
  nodeSelector:
    group: monitoring
  extraArgs:
    - "-config.expand-env=true"
  extraEnv:
    - name: GRAFANA_LOKI_S3_ENDPOINT
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_endpoint
    - name: AWS_ACCESS_KEY_ID
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_accesskeyid
    - name: AWS_SECRET_ACCESS_KEY
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_secretaccesskey

distributor:
  replicas: 3
  maxUnavailable: 2
  tolerations:
    - key: "group"
      operator: "Equal"
      value: "monitoring"
      effect: "PreferNoSchedule"
  nodeSelector:
    group: monitoring
  extraArgs:
    - "-config.expand-env=true"
  extraEnv:
    - name: GRAFANA_LOKI_S3_ENDPOINT
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_endpoint
    - name: AWS_ACCESS_KEY_ID
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_accesskeyid
    - name: AWS_SECRET_ACCESS_KEY
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_secretaccesskey

compactor:
  replicas: 1
  tolerations:
    - key: "group"
      operator: "Equal"
      value: "monitoring"
      effect: "PreferNoSchedule"
  nodeSelector:
    group: monitoring
  extraArgs:
    - "-config.expand-env=true"
  extraEnv:
    - name: GRAFANA_LOKI_S3_ENDPOINT
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_endpoint
    - name: AWS_ACCESS_KEY_ID
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_accesskeyid
    - name: AWS_SECRET_ACCESS_KEY
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_secretaccesskey

indexGateway:
  replicas: 2
  maxUnavailable: 1
  tolerations:
    - key: "group"
      operator: "Equal"
      value: "monitoring"
      effect: "PreferNoSchedule"
  nodeSelector:
    group: monitoring
  extraArgs:
    - "-config.expand-env=true"
  extraEnv:
    - name: GRAFANA_LOKI_S3_ENDPOINT
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_endpoint
    - name: AWS_ACCESS_KEY_ID
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_accesskeyid
    - name: AWS_SECRET_ACCESS_KEY
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_secretaccesskey

ruler:
  replicas: 1
  tolerations:
    - key: "group"
      operator: "Equal"
      value: "monitoring"
      effect: "PreferNoSchedule"
  nodeSelector:
    group: monitoring
  extraArgs:
    - "-config.expand-env=true"
  extraEnv:
    - name: GRAFANA_LOKI_S3_ENDPOINT
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_endpoint
    - name: AWS_ACCESS_KEY_ID
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_accesskeyid
    - name: AWS_SECRET_ACCESS_KEY
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_secretaccesskey

resultsCache:
  nodeSelector:
    group: monitoring-cache

chunksCache:
  batchSize: 10
  parallelism: 10
  nodeSelector:
    group: monitoring-cache

monitoring:
  dashboards:
    enabled: true
  rules:
    enabled: true
  serviceMonitor:
    labels:
      release: kube-prometheus-stack
    namespaceSelector:
      matchNames:
        - monitoring
    enabled: true

But I still get timeouts with this volume of data (~400k logs per 5 minutes). Can you please help again?

See #1 from my previous reply. Your query frontend is not configured properly, so query splitting isn’t working. Read the documentation as well for examples.

I recommend using pull mode for the query frontend, something like this:

frontend_worker:
  frontend_address: <loki_frontend_internal_address>:<loki_grpc_port>

loki_frontend_internal_address is the internal service discovery address for your query frontend pods.
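
With the chart defaults used in this thread (release name loki, monitoring namespace, default gRPC port 9095), that address would be along the lines of:

frontend_worker:
  frontend_address: "loki-query-frontend.monitoring.svc.cluster.local:9095"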

While trying to add that address, I got this error:

frontend address and scheduler address are mutually exclusive, please use only one

I checked the Helm values file, and in Distributed mode the scheduler address also gets set, which is where the conflict comes from.
So I tried the first approach from the pull-mode docs instead, by specifying this:

  • Specify --frontend.downstream-url or its YAML equivalent, frontend.downstream_url. This proxies requests over HTTP to the specified URL.

Here is my full values file.

---
loki:
  auth_enabled: false
  schemaConfig:
    configs:
      - from: 2024-10-10
        store: tsdb
        object_store: s3
        schema: v13
        index:
          prefix: loki_index_
          period: 24h
  ingester:
    chunk_encoding: snappy
    chunk_idle_period: 1h
    chunk_target_size: 1536000
    max_chunk_age: 3h
    wal:
      enabled: false

  tracing:
    enabled: true

  querier:
    # Default is 4, if you have enough memory and CPU you can increase, reduce if OOMing
    max_concurrent: 4

  grpc_client:
    max_send_msg_size: 999999999999
    max_recv_msg_size: 999999999999

  compactor:
    apply_retention_interval: 24h
    compaction_interval: 5m
    retention_delete_worker_count: 1000
    retention_enabled: true
    retention_delete_delay: 2h
    working_directory: /var/loki/data/compactor
    delete_request_store: aws

  frontend:
    log_queries_longer_than: 30s
    downstream_url: http://loki-querier.monitoring.svc.cluster.local:3100
    compress_responses: true

  frontend_worker:
    grpc_client_config:
      max_send_msg_size: 999999999999
      max_recv_msg_size: 999999999999
    # frontend_address: "loki-query-frontend.monitoring.svc.cluster.local:9095"
    # scheduler_address: ""
  server:
    http_server_read_timeout: 1200s # allow longer time span queries
    http_server_idle_timeout: 1200s
    http_server_write_timeout: 1200s 
    grpc_server_max_recv_msg_size: 999999999999 
    grpc_server_max_send_msg_size: 999999999999 
    grpc_server_max_concurrent_streams: 10000000
    grpc_server_max_connection_age: 15m
    grpc_server_max_connection_age_grace: 15m
    grpc_server_max_connection_idle: 10m

  limits_config:
    max_query_series: 1000000
    reject_old_samples: true
    reject_old_samples_max_age: 336h
    retention_period: 720h
    max_query_parallelism: 100
    max_entries_limit_per_query: 1000000
    max_global_streams_per_user: 0
    query_timeout: 30m
    split_queries_by_interval: 2h
    unordered_writes: true
    shard_streams:
      enabled: true

  storage:
    bucketNames:
      chunks: uwuu-prod-s3-loki-data-storage
      ruler: uwuu-prod-s3-loki-data-storage
    type: s3
    s3:
      region: eu-west-2
      endpoint: s3.eu-west-2.amazonaws.com
      s3forcepathstyle: false
      insecure: false
      accessKeyId: "${AWS_ACCESS_KEY_ID}"
      secretAccessKey: "${AWS_SECRET_ACCESS_KEY}"

deploymentMode: Distributed

ingester:
  replicas: 6
  tolerations:
    - key: "group"
      operator: "Equal"
      value: "monitoring"
      effect: "PreferNoSchedule"
  nodeSelector:
    group: monitoring
  extraArgs:
    - "-config.expand-env=true"
  extraEnv:
    - name: GRAFANA_LOKI_S3_ENDPOINT
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_endpoint
    - name: AWS_ACCESS_KEY_ID
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_accesskeyid
    - name: AWS_SECRET_ACCESS_KEY
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_secretaccesskey

querier:
  replicas: 6
  maxUnavailable: 2
  tolerations:
    - key: "group"
      operator: "Equal"
      value: "monitoring"
      effect: "PreferNoSchedule"
  nodeSelector:
    group: monitoring
  extraArgs:
    - "-config.expand-env=true"
  extraEnv:
    - name: GRAFANA_LOKI_S3_ENDPOINT
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_endpoint
    - name: AWS_ACCESS_KEY_ID
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_accesskeyid
    - name: AWS_SECRET_ACCESS_KEY
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_secretaccesskey

queryFrontend:
  replicas: 3
  maxUnavailable: 1
  tolerations:
    - key: "group"
      operator: "Equal"
      value: "monitoring"
      effect: "PreferNoSchedule"
  nodeSelector:
    group: monitoring
  extraArgs:
    - "-config.expand-env=true"
  extraEnv:
    - name: GRAFANA_LOKI_S3_ENDPOINT
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_endpoint
    - name: AWS_ACCESS_KEY_ID
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_accesskeyid
    - name: AWS_SECRET_ACCESS_KEY
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_secretaccesskey

queryScheduler:
  replicas: 3
  tolerations:
    - key: "group"
      operator: "Equal"
      value: "monitoring"
      effect: "PreferNoSchedule"
  nodeSelector:
    group: monitoring
  extraArgs:
    - "-config.expand-env=true"
  extraEnv:
    - name: GRAFANA_LOKI_S3_ENDPOINT
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_endpoint
    - name: AWS_ACCESS_KEY_ID
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_accesskeyid
    - name: AWS_SECRET_ACCESS_KEY
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_secretaccesskey

distributor:
  replicas: 3
  maxUnavailable: 2
  tolerations:
    - key: "group"
      operator: "Equal"
      value: "monitoring"
      effect: "PreferNoSchedule"
  nodeSelector:
    group: monitoring
  extraArgs:
    - "-config.expand-env=true"
  extraEnv:
    - name: GRAFANA_LOKI_S3_ENDPOINT
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_endpoint
    - name: AWS_ACCESS_KEY_ID
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_accesskeyid
    - name: AWS_SECRET_ACCESS_KEY
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_secretaccesskey

compactor:
  replicas: 1
  tolerations:
    - key: "group"
      operator: "Equal"
      value: "monitoring"
      effect: "PreferNoSchedule"
  nodeSelector:
    group: monitoring
  extraArgs:
    - "-config.expand-env=true"
  extraEnv:
    - name: GRAFANA_LOKI_S3_ENDPOINT
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_endpoint
    - name: AWS_ACCESS_KEY_ID
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_accesskeyid
    - name: AWS_SECRET_ACCESS_KEY
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_secretaccesskey

indexGateway:
  replicas: 2
  maxUnavailable: 1
  tolerations:
    - key: "group"
      operator: "Equal"
      value: "monitoring"
      effect: "PreferNoSchedule"
  nodeSelector:
    group: monitoring
  extraArgs:
    - "-config.expand-env=true"
  extraEnv:
    - name: GRAFANA_LOKI_S3_ENDPOINT
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_endpoint
    - name: AWS_ACCESS_KEY_ID
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_accesskeyid
    - name: AWS_SECRET_ACCESS_KEY
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_secretaccesskey

ruler:
  replicas: 1
  tolerations:
    - key: "group"
      operator: "Equal"
      value: "monitoring"
      effect: "PreferNoSchedule"
  nodeSelector:
    group: monitoring
  extraArgs:
    - "-config.expand-env=true"
  extraEnv:
    - name: GRAFANA_LOKI_S3_ENDPOINT
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_endpoint
    - name: AWS_ACCESS_KEY_ID
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_accesskeyid
    - name: AWS_SECRET_ACCESS_KEY
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_secretaccesskey

resultsCache:
  nodeSelector:
    group: monitoring-cache

chunksCache:
  batchSize: 10
  parallelism: 10
  nodeSelector:
    group: monitoring-cache

monitoring:
  dashboards:
    enabled: true
  rules:
    enabled: true
  serviceMonitor:
    labels:
      release: kube-prometheus-stack
    namespaceSelector:
      matchNames:
        - monitoring
    enabled: true

# optional experimental components
bloomPlanner:
  replicas: 0
bloomBuilder:
  replicas: 0
bloomGateway:
  replicas: 0

# Enable minio for storage
minio:
  enabled: false
  persistence:
    size: 5Gi

# Zero out replica counts of other deployment modes
backend:
  replicas: 0
read:
  replicas: 0
write:
  replicas: 0

singleBinary:
  replicas: 0

test:
  enabled: false
lokiCanary:
  enabled: false

But again, I still got a timeout while trying to run this query:

sum by (response_status) (count_over_time({job="fastly-cdn-uwuu"} | json | response_status=~"200|401|403|500|501|503|0" | __error__="" [3h]))
  1. I believe the documentation recommends using frontend_worker.frontend_address, instead of downstream_url. I would try that.

  2. If you look at logs from your frontend container, do you see evidence of query splitting? You should see labels such as split=0 or split=1 in the frontend logs.

  3. Queries: what’s the time frame for your query? Are you running it over the last day? The past couple of days? The past week?

Hey @tonyswumac ,

I tried using frontend_worker.frontend_address; here is my current config.

---
loki:
  auth_enabled: false
  schemaConfig:
    configs:
      - from: 2024-10-10
        store: tsdb
        object_store: s3
        schema: v13
        index:
          prefix: loki_index_
          period: 24h
  ingester:
    chunk_encoding: snappy
    chunk_idle_period: 1h
    chunk_target_size: 1536000
    max_chunk_age: 3h
    wal:
      enabled: false

  tracing:
    enabled: true

  query_scheduler:
    max_outstanding_requests_per_tenant: 32768

  querier:
    # Default is 4, if you have enough memory and CPU you can increase, reduce if OOMing
    max_concurrent: 4

  grpc_client:
    max_send_msg_size: 999999999999 #1073741824 #209715200
    max_recv_msg_size: 999999999999 #1073741824 #209715200

  compactor:
    apply_retention_interval: 24h
    compaction_interval: 5m
    retention_delete_worker_count: 1000
    retention_enabled: true
    retention_delete_delay: 2h
    working_directory: /var/loki/data/compactor
    delete_request_store: aws

  frontend:
    log_queries_longer_than: 30s
    compress_responses: true
    scheduler_address: ""

  frontend_worker:
    grpc_client_config:
      max_send_msg_size: 999999999999 #1073741824 #209715200
      max_recv_msg_size: 999999999999 #1073741824 #209715200
    frontend_address: "loki-query-frontend.monitoring.svc.cluster.local:9095"
    scheduler_address: ""

  server:
    http_server_read_timeout: 1200s # allow longer time span queries
    http_server_idle_timeout: 1200s
    http_server_write_timeout: 1200s # allow longer time span queries
    grpc_server_max_recv_msg_size: 999999999999 
    grpc_server_max_send_msg_size: 999999999999 
    grpc_server_max_concurrent_streams: 10000000
    grpc_server_max_connection_age: 15m
    grpc_server_max_connection_age_grace: 15m
    grpc_server_max_connection_idle: 10m

  limits_config:
    max_query_series: 1000000
    reject_old_samples: true
    reject_old_samples_max_age: 336h
    retention_period: 720h
    max_query_parallelism: 100
    max_entries_limit_per_query: 1000000
    max_global_streams_per_user: 0
    query_timeout: 30m
    split_queries_by_interval: 2h
    unordered_writes: true
    ingestion_burst_size_mb: 200
    ingestion_rate_mb: 100
    ingestion_rate_strategy: local
    per_stream_rate_limit: 100M
    per_stream_rate_limit_burst: 200M
    shard_streams:
      enabled: true

  storage:
    bucketNames:
      chunks: uwu-prod-s3-loki-data-storage
      ruler: uwu-prod-s3-loki-data-storage
    type: s3
    s3:
      region: eu-west-2
      endpoint: s3.eu-west-2.amazonaws.com
      s3forcepathstyle: false
      insecure: false
      accessKeyId: "${AWS_ACCESS_KEY_ID}"
      secretAccessKey: "${AWS_SECRET_ACCESS_KEY}"

deploymentMode: Distributed

ingester:
  replicas: 6
  tolerations:
    - key: "group"
      operator: "Equal"
      value: "monitoring"
      effect: "PreferNoSchedule"
  nodeSelector:
    group: monitoring
  extraArgs:
    - "-config.expand-env=true"
  extraEnv:
    - name: GRAFANA_LOKI_S3_ENDPOINT
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_endpoint
    - name: AWS_ACCESS_KEY_ID
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_accesskeyid
    - name: AWS_SECRET_ACCESS_KEY
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_secretaccesskey

querier:
  replicas: 6
  maxUnavailable: 2
  tolerations:
    - key: "group"
      operator: "Equal"
      value: "monitoring"
      effect: "PreferNoSchedule"
  nodeSelector:
    group: monitoring
  extraArgs:
    - "-config.expand-env=true"
  extraEnv:
    - name: GRAFANA_LOKI_S3_ENDPOINT
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_endpoint
    - name: AWS_ACCESS_KEY_ID
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_accesskeyid
    - name: AWS_SECRET_ACCESS_KEY
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_secretaccesskey

queryFrontend:
  replicas: 3
  maxUnavailable: 1
  tolerations:
    - key: "group"
      operator: "Equal"
      value: "monitoring"
      effect: "PreferNoSchedule"
  nodeSelector:
    group: monitoring
  extraArgs:
    - "-config.expand-env=true"
  extraEnv:
    - name: GRAFANA_LOKI_S3_ENDPOINT
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_endpoint
    - name: AWS_ACCESS_KEY_ID
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_accesskeyid
    - name: AWS_SECRET_ACCESS_KEY
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_secretaccesskey

queryScheduler:
  replicas: 3
  tolerations:
    - key: "group"
      operator: "Equal"
      value: "monitoring"
      effect: "PreferNoSchedule"
  nodeSelector:
    group: monitoring
  extraArgs:
    - "-config.expand-env=true"
  extraEnv:
    - name: GRAFANA_LOKI_S3_ENDPOINT
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_endpoint
    - name: AWS_ACCESS_KEY_ID
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_accesskeyid
    - name: AWS_SECRET_ACCESS_KEY
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_secretaccesskey

distributor:
  replicas: 3
  maxUnavailable: 2
  tolerations:
    - key: "group"
      operator: "Equal"
      value: "monitoring"
      effect: "PreferNoSchedule"
  nodeSelector:
    group: monitoring
  extraArgs:
    - "-config.expand-env=true"
  extraEnv:
    - name: GRAFANA_LOKI_S3_ENDPOINT
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_endpoint
    - name: AWS_ACCESS_KEY_ID
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_accesskeyid
    - name: AWS_SECRET_ACCESS_KEY
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_secretaccesskey

compactor:
  replicas: 1
  tolerations:
    - key: "group"
      operator: "Equal"
      value: "monitoring"
      effect: "PreferNoSchedule"
  nodeSelector:
    group: monitoring
  extraArgs:
    - "-config.expand-env=true"
  extraEnv:
    - name: GRAFANA_LOKI_S3_ENDPOINT
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_endpoint
    - name: AWS_ACCESS_KEY_ID
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_accesskeyid
    - name: AWS_SECRET_ACCESS_KEY
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_secretaccesskey

indexGateway:
  replicas: 2
  maxUnavailable: 1
  tolerations:
    - key: "group"
      operator: "Equal"
      value: "monitoring"
      effect: "PreferNoSchedule"
  nodeSelector:
    group: monitoring
  extraArgs:
    - "-config.expand-env=true"
  extraEnv:
    - name: GRAFANA_LOKI_S3_ENDPOINT
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_endpoint
    - name: AWS_ACCESS_KEY_ID
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_accesskeyid
    - name: AWS_SECRET_ACCESS_KEY
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_secretaccesskey

ruler:
  replicas: 1
  tolerations:
    - key: "group"
      operator: "Equal"
      value: "monitoring"
      effect: "PreferNoSchedule"
  nodeSelector:
    group: monitoring
  extraArgs:
    - "-config.expand-env=true"
  extraEnv:
    - name: GRAFANA_LOKI_S3_ENDPOINT
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_endpoint
    - name: AWS_ACCESS_KEY_ID
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_accesskeyid
    - name: AWS_SECRET_ACCESS_KEY
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_secretaccesskey

gateway:
  enabled: true
  replicas: 2
  ingress:
    enabled: true
    ingressClassName: "nginx"
    hosts:
      - host: lok-gateway.uwu.com
        paths:
          - path: /
            pathType: Prefix
    tls:
      - secretName: uwu-cert
        hosts:
          - lok-gateway.uwu.com

resultsCache:
  nodeSelector:
    group: monitoring-cache

chunksCache:
  batchSize: 10
  parallelism: 10
  nodeSelector:
    group: monitoring-cache

serviceMonitor:
  enabled: false
  labels:
    release: kube-prometheus-stack
  namespaceSelector:
    matchNames:
      - monitoring

monitoring:
  dashboards:
    enabled: true
  rules:
    enabled: false
  serviceMonitor:
    labels:
      release: kube-prometheus-stack
    namespaceSelector:
      matchNames:
        - monitoring
    enabled: false

# optional experimental components
bloomPlanner:
  replicas: 0
bloomBuilder:
  replicas: 0
bloomGateway:
  replicas: 0

# Enable minio for storage
minio:
  enabled: false
  persistence:
    size: 5Gi

# Zero out replica counts of other deployment modes
backend:
  replicas: 0
read:
  replicas: 0
write:
  replicas: 0

singleBinary:
  replicas: 0

test:
  enabled: false
lokiCanary:
  enabled: false

I can see labels split=0 or split=1 in my logs.

level=info ts=2024-11-24T08:47:51.376356836Z caller=metrics.go:217 component=frontend org_id=fake traceID=705b35b2e8010ba2 latency=fast query="{namespace=\"monitoring\"} |= ``" query_hash=2983721522 query_type=filter range_type=range length=1h0m0s start_delta=1h0m5.818346386s end_delta=5.818346527s step=1h0m0s duration=165.412727ms status=200 limit=10 returned_lines=0 throughput=34MB total_bytes=5.7MB total_bytes_structured_metadata=306kB lines_per_second=157805 total_lines=26103 post_filter_lines=26103 total_entries=10 store_chunks_download_time=142.406282ms queue_time=0s splits=1 shards=1 query_referenced_structured_metadata=false pipeline_wrapper_filtered_lines=0 chunk_refs_fetch_time=3.520144ms cache_chunk_req=10 cache_chunk_hit=8 cache_chunk_bytes_stored=543993 cache_chunk_bytes_fetched=231566 cache_chunk_download_time=1.28778ms cache_index_req=0 cache_index_hit=0 cache_index_download_time=0s cache_stats_results_req=0 cache_stats_results_hit=0 cache_stats_results_download_time=0s cache_volume_results_req=0 cache_volume_results_hit=0 cache_volume_results_download_time=0s cache_result_req=1 cache_result_hit=0 cache_result_download_time=460.208µs cache_result_query_length_served=0s ingester_chunk_refs=0 ingester_chunk_downloaded=0 ingester_chunk_matches=84 ingester_requests=6 ingester_chunk_head_bytes=3.9MB ingester_chunk_compressed_bytes=506kB ingester_chunk_decompressed_bytes=0B ingester_post_filter_lines=17755 congestion_control_latency=0s index_total_chunks=0 index_post_bloom_filter_chunks=0 index_bloom_filter_ratio=0.00 index_shard_resolver_duration=0s source=datasample disable_pipeline_wrappers=false

But my loki-querier and loki-query-frontend pods keep getting Evicted due to memory usage. I’m only filtering the last 12h, and I intermittently get 504 timeouts, especially when running the query for the top URLs.

topk(15, sum by (url) (count_over_time({job="fastly-prod-logs"} | json [12h])))

Do you have any suggestions, please, @tonyswumac?

Thank you

If you see your queries being split, then you should probably start doing a bit of calculation:

  1. How much log data are you actually trying to query against?
  2. Have you looked at your cluster metrics and identified where the resource pressure is coming from? (A sketch of setting explicit querier resources in the Helm values follows this list.)
  3. If you use logcli and run a query over 24 or 48 hours, it should give you a set of statistics that tell you roughly how much data you are trying to process, such as bytes processed per second, lines processed per second, total bytes processed, bytes downloaded, and execution time.
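
Also, since your querier and query frontend pods are being evicted for memory, one concrete thing to check while gathering those metrics is whether they have explicit resource requests and limits. A minimal sketch in the Helm values (the sizes are placeholders to tune against your 4 vCPU / 8 GB nodes):

querier:
  replicas: 6
  resources:
    requests:        # requests move the pods out of BestEffort QoS, so they are not the first to be evicted
      cpu: "1"
      memory: 3Gi
    limits:
      memory: 6Gi    # keep below node capacity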