Greetings,
First and foremost, thank you for the great work you are putting into Loki and the whole suite in general!
We've been trying to integrate Grafana Loki into our setup. Until now we had our own log collection system, based on SQL.
We're super thrilled to make this happen! Unfortunately, despite the effort we've put in over the last couple of months, we haven't gotten there yet.
Our setup
- In production we are ingesting 50 GB of logs a day. 20% of the lines exceed the 128k max_line_size, so we truncate them on Loki's side.
- Deployment type: SSD (simple scalable deployment). We use Loki 3.1.
- We have 5 m6a.2xlarge AWS nodes on an EKS 1.28 cluster nodegroup, dedicated to the Loki Helm chart.
- We have 5 replicas each for read, write and backend.
- We moved from EFS to EBS with gp3, provisioned with iops = "500" and throughput = "125".
- We use structured_metadata with trace_id as a label (see issues below).
- We have used TSDB since we started testing Loki. We're on schema v13.
- Our limits_config block:
  limits_config:
    reject_old_samples_max_age: 30d
    max_line_size_truncate: true
    split_queries_by_interval: 1h
    query_timeout: 10m
    tsdb_max_query_parallelism: 100
    retention_period: 90d
    retention_stream:
      - selector: '{log_stream="chargebee"}'
        period: 372d
        priority: 100
- We turned on both caches in the chart:
  chunksCache:
    nodeSelector: { Usage: logs }
    replicas: 3
    batchSize: 256
    parallelism: 10
    allocatedMemory: 4096
    connectionLimit: 1024
    maxItemMemory: 2m
  resultsCache:
    nodeSelector: { Usage: logs }
    replicas: 3
    default_validity: 12h
  compactor:
    # Activate custom (per-stream, per-tenant) retention.
    retention_enabled: true
    working_directory: /retention
    delete_request_store: s3
    retention_delete_delay: 2h
    retention_delete_worker_count: 150
    compaction_interval: 10m
- Automatic stream sharding is turned on and seems to work.
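For context, the block in limits_config that controls automatic stream sharding looks roughly like this; the values here are illustrative rather than our exact settings:
  limits_config:
    shard_streams:
      # Split very large streams into shards on the write path.
      enabled: true
      # Target per-shard ingestion rate; the value shown is only an example.
      desired_rate: 1536KB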
We are mostly hitting 2 major issues:
- In one of our scenarios, we use Loki to run a query such as:
  {namespace="default", channel=~"yyyy"} | json trace_id="trace_id" | trace_id="1716130200.FsjYw8PUjq0lduNxLyr"
  This is sometimes referred to as a "needle in a haystack" query. We ran into extremely high CPU rates across all the nodes. On a `30d` time range we also generally experience a timeout in the result in `grafana`.
- We use Promtail for ingestion. We noticed that it gets HTTP 429 responses, which kick off its retry mechanism, and after some time testing Loki against our legacy system we found that Loki is missing some of the log lines. We suspect the retry threshold is being exceeded, in which case those lines would be dropped. We are still investigating…
On the Loki side, at first we were seeing the following error practically all the time in the ingester logs:
Ingestion rate limit exceeded for user fake (limit: 838860 bytes/sec) while attempting to ingest '229' lines totaling '764851' bytes, reduce log volume or contact your Loki administrator to see if the limit can be increased
After playing a lot with limits_config, we eventually realized we needed to scale our nodes up vertically one notch to make these errors disappear. Which it did, but we are still occasionally "losing" log lines… and this is quite problematic in our eyes.
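For reference, the limits behind that particular error live in limits_config; the values below are purely illustrative, not what we ended up with:
  limits_config:
    # Per-tenant ingestion rate (MB/s) and burst size (MB); example values only.
    ingestion_rate_mb: 8
    ingestion_burst_size_mb: 16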
As mentioned, we've been working on Loki for quite a while and have reached a dead end with these issues. If there is anything else we can or should do to make this better, please advise.
Last point: we are aware of the new experimental bloom filters feature. We would like to hold off on it and exhaust every other option first.
NB: we are still not running Prometheus… and so we have no access to Loki's metrics (for now…).
Thank you :]
M.
- Can you share your entire Loki configuration, please?
- You can adjust max_line_size as well to accept bigger logs (see the sketch after this list).
- You may want to adjust the grpc message size for server components in general:
  server:
    # 100MB
    grpc_server_max_recv_msg_size: 1.048576e+08
    grpc_server_max_send_msg_size: 1.048576e+08
- You may find better success with smaller reader containers but have more of them.
- In general, if logs are missing you should be able to see why in Loki's logs. If you are using Promtail, it has an exponential backoff configuration (sketch below). We have this set for all our Promtail agents, which can store up to 30 minutes of logs locally if Loki were unavailable.
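To illustrate the backoff point, a minimal sketch of the Promtail client section; the URL and the values here are assumptions you would adapt to your environment:
  clients:
    - url: http://loki-gateway/loki/api/v1/push
      backoff_config:
        # Exponential backoff between retries while Loki is unavailable or returns 429.
        min_period: 500ms
        max_period: 5m
        max_retries: 20
And for the max_line_size suggestion above, something along these lines in limits_config (the size is just an example):
  limits_config:
    # Accept longer lines instead of relying only on truncation.
    max_line_size: 256KB
    max_line_size_truncate: true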
Hello Tony
Thank you for jumping on this!
I am looking into your suggestions and will report back later with what I've found out.
I have one essential question though, regarding number 4: "You may find better success with smaller reader containers but have more of them."
How does this translate in terms of deployment specs? Should I limit the read pods' resources using read.resources.limits.cpu / read.resources.limits.memory, and have more replicas? Is this what you meant by "smaller"?
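For concreteness, here is roughly what I imagine in the chart values; the numbers are placeholders, not a recommendation:
  read:
    # More, smaller readers instead of a few large ones.
    replicas: 8
    resources:
      requests:
        cpu: "1"
        memory: 2Gi
      limits:
        cpu: "2"
        memory: 4Gi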
Either way, the pod anti-affinity rule in use won't allow me to schedule more than one read pod on those nodes…
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app.kubernetes.io/component: read
          topologyKey: kubernetes.io/hostname
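For completeness, I suppose the rule could be relaxed into a soft preference so that more than one read pod can land on the same node; a sketch of what I mean, not something we have tried:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchLabels:
                app.kubernetes.io/component: read
            topologyKey: kubernetes.io/hostname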
If you believe the microservices deployment mode is better in our case, please let me know. My thinking is that we should be able to work it out with the current setup, but I may be wrong!
Our full config (after masking some sensitive values…):
global:
  dnsService: 'kube-dns'
write:
  nodeSelector: { Usage: logs }
  persistence:
    storageClass: 'loki-gp3-sc'
  replicas: 5
read:
  nodeSelector: { Usage: logs }
  persistence:
    storageClass: 'loki-gp3-sc'
  replicas: 5
backend:
  replicas: 5
  nodeSelector: { Usage: logs }
  extraVolumeMounts:
    - mountPath: /retention
      name: retention
  extraVolumes:
    - emptyDir: {}
      name: retention
  persistence:
    storageClass: 'loki-gp3-sc'
gateway:
  nodeSelector: { Usage: logs }
loki:
  auth_enabled: false
  storage:
    type: 's3'
    s3:
      region: eu-central-1
    bucketNames:
      chunks: prod-loki-log-streams
      ruler: prod-loki-log-streams
      admin: prod-loki-log-streams
  limits_config:
    reject_old_samples_max_age: 30d
    max_line_size_truncate: true
    split_queries_by_interval: 1h
    query_timeout: 10m
    tsdb_max_query_parallelism: 100
    retention_period: 90d
    retention_stream:
      - selector: '{log_stream="ABCD"}'
        period: 372d
        priority: 100
  schemaConfig:
    configs:
      - from: 2024-04-30
        store: tsdb
        object_store: aws
        schema: v13
        index:
          prefix: index_
          period: 24h
  querier:
    max_concurrent: 16
  query_scheduler:
    max_outstanding_requests_per_tenant: 3276
  compactor:
    retention_enabled: true
    working_directory: /retention
    delete_request_store: s3
    retention_delete_delay: 2h
    retention_delete_worker_count: 150
    compaction_interval: 10m
  ingester:
    chunk_encoding: snappy
    autoforget_unhealthy: true
serviceAccount:
  name: loki-sa
  annotations:
    {
      eks.amazonaws.com/role-arn: 'arn:aws:iam::ABCD:role/loki-sa-role',
    }
chunksCache:
  nodeSelector: { Usage: logs }
  replicas: 3
  batchSize: 256
  parallelism: 10
  allocatedMemory: 4096
  connectionLimit: 1024
  maxItemMemory: 2m
resultsCache:
  nodeSelector: { Usage: logs }
  replicas: 3
  default_validity: 12h
monitoring:
  dashboards:
    enabled: false
  rules:
    enabled: false
  serviceMonitor:
    enabled: false
  selfMonitoring:
    enabled: false
    grafanaAgent:
      installOperator: false
lokiCanary:
  enabled: false
test:
  enabled: false
Lastly, regarding Promtail: our configuration is pretty basic. I will definitely look into the backoff configs and see how that plays out for us.
For the record:
- Most of our performance issues were solved by changing what we ingest and making our ingested log lines "thinner" (see the sketch after this list).
- There is no real magic here: there are no Loki settings you can change that will drastically alter the way the system performs. When both the write and read paths run on the same nodes, the only way to make this work is to optimize what is getting ingested and labeled.
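To give an idea of what "thinner" means in practice, a sketch of the kind of Promtail pipeline stages involved; the stage names are real, but the threshold and the label are made up for the example:
  pipeline_stages:
    # Drop pathologically long lines at the agent instead of shipping them (threshold is illustrative).
    - drop:
        longer_than: 128kb
        drop_counter_reason: line_too_long
    # Drop a high-cardinality label before it reaches Loki (label name is an example).
    - labeldrop:
        - filename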
Thank you for the advice and recommendations here.
M.