Greetings,
First and foremost, thank you for the great work you are putting into Loki and the whole suite in general!
We've been trying to integrate Grafana Loki into our setup. Until now we had our own log collection system, based on SQL.
We're super thrilled to make this happen! Unfortunately, despite the effort we've put in over the last couple of months, we haven't gotten there yet.
Our setup
- In production we are ingesting 50 GB of logs a day. 20% of the lines exceed the 128k max_line_size, so we truncate them on Loki's side.
- Deployment type: SSD (simple scalable deployment). We use Loki 3.1.
- We have 5 m6a.2xlarge AWS nodes on an EKS 1.28 cluster nodegroup, dedicated to the Loki Helm chart.
- We have 5 replicas each for read, write and backend.
- We moved from EFS to EBS with gp3, provisioned with iops = "500" and throughput = "125".
- We use structured_metadata with trace_id as a label (see issues below).
- We have used TSDB since we started testing Loki. We're on schema v13.
- Our limits_config block:
  limits_config:
    reject_old_samples_max_age: 30d
    max_line_size_truncate: true
    split_queries_by_interval: 1h
    query_timeout: 10m
    tsdb_max_query_parallelism: 100
    retention_period: 90d
    retention_stream:
      - selector: '{log_stream="chargebee"}'
        period: 372d
        priority: 100
- We turned on both caches in the chart:
  chunksCache:
    nodeSelector: { Usage: logs }
    replicas: 3
    batchSize: 256
    parallelism: 10
    allocatedMemory: 4096
    connectionLimit: 1024
    maxItemMemory: 2m
  resultsCache:
    nodeSelector: { Usage: logs }
    replicas: 3
    default_validity: 12h
  compactor:
    # Activate custom (per-stream, per-tenant) retention.
    retention_enabled: true
    working_directory: /retention
    delete_request_store: s3
    retention_delete_delay: 2h
    retention_delete_worker_count: 150
    compaction_interval: 10m
- Automatic stream sharding is turned on and seems to work.
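For context, the block in limits_config that controls automatic stream sharding looks roughly like this; the values here are illustrative rather than our exact settings:
  limits_config:
    shard_streams:
      # Split very large streams into shards on the write path.
      enabled: true
      # Target per-shard ingestion rate; the value shown is only an example.
      desired_rate: 1536KB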
We are mostly hitting 2 major issues:
- In one of our scenarios, we use Loki to run a query such as:
  {namespace="default", channel=~"yyyy"} | json trace_id="trace_id" | trace_id="1716130200.FsjYw8PUjq0lduNxLyr"
  This is sometimes referred to as a "needle in a haystack" query. We ran into extremely high CPU rates across all the nodes. On a `30d` time range we also generally experience a timeout in the result in `grafana`.
- We use Promtail for ingestion. We noticed that it gets HTTP 429 responses, which kick off its retry mechanism, and after some time testing Loki against our legacy system we found that Loki is missing some of the log lines. We suspect the retry threshold is being exceeded, in which case those lines would be dropped. We are still investigating…
On the Loki side, at first we were seeing the following error practically all the time in the ingester logs:
Ingestion rate limit exceeded for user fake (limit: 838860 bytes/sec) while attempting to ingest '229' lines totaling '764851' bytes, reduce log volume or contact your Loki administrator to see if the limit can be increased
After playing a lot with limits_config, we eventually realized we needed to scale our nodes up vertically one notch to make these errors disappear. Which it did, but we are still occasionally "losing" log lines… and this is quite problematic in our eyes.
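For reference, the limits behind that particular error live in limits_config; the values below are purely illustrative, not what we ended up with:
  limits_config:
    # Per-tenant ingestion rate (MB/s) and burst size (MB); example values only.
    ingestion_rate_mb: 8
    ingestion_burst_size_mb: 16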
As mentioned, we've been working on Loki for quite a while and have reached a dead end with these issues. If there is anything else we can or should do to make this better, please advise.
Last point: we are aware of the new experimental bloom filters feature. We would like to hold off on it and exhaust every other option first.
NB: we are still not running Prometheus… and so we have no access to Loki's metrics (for now…).
Thank you :]
M.
- Can you share your entire Loki configuration, please?
- You can adjust max_line_size as well to accept bigger logs (see the sketch after this list).
- You may want to adjust the grpc message size for server components in general:
  server:
    # 100MB
    grpc_server_max_recv_msg_size: 1.048576e+08
    grpc_server_max_send_msg_size: 1.048576e+08
- You may find better success with smaller reader containers but have more of them.
- In general, if logs are missing you should be able to see why in Loki's logs. If you are using Promtail, it has an exponential backoff configuration (sketch below). We have this set for all our Promtail agents, which can store up to 30 minutes of logs locally if Loki were unavailable.
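To illustrate the backoff point, a minimal sketch of the Promtail client section; the URL and the values here are assumptions you would adapt to your environment:
  clients:
    - url: http://loki-gateway/loki/api/v1/push
      backoff_config:
        # Exponential backoff between retries while Loki is unavailable or returns 429.
        min_period: 500ms
        max_period: 5m
        max_retries: 20
And for the max_line_size suggestion above, something along these lines in limits_config (the size is just an example):
  limits_config:
    # Accept longer lines instead of relying only on truncation.
    max_line_size: 256KB
    max_line_size_truncate: true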
Hello Tony
Thank you for jumping on this!
I am looking into your suggestions and will report back later with what I've found out.
I have one essential question though, regarding number 4: "You may find better success with smaller reader containers but have more of them."
How does this translate in terms of deployment specs? Should I limit the read pods' resources using read.resources.limits.cpu / read.resources.limits.memory, and have more replicas? Is this what you meant by "smaller"?
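For concreteness, here is roughly what I imagine in the chart values; the numbers are placeholders, not a recommendation:
  read:
    # More, smaller readers instead of a few large ones.
    replicas: 8
    resources:
      requests:
        cpu: "1"
        memory: 2Gi
      limits:
        cpu: "2"
        memory: 4Gi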
Either way, the pod anti-affinity rule in use won't allow me to schedule more than one read pod on those nodes…
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app.kubernetes.io/component: read
          topologyKey: kubernetes.io/hostname
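For completeness, I suppose the rule could be relaxed into a soft preference so that more than one read pod can land on the same node; a sketch of what I mean, not something we have tried:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchLabels:
                app.kubernetes.io/component: read
            topologyKey: kubernetes.io/hostname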
If you believe the microservices deployment mode is better in our case, please let me know. My thinking is that we should be able to work it out with the current setup, but I may be wrong!
Our full config (after masking some sensitive values…):
global:
  dnsService: 'kube-dns'
write:
  nodeSelector: { Usage: logs }
  persistence:
    storageClass: 'loki-gp3-sc'
  replicas: 5
read:
  nodeSelector: { Usage: logs }
  persistence:
    storageClass: 'loki-gp3-sc'
  replicas: 5
backend:
  replicas: 5
  nodeSelector: { Usage: logs }
  extraVolumeMounts:
    - mountPath: /retention
      name: retention
  extraVolumes:
    - emptyDir: {}
      name: retention
  persistence:
    storageClass: 'loki-gp3-sc'
gateway:
  nodeSelector: { Usage: logs }
loki:
  auth_enabled: false
  storage:
    type: 's3'
    s3:
      region: eu-central-1
    bucketNames:
      chunks: prod-loki-log-streams
      ruler: prod-loki-log-streams
      admin: prod-loki-log-streams
  limits_config:
    reject_old_samples_max_age: 30d
    max_line_size_truncate: true
    split_queries_by_interval: 1h
    query_timeout: 10m
    tsdb_max_query_parallelism: 100
    retention_period: 90d
    retention_stream:
      - selector: '{log_stream="ABCD"}'
        period: 372d
        priority: 100
  schemaConfig:
    configs:
      - from: 2024-04-30
        store: tsdb
        object_store: aws
        schema: v13
        index:
          prefix: index_
          period: 24h
  querier:
    max_concurrent: 16
  query_scheduler:
    max_outstanding_requests_per_tenant: 3276
  compactor:
    retention_enabled: true
    working_directory: /retention
    delete_request_store: s3
    retention_delete_delay: 2h
    retention_delete_worker_count: 150
    compaction_interval: 10m
  ingester:
    chunk_encoding: snappy
    autoforget_unhealthy: true
serviceAccount:
  name: loki-sa
  annotations:
    {
      eks.amazonaws.com/role-arn: 'arn:aws:iam::ABCD:role/loki-sa-role',
    }
chunksCache:
  nodeSelector: { Usage: logs }
  replicas: 3
  batchSize: 256
  parallelism: 10
  allocatedMemory: 4096
  connectionLimit: 1024
  maxItemMemory: 2m
resultsCache:
  nodeSelector: { Usage: logs }
  replicas: 3
  default_validity: 12h
monitoring:
  dashboards:
    enabled: false
  rules:
    enabled: false
  serviceMonitor:
    enabled: false
  selfMonitoring:
    enabled: false
    grafanaAgent:
      installOperator: false
lokiCanary:
  enabled: false
test:
  enabled: false
Lastly, regarding Promtail: our configuration is pretty basic. I will definitely look into the backoff configs and see how that plays out for us.
For the record:
- Most of our performance issues were solved by changing what we ingest and making our ingested log lines "thinner" (see the sketch after this list).
- There is no real magic here: there are no Loki settings you can change that will drastically alter the way the system performs. When both the write and read paths run on the same nodes, the only way to make this work is to optimize what is getting ingested and labeled.
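To give an idea of what "thinner" means in practice, a sketch of the kind of Promtail pipeline stages involved; the stage names are real, but the threshold and the label are made up for the example:
  pipeline_stages:
    # Drop pathologically long lines at the agent instead of shipping them (threshold is illustrative).
    - drop:
        longer_than: 128kb
        drop_counter_reason: line_too_long
    # Drop a high-cardinality label before it reaches Loki (label name is an example).
    - labeldrop:
        - filename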
Thank you for the advice and recommendations here.
M.