We’re currently using Loki and Fluent Bit to ship logs from a third-party application. The application produces roughly 400k log lines per 5 minutes. We’re using loki-distributed on our cluster, on 3 nodes shared with the rest of our monitoring stack (4 CPUs, 32 GB RAM). Here is our current config:
---
loki:
  auth_enabled: false
  schemaConfig:
    configs:
      - from: 2024-10-10
        store: tsdb
        object_store: s3
        schema: v13
        index:
          prefix: loki_index_
          period: 24h
  ingester:
    chunk_encoding: snappy
    chunk_idle_period: 1h
    chunk_target_size: 1536000
    max_chunk_age: 1h
    wal:
      enabled: false
  tracing:
    enabled: true
  querier:
    # Default is 4, if you have enough memory and CPU you can increase, reduce if OOMing
    max_concurrent: 8
  grpc_client:
    max_send_msg_size: 999999999999
    max_recv_msg_size: 999999999999
  compactor:
    apply_retention_interval: 24h
    compaction_interval: 5m
    retention_delete_worker_count: 1000
    retention_enabled: true
    retention_delete_delay: 2h
    working_directory: /var/loki/data/compactor
    delete_request_store: aws
  frontend_worker:
    grpc_client_config:
      max_send_msg_size: 999999999999
      max_recv_msg_size: 999999999999
  server:
    http_server_read_timeout: 1200s # allow longer time span queries
    http_server_idle_timeout: 1200s
    http_server_write_timeout: 1200s # allow longer time span queries
    grpc_server_max_recv_msg_size: 999999999999
    grpc_server_max_send_msg_size: 999999999999
    grpc_server_max_concurrent_streams: 10000000
    grpc_server_max_connection_age: 15m
    grpc_server_max_connection_age_grace: 15m
    grpc_server_max_connection_idle: 10m
  limits_config:
    max_query_series: 1000000
    reject_old_samples: true
    reject_old_samples_max_age: 336h
    retention_period: 720h
    max_query_parallelism: 100
    max_entries_limit_per_query: 1000000
    max_global_streams_per_user: 0
    query_timeout: 30m
    split_queries_by_interval: 30m
    unordered_writes: true
    shard_streams:
      enabled: true
  storage:
    bucketNames:
      chunks: loki-s3-data-storage
      ruler: loki-s3-data-storage
    type: s3
    s3:
      region: eu-west-2
      endpoint: s3.eu-west-2.amazonaws.com
      s3forcepathstyle: false
      insecure: false
      accessKeyId: "${AWS_ACCESS_KEY_ID}"
      secretAccessKey: "${AWS_SECRET_ACCESS_KEY}"
deploymentMode: Distributed
ingester:
  replicas: 3
  tolerations:
    - key: "group"
      operator: "Equal"
      value: "monitoring"
      effect: "PreferNoSchedule"
  nodeSelector:
    group: monitoring
  extraArgs:
    - "-config.expand-env=true"
  extraEnv:
    - name: GRAFANA_LOKI_S3_ENDPOINT
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_endpoint
    - name: AWS_ACCESS_KEY_ID
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_accesskeyid
    - name: AWS_SECRET_ACCESS_KEY
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_secretaccesskey
querier:
  replicas: 3
  maxUnavailable: 2
  tolerations:
    - key: "group"
      operator: "Equal"
      value: "monitoring"
      effect: "PreferNoSchedule"
  nodeSelector:
    group: monitoring
  extraArgs:
    - "-config.expand-env=true"
  extraEnv:
    - name: GRAFANA_LOKI_S3_ENDPOINT
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_endpoint
    - name: AWS_ACCESS_KEY_ID
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_accesskeyid
    - name: AWS_SECRET_ACCESS_KEY
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_secretaccesskey
queryFrontend:
  replicas: 2
  maxUnavailable: 1
  tolerations:
    - key: "group"
      operator: "Equal"
      value: "monitoring"
      effect: "PreferNoSchedule"
  nodeSelector:
    group: monitoring
  extraArgs:
    - "-config.expand-env=true"
  extraEnv:
    - name: GRAFANA_LOKI_S3_ENDPOINT
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_endpoint
    - name: AWS_ACCESS_KEY_ID
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_accesskeyid
    - name: AWS_SECRET_ACCESS_KEY
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_secretaccesskey
queryScheduler:
  replicas: 2
  tolerations:
    - key: "group"
      operator: "Equal"
      value: "monitoring"
      effect: "PreferNoSchedule"
  nodeSelector:
    group: monitoring
  extraArgs:
    - "-config.expand-env=true"
  extraEnv:
    - name: GRAFANA_LOKI_S3_ENDPOINT
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_endpoint
    - name: AWS_ACCESS_KEY_ID
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_accesskeyid
    - name: AWS_SECRET_ACCESS_KEY
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_secretaccesskey
distributor:
  replicas: 3
  maxUnavailable: 2
  tolerations:
    - key: "group"
      operator: "Equal"
      value: "monitoring"
      effect: "PreferNoSchedule"
  nodeSelector:
    group: monitoring
  extraArgs:
    - "-config.expand-env=true"
  extraEnv:
    - name: GRAFANA_LOKI_S3_ENDPOINT
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_endpoint
    - name: AWS_ACCESS_KEY_ID
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_accesskeyid
    - name: AWS_SECRET_ACCESS_KEY
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_secretaccesskey
compactor:
  replicas: 1
  tolerations:
    - key: "group"
      operator: "Equal"
      value: "monitoring"
      effect: "PreferNoSchedule"
  nodeSelector:
    group: monitoring
  extraArgs:
    - "-config.expand-env=true"
  extraEnv:
    - name: GRAFANA_LOKI_S3_ENDPOINT
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_endpoint
    - name: AWS_ACCESS_KEY_ID
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_accesskeyid
    - name: AWS_SECRET_ACCESS_KEY
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_secretaccesskey
indexGateway:
  replicas: 2
  maxUnavailable: 1
  tolerations:
    - key: "group"
      operator: "Equal"
      value: "monitoring"
      effect: "PreferNoSchedule"
  nodeSelector:
    group: monitoring
  extraArgs:
    - "-config.expand-env=true"
  extraEnv:
    - name: GRAFANA_LOKI_S3_ENDPOINT
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_endpoint
    - name: AWS_ACCESS_KEY_ID
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_accesskeyid
    - name: AWS_SECRET_ACCESS_KEY
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_secretaccesskey
ruler:
  replicas: 1
  tolerations:
    - key: "group"
      operator: "Equal"
      value: "monitoring"
      effect: "PreferNoSchedule"
  nodeSelector:
    group: monitoring
  extraArgs:
    - "-config.expand-env=true"
  extraEnv:
    - name: GRAFANA_LOKI_S3_ENDPOINT
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_endpoint
    - name: AWS_ACCESS_KEY_ID
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_accesskeyid
    - name: AWS_SECRET_ACCESS_KEY
      valueFrom:
        secretKeyRef:
          name: loki-s3-secrets
          key: grafana_loki_s3_secretaccesskey
resultsCache:
  tolerations:
    - key: "group"
      operator: "Equal"
      value: "monitoring"
      effect: "PreferNoSchedule"
  nodeSelector:
    group: monitoring
chunksCache:
  batchSize: 10
  parallelism: 10
  tolerations:
    - key: "group"
      operator: "Equal"
      value: "monitoring"
      effect: "PreferNoSchedule"
  nodeSelector:
    group: monitoring
monitoring:
  dashboards:
    enabled: true
  rules:
    enabled: true
  serviceMonitor:
    labels:
      release: kube-prometheus-stack
    namespaceSelector:
      matchNames:
        - monitoring
    enabled: true
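For reference, this is how we read the query fan-out for a 30-day dashboard query with these settings (an annotated excerpt of the config above, back-of-the-envelope on our part, we may well be misreading it):

limits_config:
  split_queries_by_interval: 30m   # 30 days / 30m = 1440 subqueries per query
  max_query_parallelism: 100       # at most 100 of those scheduled in parallel
querier:
  max_concurrent: 8                # 3 querier replicas x 8 = 24 subqueries actually running at once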
Here is the log format, in JSON:
{"Rate_Limited":"null","Rate_Limit_Log":"null","asn":"5645","client_ip":"22.26.13.157","fastly_is_edge":true,"fastly_server":"cache-yyz4550-YYZ","geo_city":"north york","geo_country":"canada","host":"uwu.com","request_method":"GET","request_protocol":"HTTP/2","request_referer":"https://uwu.com/?utm_source=propellerads\u0026utm_campaign=popunders_win_Canada\u0026utm_medium=paid","request_user_agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:132.0) Gecko/20100101 Firefox/132.0","response_body_size":78,"response_reason":"Found","response_state":"HIT-CLUSTER","response_status":302,"timestamp":"2024-11-05T05:07:05+0000","url":"/resized-image?width=23\u0026height=16\u0026game_code=evo_koreanspeakingspeedbaccarat"}
Here is our Fluent Bit config that forwards the logs to Loki:
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluentbit-uwu-config
  namespace: monitoring
  labels:
    app.kubernetes.io/instance: fluentbit-fastly
    app.kubernetes.io/name: fluentbit-uwu
    app.kubernetes.io/version: 2.7.2
data:
  labels-map.json: |
    {
      "Rate_Limited": "Rate_Limited",
      "asn": "asn",
      "client_ip": "client_ip",
      "geo_country": "geo_country",
      "host": "host",
      "request_method": "request_method",
      "request_user_agent": "request_user_agent",
      "response_state": "response_state",
      "response_status": "response_status",
      "url": "url"
    }
  fluent-bit.conf: |
    [SERVICE]
        HTTP_Server  on
        HTTP_Listen  0.0.0.0
        HTTP_PORT    2020

    [INPUT]
        Name    http
        Listen  0.0.0.0
        Port    8082
        Tag     uwu

    [OUTPUT]
        Name        grafana-loki
        Match       uwu
        Url         http://loki-gateway/api/prom/push
        RemoveKeys  source
        Labels      {job="fastly-cdn-uwu"}
        BatchWait   1
        BatchSize   524288
        LineFormat  json
        LogLevel    info
---
At first we tried adding more labels to our logs to make them easier to query, but we got an error like this from Loki, so we removed the additional labels:
level=warn caller=client.go:379 id=0 component=client host=loki-gateway msg="error sending batch, will retry" status=429 error="server returned HTTP status 429 Too Many Requests (429): Maximum active stream limit exceeded, reduce the number of active streams (reduce labels or reduce label values), or contact your Loki administrator to see if the limit can be increased, user: 'fake'"
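For context, the extra labels came from the labels-map.json shown above; we believe it was wired into the output with the plugin's LabelMapPath option, roughly like this (reconstructed sketch, not the exact config we ran):

[OUTPUT]
    Name          grafana-loki
    Match         uwu
    Url           http://loki-gateway/api/prom/push
    Labels        {job="fastly-cdn-uwu"}
    # Every field in labels-map.json (including client_ip, url, request_user_agent)
    # becomes a Loki label, so every distinct combination becomes its own stream.
    LabelMapPath  /fluent-bit/etc/labels-map.json
    LineFormat    json

Since client_ip and url are effectively unbounded, the number of active streams grows with roughly the product of their distinct values, which is presumably what tripped the limit.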
We then tried some LogQL queries to visualize the data. They work well on small datasets, but against our production data, at around 400k logs per 5 minutes, we get 504 timeouts and the querier pods are OOM-killed.
## Get top 15 URL
topk(15, sum by (url) (count_over_time({job="$fastly_cdn"} | json [30m])))
## Get top 15 ASN
topk(15, sum by (asn) (count_over_time({job="$fastly_cdn"} | json [$__interval])))
## Get top 15 visitors
topk(15, sum by (client_ip, geo_country) (count_over_time({job="$fastly_cdn"} | json | __error__ = "" [30m])))
## Req per status code
sum by (response_status) (count_over_time({job="$fastly_cdn"} | json | response_status=~"200|401|403|500|501|503|0" | __error__="" [$__interval]))
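One variant we have been looking at (not benchmarked yet) is asking the json parser to extract only the field a panel actually aggregates on, instead of parsing every field of the log line, e.g. for the top-URL panel:
## Get top 15 URL, extracting only the url field
topk(15, sum by (url) (count_over_time({job="$fastly_cdn"} | json url="url" [30m])))
We are not sure how much this helps on its own, which is part of why we are asking.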
So, the questions are:
- Are there any recommendations for optimizing query performance? We aim to visualize 1 month of data, which at ~400k logs per 5 minutes works out to roughly 3.45 billion log lines.
- Are there any resource or architecture recommendations? Do we perhaps need more nodes?