We have a particular set of microservices that log huge amounts of text for regulatory reasons. We are using Loki in microservices mode, with separate ReplicaSets for the ingesters, queriers, gateway, distributor, frontend, etc.
I have inherited this deployment and have found that Loki can time out when querying a 0.5-hour time period in Grafana, despite the timeout being set to 1 minute.
Where can I find the best settings for a use case like this? Are there any concrete guidelines anywhere?
b0b
February 14, 2022, 8:51am
2
Hello,
Have a look(i) (sorry, could not resist…) at Label best practices | Grafana Loki documentation.
I think getting chunk_target_size right in particular would be a first step.
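For orientation, a minimal sketch of where that setting lives in the ingester block (the values are only illustrative; 1536000 bytes, roughly 1.5 MB compressed, is the target figure the docs mention, not something tuned for your workload):

ingester:
  # aim to fill chunks to ~1.5 MB compressed before flushing
  chunk_target_size: 1536000
  # flush chunks for streams that have been idle this long
  chunk_idle_period: 30m
  chunk_encoding: snappy

Larger, fuller chunks generally mean fewer objects to fetch from storage per query.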
Other posts I have looked at:
Hello!
Recently we’ve deployed Loki in the Kubernetes cluster using the loki-distributed chart. Currently the incoming amount of logs is about 10 GB per day - not a big amount, to be fair. This amount will only grow in the future. There seem to be no problems with log ingestion. The main problem we are facing is querying.
At first, when the amount of daily logs was really small, there were absolutely no problems executing filter queries like this over a long period, such as 24h or even more
{env…
I have 1 query frontend deployed on a VM, with 2 queriers each on a separate VM pulling queries from the frontend. Queries are executed in Grafana.
When executing queries (refreshing a dashboard) over a small time range everything works fine; however, at around a 12-hour time range I eventually get a 502 Bad Gateway.
Values I tried to edit:
In Loki:
querier:
  query_timeout: 10m
  engine:
    timeout: 10m
server:
  http_server_read_timeout: 10m
  http_server_write_timeout: 10m
In Grafana:
…
This is also from a post but I have only copied the text into my own Loki optimization doc…
increasing the parameter query_range.split_queries_by_interval to 24h decreases total time to 1-1.3 min for a 30-day range
(previously 4.5-5 min)
This is a config option for the query frontend. Not sure what a good value for your use case would be, though. Probably not 24h…
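If it helps, a rough sketch of where that option goes (the 30m here is only a placeholder; you would want to tune it to your volume):

query_range:
  # the frontend splits a long query into sub-queries of this length
  # and fans them out to the queriers in parallel
  split_queries_by_interval: 30m
  align_queries_with_step: true
  cache_results: true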
Hope that helps.
Thanks for your reply. Yes, I've already tried to implement everything stated in the docs; however, we still get inconsistent query results in terms of timeouts when selecting a 3-hour-plus time period.
Here is my config.
compactor:
  enabled: true
  resources:
    limits:
      cpu: 500m
      memory: 128Mi
    requests:
      cpu: 500m
      memory: 128Mi
  nodeSelector:
    lifecycle: spot
gateway:
  nodeSelector:
    lifecycle: spot
  replicas: 6
  ingress:
    enabled: true
    ingressClassName: redacted
    annotations:
      # for older clusters
      # kubernetes.io/ingress.class: redacted
      nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
      # this might help with timeouts, needs testing
      nginx.ingress.kubernetes.io/proxy-read-timeout: "600"
    hosts:
      - host: redacted
        paths:
          - path: /
            pathType: Prefix
    tls: []
  nginxConfig:
    httpSnippet: |
      proxy_read_timeout 600;
      proxy_connect_timeout 600;
      proxy_send_timeout 600;
loki:
  config: |
    auth_enabled: false
    compactor:
      shared_store: s3
    server:
      log_level: info
      http_listen_port: 3100
      # this stops "received message larger than max" errors
      # number is double the default, yolo
      grpc_server_max_recv_msg_size: 20730922
      grpc_server_max_send_msg_size: 20730922
    distributor:
      ring:
        kvstore:
          store: memberlist
    memberlist:
      join_members:
        - loki-distributed-memberlist
    ingester:
      lifecycler:
        ring:
          kvstore:
            store: memberlist
          replication_factor: 3
      chunk_target_size: 1536000
      chunk_idle_period: 15m
      chunk_block_size: 262144
      chunk_encoding: snappy
      chunk_retain_period: 5m
      max_transfer_retries: 0
      wal:
        enabled: true
        dir: /var/loki/wal/
        replay_memory_ceiling: 10GB
    limits_config:
      ingestion_rate_mb: 1000
      enforce_metric_name: false
      reject_old_samples: false
      reject_old_samples_max_age: 24h
      max_cache_freshness_per_query: 10m
      max_concurrent_tail_requests: 200
      max_query_parallelism: 96
      max_streams_per_user: 40000
      per_stream_rate_limit: 800MB
      cardinality_limit: 300000
    schema_config:
      configs:
        - from: 2020-09-07
          store: boltdb-shipper
          object_store: aws
          schema: v11
          index:
            prefix: loki_v2_index_
            period: 24h
    storage_config:
      index_queries_cache_config:
        enable_fifocache: false
        memcached:
          expiration: 24h
          batch_size: 100
          parallelism: 200
        memcached_client:
          consistent_hash: true
          host: loki-distributed-memcached-index-queries
          service: http
      boltdb_shipper:
        active_index_directory: /var/loki/indexv2
        shared_store: s3
        cache_location: /var/loki/cache
        cache_ttl: 168h
      filesystem:
        directory: /var/loki/chunks
      aws:
        s3: s3://redacted
        bucketnames: redacted
        sse_encryption: true
    chunk_store_config:
      chunk_cache_config:
        enable_fifocache: false
        memcached:
          expiration: 2h
          batch_size: 100
          parallelism: 200
        memcached_client:
          consistent_hash: true
          host: loki-distributed-memcached-chunks
          service: http
      max_look_back_period: 0s
    table_manager:
      retention_deletes_enabled: false
      retention_period: 0s
    query_range:
      align_queries_with_step: true
      max_retries: 5
      split_queries_by_interval: 10m
      parallelise_shardable_queries: true
      cache_results: true
      results_cache:
        cache:
          enable_fifocache: true
          memcached_client:
            consistent_hash: true
            host: loki-distributed-memcached-chunks
            max_idle_conns: 16
            service: http
            timeout: 500ms
            update_interval: 1m
    querier:
      query_ingesters_within: 3h
      query_timeout: 10m
      tail_max_duration: 24h
    frontend_worker:
      frontend_address: loki-distributed-query-frontend:9095
      parallelism: 6 # 6 cores available
    frontend:
      log_queries_longer_than: 30s
      compress_responses: true
      max_outstanding_per_tenant: 1024
distributor:
  replicas: 4
  resources:
    limits:
      cpu: 500m
      memory: 256Mi
    requests:
      cpu: 500m
      memory: 256Mi
  nodeSelector:
    lifecycle: spot
ingester:
  persistence:
    enabled: true
    storageClass: gp2
    size: 30G
  resources:
    limits:
      cpu: 2000m
      memory: 14Gi
    requests:
      cpu: 2000m
      memory: 14Gi
  replicas: 6
  nodeSelector:
    lifecycle: spot
memcachedChunks:
  enabled: true
  replicas: 8
  extraArgs:
    - -m 19000
    - -I 10m
    - -vvv
  resources:
    requests:
      cpu: 1000m
      memory: 20Gi
    limits:
      cpu: 1000m
      memory: 20Gi
  nodeSelector:
    lifecycle: spot
memcachedIndexQueries:
  replicas: 4
  resources:
    requests:
      cpu: 500m
      memory: 2Gi
    limits:
      cpu: 500m
      memory: 2Gi
  enabled: true
  nodeSelector:
    lifecycle: spot
memcachedExporter:
  enabled: true
  resources:
    limits:
      cpu: 100m
      memory: 50Mi
    requests:
      cpu: 100m
      memory: 50Mi
queryFrontend:
  replicas: 3
  resources:
    requests:
      cpu: 200m
      memory: 256Mi
    limits:
      cpu: 200m
      memory: 256Mi
  nodeSelector:
    lifecycle: spot
querier:
  resources:
    requests:
      cpu: 2000m
      memory: 2Gi
    limits:
      cpu: 6000m
      memory: 2Gi
  replicas: 16
  nodeSelector:
    lifecycle: spot
memcachedFrontend:
  nodeSelector:
    lifecycle: spot
serviceMonitor:
  enabled: true
  labels:
    release: kube-prometheus-stack
  interval: 30s
serviceAccount:
  create: true
  name: "loki-distributed"
  annotations:
    eks.amazonaws.com/role-arn: redacted
Query example:
count_over_time({job="logging/logtest"} [1s])
for a 3-24 hour period.
Something like this will often time out with a 502 when run via logcli, or a 504 when run via Grafana.
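One thing I would double-check (just a sketch of the idea, not a verified fix for this setup): a 502/504 usually means some hop in front of the queriers gives up first, so every timeout in the chain - the Grafana data source timeout, the gateway/ingress proxy timeouts, and the Loki server and querier timeouts - has to be at least as long as the slowest query you expect. Roughly, on the Loki side (the 10m figure is only an illustration, chosen to match the querier timeout already in the config above):

server:
  # the HTTP server has its own, much shorter default timeout, so long-running
  # queries can be cut off here even though querier.query_timeout is set higher
  http_server_read_timeout: 10m
  http_server_write_timeout: 10m
querier:
  query_timeout: 10m
  engine:
    timeout: 10m

with the gateway proxy_read_timeout and the Grafana data source timeout set to the same value or higher.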