I am testing out a simple Loki configuration based on 2-S3-Cluster-Example.yaml
It works well locally (using MinIO for storage) but behaves differently on AWS instances: queries time out right after the Loki Docker container starts, and again after a period of inactivity (by inactivity I mean a period where I am continuously ingesting data through Promtail but not running any queries). This goes on for roughly 5 minutes, after which queries simply start working. The query range does not matter; it behaves exactly the same for 6h and for 30d. Log volume is fairly small, a couple of tens of MB in total, so small range queries can be processing only tens or hundreds of KB and still time out.
Pasting my configuration below:
auth_enabled: false

server:
  http_listen_port: 3100

common:
  ring:
    instance_addr: 127.0.0.1
    kvstore:
      store: inmemory
  instance_addr: 127.0.0.1
  replication_factor: 1

ingester:
  max_chunk_age: 72h
  wal:
    dir: /loki/wal

schema_config:
  configs:
    - from: "2020-10-24"
      index:
        period: 24h
        prefix: index_
      object_store: s3
      schema: v13
      store: tsdb

storage_config:
  tsdb_shipper:
    active_index_directory: /loki/tsdb-index
    cache_location: /loki/tsdb-cache
  aws:
    s3: s3://<<REGION>>/<<BUCKET_NAME>>
    s3forcepathstyle: true

query_scheduler:
  # the TSDB index dispatches many more, but each individually smaller, requests.
  # We increase the pending request queue sizes to compensate.
  max_outstanding_requests_per_tenant: 32768

querier:
  # Each `querier` component process runs a number of parallel workers to process queries simultaneously.
  # You may want to adjust this up or down depending on your resource usage
  # (more available cpu and memory can tolerate higher values and vice versa),
  # but we find the most success running at around `16` with tsdb
  max_concurrent: 16

compactor:
  working_directory: /loki/compactor
  compaction_interval: 10m
  retention_enabled: true
  retention_delete_delay: 2160h
  retention_delete_worker_count: 150
  delete_request_store: s3

limits_config:
  reject_old_samples: false
  max_query_length: 2160h # Default: 721h
  retention_period: 2160h
  ingestion_rate_mb: 64
  ingestion_burst_size_mb: 128
  per_stream_rate_limit: 8MB
  per_stream_rate_limit_burst: 32MB
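For completeness: I have not overridden any of Loki's own HTTP or query timeouts, so they are whatever grafana/loki:3.2.0 ships as defaults. If I had set them explicitly, it would look roughly like this sketch (the values shown here are placeholders, not anything I run):

# Not part of my config; shown only to make explicit that these knobs are untouched.
server:
  http_server_read_timeout: 30s
  http_server_write_timeout: 30s
limits_config:
  query_timeout: 5m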
I am using Loki as a datasource for Grafana (running on the same instance).
Labels are few and low-cardinality, so I don't think that should be an issue (I've got 4 indexed labels: environment, job, service_name, status).
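For reference, this is roughly how Promtail attaches those labels; the paths, the values, and the json pipeline stage below are simplified/hypothetical rather than my exact scrape config:

scrape_configs:
  - job_name: nginx
    static_configs:
      - targets: [localhost]
        labels:
          environment: production        # illustrative value
          job: nginx
          service_name: nginx
          __path__: /var/log/nginx/*.log # hypothetical path
    pipeline_stages:
      - json:
          expressions:
            status: status
      - labels:
          status: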
I really hope I am doing something wrong in the config above. I haven't changed any timeouts in the Grafana data source config yet, as this shouldn't happen at such low data volumes and I don't want to mask the issue that way.
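Whether provisioned from a file or set up through the UI, the data source is effectively equivalent to this minimal provisioning sketch (URL assumed, since Grafana and Loki share the instance):

apiVersion: 1
datasources:
  - name: Loki
    type: loki
    access: proxy
    url: http://localhost:3100
    # no jsonData here, i.e. no timeout overrides on the Grafana side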
How To Reproduce
Steps to reproduce the behavior:
- Started Loki (grafana/loki:3.2.0); the container setup is roughly the compose sketch after these steps
- Tried a simple query such as `{job="nginx"}` over the last 24 hours (which processed 139.3 KiB of data in my case)
- Got `loki net/http: request canceled (Client.Timeout exceeded while awaiting headers)` after 30s
- Retried a few times over the next 5 minutes until it started working, after which any query range works fine (e.g. 30d, which processed 30.1 MiB of data in my case)
- Closed Grafana and waited around half a day
- Got the exact same behavior starting at step 2
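The container setup referenced in the first step is roughly equivalent to this compose sketch (file names and the named volume are placeholders; S3 credentials come from the instance profile in my case):

services:
  loki:
    image: grafana/loki:3.2.0
    command: -config.file=/etc/loki/loki-config.yaml
    ports:
      - "3100:3100"
    volumes:
      - ./loki-config.yaml:/etc/loki/loki-config.yaml:ro
      - loki-data:/loki   # backs /loki/wal, /loki/tsdb-index, /loki/tsdb-cache, /loki/compactor
volumes:
  loki-data: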
Expected behavior:
Queries (especially small ones that process just a couple of thousand log lines) should complete without timing out.
Environment:
- Infrastructure: AWS t2.xlarge instance, with Loki running inside a Docker container
- Docker image version: grafana/loki:3.2.0
- Grafana version: v10.1.1
- Storage: S3 bucket in the same region as the instance