Hi everyone,
I’ve deployed the following Grafana-based observability stack on AWS ECS:
Cloudflare Logpush → S3 → Logstash → Grafana Alloy → Loki → Grafana
All components are running in ECS containers.
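For context, the Alloy stage is just a thin receive-and-forward hop: Logstash pushes entries to Alloy over the Loki push API, and Alloy writes them on to Loki. A minimal sketch of that piece (simplified from what's actually deployed; the listen port and Loki URL are placeholders):

```river
// Receive log entries pushed by Logstash over the Loki push API.
loki.source.api "logstash" {
  http {
    listen_address = "0.0.0.0"
    listen_port    = 3500          // placeholder port
  }
  forward_to = [loki.write.default.receiver]
}

// Forward everything to the Loki write path.
loki.write "default" {
  endpoint {
    url = "http://loki-write:3100/loki/api/v1/push"   // placeholder endpoint
  }
}
```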
Issue 1: Logstash Is Processing S3 Logs Too Slowly
Logstash is processing logs from S3 far too slowly. With the current configuration, it struggles to keep up with the ingest volume (~250–300 million JSON logs/day, expected to grow to ~1 billion/day). This has become a major bottleneck in the pipeline.
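The read side is just the s3 input plugin; a simplified version of the pipeline looks roughly like this (bucket, prefix, region, and the output endpoint are placeholders, not the real values):

```
# Simplified pipeline config; names and endpoints are placeholders.
input {
  s3 {
    bucket   => "cloudflare-logpush"   # placeholder bucket
    region   => "us-east-1"
    prefix   => "http_requests/"
    codec    => "json"                 # Logpush objects are newline-delimited JSON; the plugin handles .gz
    interval => 30                     # seconds between bucket listings
  }
}

output {
  loki {
    url => "http://alloy:3500/loki/api/v1/push"   # placeholder: Alloy's push receiver
  }
}
```

My understanding is that each s3 input instance lists and downloads objects in a single thread, so raising pipeline.workers / pipeline.batch.size in logstash.yml mostly speeds up the filter/output stages rather than the S3 read path. Happy to be corrected if that's wrong.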
Questions:
- Are there tuning recommendations for Logstash when pulling data from S3 at this scale?
- Would switching to another tool like Fluent Bit, Fluentd, or Vector improve throughput when reading from S3? (the kind of setup I mean is sketched below)
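On the Vector option specifically, the setup I have in mind would replace bucket polling with S3 → SQS event notifications, roughly like this (region, queue URL, and endpoints are placeholders, and I haven't validated this config):

```toml
# vector.toml - sketch only, not validated
[sources.cloudflare_s3]
type = "aws_s3"
region = "us-east-1"                                   # placeholder
# The aws_s3 source is driven by S3 -> SQS event notifications,
# so new Logpush objects are picked up as they land instead of by polling.
sqs.queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/logpush-events"  # placeholder

[sinks.loki]
type = "loki"
inputs = ["cloudflare_s3"]
endpoint = "http://alloy:3500"                         # placeholder: Alloy's Loki-compatible receiver (or Loki directly)
encoding.codec = "json"
labels.source = "cloudflare"
```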
Issue 2: Loki Query Timeout on Large Ranges
I’m running Loki with a writer/reader split (simple scalable deployment):
- 2 writer nodes
- 3 reader nodes
Querying 1 hour of logs works fine (returns results in 2–3 minutes), but when I try querying a larger time window (e.g., 6 hours or more), it often results in a timeout.
Questions:
- What are the best practices for scaling Loki for high-ingest and long-range querying?
- Are there specific tunings I should apply to the querier or index gateway components? (the kind of settings I mean are sketched after these questions)
- Could chunk/index configuration be affecting performance?
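To make that question concrete, these are the kinds of read-path knobs I mean. A minimal sketch, assuming a recent Loki (2.9+/3.x); the values are illustrative, not what's currently deployed:

```yaml
# Loki config fragment - illustrative values only, not what's currently deployed
limits_config:
  split_queries_by_interval: 30m   # break long ranges into sub-queries the readers can run in parallel
  max_query_parallelism: 32
  query_timeout: 5m

querier:
  max_concurrent: 8                # sub-queries each reader executes at once

query_range:
  parallelise_shardable_queries: true
  cache_results: true
  results_cache:
    cache:
      embedded_cache:
        enabled: true
        max_size_mb: 500

server:
  http_server_read_timeout: 300s   # so the HTTP layer doesn't cut off long-running queries
  http_server_write_timeout: 300s
```

As I understand it, the splitting/parallelism only pays off when the query-frontend sits in front of the queriers so sub-queries fan out across all three readers, which the read target should already include in this deployment mode.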
Looking Ahead
Given the projected log volume (up to 1 billion logs/day), I’m open to re-architecting parts of the pipeline if needed. Any suggestions on more scalable and efficient designs or tooling are very welcome.
Thanks in advance!