Grafana Stack on AWS ECS - Performance Issues with Logstash and Loki Query Timeouts

Hi everyone,

I’ve deployed the following Grafana-based observability stack on AWS ECS:

Cloudflare Logpush → S3
S3 → Logstash
Logstash → Grafana Alloy
Alloy → Loki
Loki → Grafana

All components are running in ECS containers.

Issue 1: Logstash Processing Is Too Slow

Logstash is processing logs from S3 far too slowly. With the current configuration, it struggles to keep up with the volume of data being ingested daily (~250–300 million JSON logs/day, expected to scale up to ~1 billion/day). This has become a major bottleneck in the pipeline.
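For context, a quick back-of-envelope of the sustained rates these daily volumes imply (averages only; real traffic will be bursty):

```python
# Convert daily log volumes into average logs/second.
SECONDS_PER_DAY = 86_400

for logs_per_day in (300_000_000, 1_000_000_000):
    rate = logs_per_day / SECONDS_PER_DAY
    print(f"{logs_per_day:,}/day ≈ {rate:,.0f} logs/s")
# 300M/day is roughly 3,500 logs/s; 1B/day is roughly 11,600 logs/s sustained.
```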

Questions:

  • Are there tuning recommendations for Logstash when pulling data from S3 at this scale?
  • Would switching to another tool like Fluent Bit, Fluentd, or Vector improve throughput and performance from S3?
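For reference, the usual first knobs are the pipeline worker/batch settings in `logstash.yml`. The values below are purely illustrative, not a recommendation for this workload; note also that the `s3` input plugin reads objects sequentially within a pipeline, so a common workaround is to run several pipelines split by S3 key prefix.

```yaml
# logstash.yml — illustrative values only; tune against your instance size
pipeline.workers: 8        # parallel filter/output workers (default: CPU count)
pipeline.batch.size: 2000  # events per worker batch (default: 125)
pipeline.batch.delay: 50   # ms to wait when filling a batch
```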

Issue 2: Loki Query Timeout on Large Ranges

I’m running Loki in a Writer/Reader split mode:

2 Writer nodes
3 Reader nodes

Querying 1 hour of logs works fine (returns results in 2–3 minutes), but when I try querying a larger time window (e.g., 6 hours or more), it often results in a timeout.
Questions:

  • What are the best practices for scaling Loki for high-ingest and long-range querying?
  • Are there specific tunings I should apply to the querier or index gateway components?
  • Could chunk/index configuration be affecting performance?
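On the query side, long time ranges usually hinge on query splitting and parallelism. A sketch of the relevant Loki options is below; exact option names and defaults vary by Loki version, so treat these values as illustrative:

```yaml
# loki.yml fragment — illustrative values; check option names for your Loki version
limits_config:
  split_queries_by_interval: 30m   # a 6h query becomes 12 parallel subqueries
  max_query_parallelism: 32        # subqueries scheduled concurrently per query
querier:
  max_concurrent: 8                # concurrent subqueries per reader node
query_range:
  parallelise_shardable_queries: true
```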

Looking Ahead

Given the projected log volume (up to 1 billion logs/day), I’m open to re-architecting parts of the pipeline if needed. Any suggestions on more scalable and efficient designs or tooling are very welcome.

Thanks in advance!

Your question is too broad and sizing details are missing.

Ingestion path:

Cloudflare Logpush → S3
S3 → Logstash
Logstash → Grafana Alloy
Alloy → Loki

Why so complicated, when you can:

Cloudflare Logpush → S3
S3 event → Lambda → Loki

Of course, you need to keep Lambda concurrency limits in mind (and set them). It will very likely be more expensive (but you didn't mention any price constraints; your focus is on performance only), and it will scale automatically. I'd guess Loki could become the bottleneck then (but I won't comment on that).
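To make the suggestion concrete, here is a minimal sketch of such a Lambda, assuming Logpush writes gzipped NDJSON objects and that Loki's standard push endpoint (`/loki/api/v1/push`) is reachable from the function. The URL and label set are assumptions, not details from the thread:

```python
# Hypothetical Lambda: S3 "ObjectCreated" event -> push log lines to Loki.
import gzip
import json
import time
import urllib.request

LOKI_PUSH_URL = "http://loki:3100/loki/api/v1/push"  # assumed endpoint

def build_loki_payload(lines, labels):
    """Wrap raw log lines in the Loki push API JSON shape."""
    now_ns = str(time.time_ns())  # Loki expects nanosecond string timestamps
    return {
        "streams": [{
            "stream": labels,
            "values": [[now_ns, line] for line in lines],
        }]
    }

def handler(event, context):
    import boto3  # provided by the Lambda runtime
    s3 = boto3.client("s3")
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        text = gzip.decompress(body).decode() if key.endswith(".gz") else body.decode()
        lines = [line for line in text.splitlines() if line]
        payload = build_loki_payload(lines, {"job": "cloudflare-logpush"})
        req = urllib.request.Request(
            LOKI_PUSH_URL,
            data=json.dumps(payload).encode(),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req)
```

In practice you would batch lines and preserve per-line timestamps from the Logpush records rather than stamping them all with the arrival time.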

If we use Lambda, it's too costly; we need a balance between cost and performance.

Do you have any hard numbers to back that up, please? You have many options for optimizing Lambda costs (Graviton, batching, …).
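A simple cost model makes that conversation concrete. The prices below are illustrative placeholders (check the current AWS Lambda pricing page), and the invocation rate, duration, and memory are assumed example inputs:

```python
# Back-of-envelope monthly Lambda cost: compute (GB-seconds) + request charges.
# Prices are illustrative assumptions, not quoted AWS figures.
GB_SECOND_ARM = 0.0000133334   # assumed arm64 (Graviton) price per GB-second
PER_MILLION_REQUESTS = 0.20    # assumed price per 1M invocations

def monthly_lambda_cost(invocations_per_day, avg_duration_s, memory_gb):
    """Rough monthly cost in USD for a given invocation profile."""
    monthly_invocations = invocations_per_day * 30
    gb_seconds = monthly_invocations * avg_duration_s * memory_gb
    compute = gb_seconds * GB_SECOND_ARM
    requests = monthly_invocations / 1_000_000 * PER_MILLION_REQUESTS
    return compute + requests

# e.g. one invocation per Logpush object: 100k objects/day, 5 s each, 512 MB
print(round(monthly_lambda_cost(100_000, 5.0, 0.5), 2))
```

Batching (one invocation per S3 object rather than per log line) is what keeps the request charges negligible here; the compute term dominates and scales linearly with duration and memory.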