Our load balancer writes access logs, which Promtail ships to Loki.
Today one of our big clients accessed the load balancer 64 times more often than usual. We have a dashboard that displays statistics from the load balancer log file. Under the abnormal load, the execution time of the dashboard's metric queries took a significant hit and the dashboard was nearly unusable.
We are looking for a way to prevent this from happening again, especially since, despite the client's extreme load, the rest of our infrastructure worked as expected and nothing malfunctioned other than Loki.
We are running in a limited on-premises environment, so simply adding CPU cores to Loki isn't an option for us. This behavior is also unacceptable because the dashboard was created mainly to identify abnormal client load; it defeats its purpose if abnormal load makes it unusable.
We have a few ideas on how to solve this:
- Configure a rate limit in Promtail or a stream-specific rate limit in Loki. However, it seems like no such option is available yet.
- Use FluentBit or FluentD, which have a rate limit option.
- Add dynamic labels. We currently use only static labels, but for nodes that write many logs and whose logs are queried in dashboards, dynamic labels may be unavoidable.
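To make the rate limit idea concrete, this is roughly the behavior we have in mind: a token bucket per stream that drops lines once a stream exceeds its budget, so one noisy client cannot flood the log store. This is only a minimal Python sketch of the concept, not an existing Promtail, Loki, FluentBit, or FluentD option. The names `TokenBucket` and `limit_lines`, and the `(timestamp, stream, line)` entry shape, are hypothetical.

```python
class TokenBucket:
    """Allows bursts of up to `burst` lines, refilled at `rate` lines per second."""

    def __init__(self, rate, burst):
        self.rate = rate            # tokens (lines) added per second
        self.burst = burst          # maximum tokens the bucket can hold
        self.tokens = float(burst)  # start with a full bucket
        self.last = None            # timestamp of the previous call

    def allow(self, now):
        # Refill based on the time elapsed since the previous call.
        if self.last is not None:
            elapsed = max(0.0, now - self.last)
            self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        # Spend one token per line; drop the line if the bucket is empty.
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def limit_lines(entries, rate, burst):
    """Keep at most `rate` lines/s (plus bursts) per stream; drop the rest.

    `entries` is an iterable of (timestamp_seconds, stream, line) tuples
    (a hypothetical shape chosen for this sketch).
    """
    buckets = {}
    kept = []
    for ts, stream, line in entries:
        if stream not in buckets:
            buckets[stream] = TokenBucket(rate, burst)
        if buckets[stream].allow(ts):
            kept.append((ts, stream, line))
    return kept
```

The trade-off is that dropped lines make the dashboard undercount the noisy client, so a limiter like this would probably need to be paired with a counter of dropped lines to still show that the load was abnormal.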
Has anyone experienced a scenario like this? What would you recommend?