Today I noticed a bunch of errors like this in the loki-write pods:
org_id=fake msg="failed to flush" err="failed to flush chunks: store put chunk: SlowDown: A timeout exceeded while waiting to proceed with the request, please reduce your request rate\n\tstatus code: 503
Loki is writing to an S3 bucket, and even though the Loki datasource in Grafana seems to be working, there is a timeout when retrieving the labels.
First you should find out whether you are indeed being rate limited by S3. The SlowDown message usually means S3 is rate limiting you, but it's still good to confirm.
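One quick way to gauge how often this is happening is to count the SlowDown errors in the write path's recent logs. This is a sketch: the pod selector and namespace are assumptions that depend on how Loki was deployed, and the sample log line below is just there to demonstrate the filter.

```shell
# On Kubernetes you would pipe the real pod logs, e.g. (selector is an assumption):
#   kubectl logs -l app.kubernetes.io/component=write -n loki --since=1h | grep -c 'SlowDown'
# Demonstrating the filter on a sample line from the error above:
log='level=error org_id=fake msg="failed to flush" err="store put chunk: SlowDown: A timeout exceeded, status code: 503"'
printf '%s\n' "$log" | grep -c 'SlowDown'
```

A steadily climbing count during normal ingest is a strong hint the object store is pushing back, rather than a one-off network blip.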
If your Loki cluster is indeed rate limited by S3, here are a couple of things I can think of that might help:
Try to write fewer files, less frequently, by tweaking the target chunk size, chunk idle period, and max chunk age.
Try using multiple S3 buckets as chunk storage.
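The first suggestion maps to a few settings in the ingester block of the Loki config. The values below are illustrative placeholders, not recommendations; tune them against your own ingest volume:

```yaml
# Sketch of the ingester settings that control how often chunks are flushed.
ingester:
  chunk_idle_period: 30m      # flush a stream's chunk after this long with no new logs
  max_chunk_age: 2h           # hard cap on how long a chunk can stay open
  chunk_target_size: 1572864  # cut a chunk once it reaches ~1.5 MB compressed
```

Raising these means fewer, larger objects written to S3, at the cost of logs sitting longer in ingester memory before they are flushed.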
As for labels, you should find out why the query is timing out. I doubt it's rate limiting: S3 allows 5500 GET requests per second, and simply reading index files for labels should not come close to that.
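To isolate whether the timeout comes from Grafana or from Loki itself, you can hit Loki's label endpoint directly. The host and port are assumptions; point them at your loki-read or query-frontend service:

```shell
# Query the label API directly, bypassing Grafana, e.g.:
#   curl -s --max-time 30 "http://<loki-read>:3100/loki/api/v1/labels"
# This is the same endpoint the Grafana datasource uses for labels:
echo "/loki/api/v1/labels"
```

If this call is slow too, the problem is between Loki and its index storage rather than in Grafana.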
The S3 bucket is an object store configured on a FreeNAS installation, and I see no setting that would rate limit it. Is it better to write larger chunks less often or smaller chunks more often? And how do I configure multiple buckets as chunk storage?
I would double-check FreeNAS. There has to be rate limiting somewhere, but I've never used it, so maybe I am just wrong.
It's better to write reasonably sized chunks less often. This is a consideration for any distributed storage and computing system (think Hadoop and MapReduce). You don't want to write too many small files, or you incur too much network overhead; but you also don't want to write files that are too large, or you waste bandwidth (a certain percentage of each chunk read is never needed, either because it's filtered out or because the number of returned results is limited). I find Loki's defaults reasonable, but if you run into rate limiting, it's up to you to figure out your sweet spot.
If you look under aws_storage_config you'll see that bucketnames can be a comma-separated list of bucket names. I haven't used this myself, so I can't say for certain, but if you want to migrate smoothly from a single bucket to multiple buckets, you probably need to create a new schema period and specify the multi-bucket configuration there. If you aren't running a production cluster and are OK with losing all data, then you can probably just slap it on.
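Roughly, that would look like the sketch below. The bucket names, endpoint, date, and index store are all placeholders I made up; only the key names (bucketnames, s3forcepathstyle, schema_config) come from Loki's storage configuration:

```yaml
# Sketch: comma-separated bucketnames plus a new schema period for the cutover.
storage_config:
  aws:
    endpoint: http://freenas-host:9000   # placeholder for your S3-compatible endpoint
    bucketnames: loki-chunks-1,loki-chunks-2
    s3forcepathstyle: true
schema_config:
  configs:
    - from: 2024-01-01                   # placeholder: date the new period takes effect
      store: boltdb-shipper
      object_store: aws
      schema: v11
      index:
        prefix: index_
        period: 24h
```

Chunks written before the new period's start date would still be read from the old configuration, which is what makes the migration smooth.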
It looks like it was indeed a problem with the S3 server. For some reason its ingest capability was limited and it was not responding to queries. After a restart everything came back to normal.