Hello all.
I’ve recently been encountering a weird issue that I’m not sure how to debug. I’m currently on the free Grafana Cloud tier, which I use to monitor a couple of machines in my homelab.
Nothing fancy, just some self-hosted applications. For the last couple of months my log usage has skyrocketed, and I burn through the 50 GB included in the free tier within a couple of days.
I went to check the grafanacloud-[instanceName]-usage-insights datasource with the following query:
{instance_type="logs"} |= "path=write"
and noticed an excessive number of errors. Here are two examples of the kind of entries I find:
caller=manager.go:49 component=distributor path=write insight=true msg="write operation failed" details="Ingestion rate limit exceeded for user XXXX (limit: 0 bytes/sec) while attempting to ingest '55' lines totaling '5908' bytes, reduce log volume or contact your Loki administrator to see if the limit can be increased" org_id=XXX
caller=manager.go:49 component=distributor path=write insight=true msg="write operation failed" details="entry for stream '{container=\"caddy\", instance=\"XXX\", job=\"integrations/docker\", service_name=\"caddy\", stream=\"stderr\"}' has timestamp too old: 2024-05-13T12:43:36Z, oldest acceptable timestamp is: 2024-05-26T12:51:39Z" org_id=XXXX
Now the entries seem pretty self-explanatory: it looks like my instances are generating logs too fast, some of the entries carry timestamps that are too old, and they get batched up and then rate limited.
However, I have absolutely no idea how to debug this to understand what is going on.
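The only thing I could think of so far is trying to quantify how often this happens. I assume a metric query along these lines against the same usage-insights datasource would count the rate-limit errors per hour (the filter string is just copied from the error above, so I may well be missing other variants):
sum(count_over_time({instance_type="logs"} |= "Ingestion rate limit exceeded" [1h]))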
I’m mainly running containers on both machines, nothing fancy; as you can see, one of them is Caddy, which I use as a reverse proxy to reach other internal containers.
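I was also wondering whether a query like the one below, run against my regular logs datasource, would show which container is producing the most lines (I’m assuming the container/instance labels from the Docker integration, as they appear in the error above), but I’m not sure this is the right approach:
sum by (container, instance) (count_over_time({job="integrations/docker"}[1h]))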
Any idea what I could try?