Grafana Alloy / Loki log ingestion

I am using Grafana Alloy as a DaemonSet to process and forward logs to Loki running in simple scalable (SSD) mode (write, backend, read).
I am using local.file_match and loki.source.file.
The problem I am facing is somewhat bamboozling me.
For some containers the log files are tailed and data is ingested when Alloy starts up, but after the initial ingestion I don't see any further logs in Loki. On Alloy's end it is still tailing the log file, and sometimes it exits and restarts the tail a lot (with messages like the ones below), so I assumed it was a Loki issue. On Loki's end there are no errors related to ingestion; in fact, for that specific container and file path it shows streams being flushed at the same time Alloy started the tail. After that initial flush there is no more flushing of streams for that container, even though Alloy keeps flushing the streams of other containers. Much later there is the occasional flush at a random time, and for that second flush I do see logs, but only a limited number.
The same Alloy instance is working perfectly fine for other containers on the node.
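For context, the relevant pipeline looks roughly like this (a simplified sketch, not my exact config; the glob and relabeling rules are trimmed):

local.file_match "container_logs" {
  path_targets = [{"__path__" = "/var/log/containers/*.log"}]
}

loki.source.file "container_logs" {
  targets    = local.file_match.container_logs.targets
  forward_to = [loki.write.default.receiver]
}

loki.write "default" {
  endpoint {
    url = "http://loki-gateway.logging.svc.cluster.local/loki/api/v1/push"
  }
}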

Below is an example of how Alloy keeps tailing files, exiting the tails, and starting them again.

2025-03-19T12:53:35.471112951Z  level=info msg="tail routine: started" component_path=/ component_id=loki.source.file.container_logs component=tailer path=/var/log/containers/summary-generation-5564bc78d7-sxkbt_virtual-assistant_summary-generation-6492a39......

2025-03-19T12:53:35.471107722Z  level=info msg="Seeked /var/log/containers/summary-generation-5564bc78d7-sxkbt_virtual-assistant_summary-generation-6492a39..... component_path=/ component_id=loki.source.file.container_logs

2025-03-19T12:53:35.458198616Z msg="stopped tailing file" component_path=/ component_id=loki.source.file.container_logs component=tailer path=/var/log/containers/summary-generation-5564bc78d7-sxkbt_virtual-assistant_summary-generation-6492a39.......

2025-03-19T12:53:35.458188211Z msg="position timer: exited" component_path=/ component_id=loki.source.file.container_logs component=tailer path=/var/log/containers/summary-generation-5564bc78d7-sxkbt_virtual-assistant_summary-generation-6492a39........

2025-03-19T12:53:35.458175253Z  msg="tail routine: exited" component_path=/ component_id=loki.source.file.container_logs component=tailer path=/var/log/containers/summary-generation-5564bc78d7-sxkbt_virtual-assistant_summary-generation-6492a39......

2025-03-19T12:53:35.458152825Z  msg="tail routine: tail channel closed, stopping tailer" component_path=/ component_id=loki.source.file.container_logs component=tailer path=/var/log/containers/summary-generation-5564bc78d7-sxkbt_virtual-assistant_summary-generation-6492a39.... reason=<nil>

2025-03-19T12:53:25.469957777Z msg="tail routine: started" component_path=/ component_id=loki.source.file.container_logs component=tailer path=/var/log/containers/summary-generation-5564bc78d7-sxkbt_virtual-assistant_summary-generation-6492a39.....

2025-03-19T12:53:25.469852938Z  msg="Seeked /var/log/containers/summary-generation-5564bc78d7-sxkbt_virtual-assistant_summary-generation-6492a39.......

Since these are container logs, they get rotated and removed by the container log driver from time to time; I am not sure whether that's what's causing your problem.

What exactly is the issue you are experiencing?

The container is generating logs and Alloy keeps tailing the files even after rotation, but Loki is not showing the logs for some containers (such as the one in the path above).
Intermittently, Loki showed the logs of the above container after 9 days, and only for a 1-hour period.

This sounds strange. How are you deploying your Loki cluster? How big is the cluster? Can you share your Loki configuration?

Deploying it as SSD: read, write, and backend.
We ingest about 1-2 TB per month, so it's not big at all.
There are no resource issues; all components are running well within their allocated resources.

limits_config:
  retention_period: 720h
  reject_old_samples: true
  reject_old_samples_max_age: 720h
  max_cache_freshness_per_query: 10m
  split_queries_by_interval: 1h
  per_stream_rate_limit: 5242880
  per_stream_rate_limit_burst: 20971520
  cardinality_limit: 200000
  ingestion_burst_size_mb: 1200
  ingestion_rate_mb: 1200
  max_entries_limit_per_query: 500000
  max_label_value_length: 20480
  max_label_name_length: 10240
  max_label_names_per_series: 300
  max_global_streams_per_user: #added 3/6/2025
  max_line_size: 262144 #added 3/6/2025
  max_line_size_truncate: true #added 3/6/2025
  tsdb_max_query_parallelism: 128 #added 3/6/2025
  query_timeout: 400s
  volume_enabled: true
Using GCS storage as the backend.
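The object storage wiring is roughly along these lines (a minimal sketch; the bucket name is a placeholder, not my actual value):

common:
  storage:
    gcs:
      bucket_name: <logs-bucket>   # placeholder; the Loki service account needs storage.objects.* on this bucket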

I am getting one error from the backend, but I am not sure whether it would affect this:
msg="unable to list rules" err="googleapi: Error 403: **************************************** does not have storage.objects.list access to the Google Cloud Storage bucket. Permission 'storage.objects.list' denied on resource (or it may not exist)., forbidden"

Maybe, but regardless you should fix the permission issue.

I don't see anything obviously wrong. I think your first order of business is to fix the permissions, then try to narrow down where your problem is (is it Alloy or Loki?).

You said that Loki intermittently shows logs for some log streams. This is very specific behavior; can you reproduce it easily?
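One way to narrow it down (a rough sketch; it assumes loki.source.file's default filename label is still attached after your relabeling) is to query Loki directly for that file and check whether anything arrives after the initial burst:

sum(count_over_time({filename=~"/var/log/containers/summary-generation-.*"}[1h]))

If that stays at zero while Alloy reports the file as tailed, the entries are being lost between Alloy and the ingesters rather than at the source.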

Yeah, I am on the permissions issue.
Nevertheless, I did some digging and I see that the Alloy loki.write component is dropping bytes due to an "ingester error". One cause I found is that the ingestion rate is the issue:
ingestion_burst_size_mb: 1200
ingestion_rate_mb: 1200
I found two solutions:

  1. Increase these values; however, my loki-write is currently running with a 2 GB request and a 3 GB limit.
  2. Implement a write-ahead log.

Which one would you suggest (or another approach), and what other considerations would have to be taken into account for each?

I think the ingestion rate is purposely set low by default, and I would recommend the following:

  1. Increase both burst size and rate.
  2. Implement WAL on your Loki ingester.
  3. On Alloy, implement a back-off configuration (see the sketch after this list). This is not as critical, but it'll prevent you from losing logs when Loki can't take the load. You'll of course want to set up an alert for when Alloy starts to hit the back-off, otherwise you risk Alloy going down too. You should decide whether this is worth the effort in your use case.
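As a rough illustration only (the URL and values are placeholders, not your config), the retry behaviour sits in the loki.write endpoint block:

loki.write "default" {
  endpoint {
    url = "http://loki-gateway.logging.svc.cluster.local/loki/api/v1/push"

    // Retry for longer before a batch is dropped (values are illustrative).
    min_backoff_period  = "500ms"
    max_backoff_period  = "5m"
    max_backoff_retries = 20
  }
}

The trade-off is that the longer Alloy backs off and retries, the more it buffers in memory, which is why the alert on sustained back-off matters.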

I have increased the ingestion rate to 3.5 GB and the burst to 4 GB while also increasing the loki-write resources, and I have implemented the WAL, but I am still getting the same issue.

What error is your alloy agent giving you?

So Loki is not showing any dropped samples.
Alloy is showing the loki.write component dropping bytes due to an "ingester error"; that is all it says.
The specific Alloy logs intermittently show this error:

ts=2025-03-27T12:01:32.220067158Z level=error msg="final error sending batch" component_path=/ component_id=loki.write.default component=client host=loki-gateway.logging.svc.cluster.local status=413 tenant="" error="server returned HTTP status 413 Request Entity Too Large (413): "

BTW, for the WAL to be implemented, do I have to specify it under the loki.write.endpoint.wal block?
The documentation says it is experimental…

  1. Try adjusting per_stream_rate_limit: 100M and per_stream_rate_limit_burst: 200M as well (see the sketch after this list).
  2. Regarding the WAL, I was referring to the WAL on Loki, not on Alloy.
  3. If an HTTP request fails with 413, Loki should have logs for it. Do you have any reverse proxy in front of your Loki write containers?
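To make points 1 and 2 concrete, here is a minimal sketch of where they live in the Loki config; the values and the WAL directory are illustrative, not a drop-in for your deployment:

limits_config:
  per_stream_rate_limit: 100M
  per_stream_rate_limit_burst: 200M

ingester:
  wal:
    enabled: true
    dir: /loki/wal              # should point at a persistent volume on the write pods
    replay_memory_ceiling: 1GB  # cap memory used when replaying the WAL on restart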

Okay, let me try adjusting those values.
I am using the Loki gateway, but other than the default settings I haven't added anything, so I assume there isn't a reverse proxy.