Grafana Alloy / Loki log ingestion

I am using Grafana Alloy as a DaemonSet to process and forward logs to Loki running in simple scalable (SSD) mode (write, backend, read).
I am using local.file_match and loki.source.file.
The problem I am facing is somewhat bamboozling me.
For some containers the log files are tailed and data is ingested when Alloy starts up, but after the initial ingestion I don't see any further logs in Loki. On Alloy's end it is still tailing the log file, and sometimes it exits and restarts the tail a lot (with messages like the ones below), so I assumed it was a Loki issue. On Loki's end there are no errors related to ingestion; in fact, for that specific container and file path it shows streams being flushed at the same time Alloy started the tail. After that initial flush there is no more flushing of streams for that container, even though Alloy keeps flushing the streams of other containers. Much later there is the occasional flush at a random time, and for that second flush I do see logs, but only a limited number.
The same Alloy instance is working perfectly fine for other containers on the node.
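For context, the relevant pipeline looks roughly like this (a simplified sketch, not my exact config; the glob and relabeling rules are trimmed):

local.file_match "container_logs" {
  path_targets = [{"__path__" = "/var/log/containers/*.log"}]
}

loki.source.file "container_logs" {
  targets    = local.file_match.container_logs.targets
  forward_to = [loki.write.default.receiver]
}

loki.write "default" {
  endpoint {
    url = "http://loki-gateway.logging.svc.cluster.local/loki/api/v1/push"
  }
}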

Below is an example of how Alloy keeps tailing files, exiting the tails, and starting them again.

2025-03-19T12:53:35.471112951Z  level=info msg="tail routine: started" component_path=/ component_id=loki.source.file.container_logs component=tailer path=/var/log/containers/summary-generation-5564bc78d7-sxkbt_virtual-assistant_summary-generation-6492a39......

2025-03-19T12:53:35.471107722Z  level=info msg="Seeked /var/log/containers/summary-generation-5564bc78d7-sxkbt_virtual-assistant_summary-generation-6492a39..... component_path=/ component_id=loki.source.file.container_logs

2025-03-19T12:53:35.458198616Z msg="stopped tailing file" component_path=/ component_id=loki.source.file.container_logs component=tailer path=/var/log/containers/summary-generation-5564bc78d7-sxkbt_virtual-assistant_summary-generation-6492a39.......

2025-03-19T12:53:35.458188211Z msg="position timer: exited" component_path=/ component_id=loki.source.file.container_logs component=tailer path=/var/log/containers/summary-generation-5564bc78d7-sxkbt_virtual-assistant_summary-generation-6492a39........

2025-03-19T12:53:35.458175253Z  msg="tail routine: exited" component_path=/ component_id=loki.source.file.container_logs component=tailer path=/var/log/containers/summary-generation-5564bc78d7-sxkbt_virtual-assistant_summary-generation-6492a39......

2025-03-19T12:53:35.458152825Z  msg="tail routine: tail channel closed, stopping tailer" component_path=/ component_id=loki.source.file.container_logs component=tailer path=/var/log/containers/summary-generation-5564bc78d7-sxkbt_virtual-assistant_summary-generation-6492a39.... reason=<nil>

2025-03-19T12:53:25.469957777Z msg="tail routine: started" component_path=/ component_id=loki.source.file.container_logs component=tailer path=/var/log/containers/summary-generation-5564bc78d7-sxkbt_virtual-assistant_summary-generation-6492a39.....

2025-03-19T12:53:25.469852938Z  msg="Seeked /var/log/containers/summary-generation-5564bc78d7-sxkbt_virtual-assistant_summary-generation-6492a39.......

Since these are container logs, they get rotated and removed by the container log driver from time to time; I am not sure whether that's what's causing your problem.

What exactly is the issue you are experiencing?

The container is generating logs and Alloy keeps tailing the files even after rotation, but Loki is not showing the logs for some containers (such as the one in the path above).
Intermittently, Loki showed the logs of the above container after 9 days, and only for a 1-hour period.

This sounds strange. How are you deploying your Loki cluster? How big is the cluster? Can you share your Loki configuration?

Deploying it as SSD: read, write, and backend.
We ingest about 1-2 TB per month, so it's not big at all.
There are no resource issues; all components are running well within their allocated resources.

limits_config:
  retention_period: 720h
  reject_old_samples: true
  reject_old_samples_max_age: 720h
  max_cache_freshness_per_query: 10m
  split_queries_by_interval: 1h
  per_stream_rate_limit: 5242880
  per_stream_rate_limit_burst: 20971520
  cardinality_limit: 200000
  ingestion_burst_size_mb: 1200
  ingestion_rate_mb: 1200
  max_entries_limit_per_query: 500000
  max_label_value_length: 20480
  max_label_name_length: 10240
  max_label_names_per_series: 300
  max_global_streams_per_user: #added 3/6/2025
  max_line_size: 262144 #added 3/6/2025
  max_line_size_truncate: true #added 3/6/2025
  tsdb_max_query_parallelism: 128 #added 3/6/2025
  query_timeout: 400s
  volume_enabled: true
Using GCS storage as the backend.
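The object storage wiring is roughly along these lines (a minimal sketch; the bucket name is a placeholder, not my actual value):

common:
  storage:
    gcs:
      bucket_name: <logs-bucket>   # placeholder; the Loki service account needs storage.objects.* on this bucket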

I am getting one error from the backend, but I am not sure whether it would affect this:
msg="unable to list rules" err="googleapi: Error 403: **************************************** does not have storage.objects.list access to the Google Cloud Storage bucket. Permission 'storage.objects.list' denied on resource (or it may not exist)., forbidden"

Maybe, but regardless you should fix the permission issue.

I don't see anything obviously wrong. I think your first order of business is to fix the permissions, then try to narrow down where your problem is (is it Alloy or Loki?).

You said that Loki intermittently shows logs for some log streams. This is very specific behavior; can you reproduce it easily?
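One way to narrow it down (a rough sketch; it assumes loki.source.file's default filename label is still attached after your relabeling) is to query Loki directly for that file and check whether anything arrives after the initial burst:

sum(count_over_time({filename=~"/var/log/containers/summary-generation-.*"}[1h]))

If that stays at zero while Alloy reports the file as tailed, the entries are being lost between Alloy and the ingesters rather than at the source.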

Yeah, I am on the permissions issue.
Nevertheless, I did some digging and I see that the Alloy loki.write component is dropping bytes due to an "ingester error". One cause I found is that the ingestion rate is the issue:
ingestion_burst_size_mb: 1200
ingestion_rate_mb: 1200
I found two solutions:

  1. Increase these values; however, my loki-write is currently running with a 2 GB request and a 3 GB limit.
  2. Implement a write-ahead log.

Which one would you suggest (or another approach), and what other considerations would have to be taken into account for each?

I think the ingestion rate is purposely set low by default, and I would recommend the following:

  1. Increase both burst size and rate.
  2. Implement WAL on your Loki ingester.
  3. On Alloy, implement a back-off configuration (see the sketch after this list). This is not as critical, but it'll prevent you from losing logs when Loki can't take the load. You'll of course want to set up an alert for when Alloy starts to hit the back-off, otherwise you risk Alloy going down too. You should decide whether this is worth the effort in your use case.
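As a rough illustration only (the URL and values are placeholders, not your config), the retry behaviour sits in the loki.write endpoint block:

loki.write "default" {
  endpoint {
    url = "http://loki-gateway.logging.svc.cluster.local/loki/api/v1/push"

    // Retry for longer before a batch is dropped (values are illustrative).
    min_backoff_period  = "500ms"
    max_backoff_period  = "5m"
    max_backoff_retries = 20
  }
}

The trade-off is that the longer Alloy backs off and retries, the more it buffers in memory, which is why the alert on sustained back-off matters.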

I have increased the ingestion rate to 3.5 GB and the burst to 4 GB while also increasing the loki-write resources, and I have implemented the WAL, but I am still getting the same issue.

What error is your alloy agent giving you?

So Loki is not showing any dropped samples.
Alloy is showing the loki.write component dropping bytes due to an "ingester error"; that is all it says.
The specific Alloy logs intermittently show this error:

ts=2025-03-27T12:01:32.220067158Z level=error msg="final error sending batch" component_path=/ component_id=loki.write.default component=client host=loki-gateway.logging.svc.cluster.local status=413 tenant="" error="server returned HTTP status 413 Request Entity Too Large (413): "

BTW, for the WAL to be implemented, do I have to specify it under the loki.write.endpoint.wal block?
The documentation says it is experimental…

  1. Try adjusting per_stream_rate_limit: 100M and per_stream_rate_limit_burst: 200M as well (see the sketch after this list).
  2. Regarding the WAL, I was referring to the WAL on Loki, not on Alloy.
  3. If an HTTP request fails with 413, Loki should have logs for it. Do you have any reverse proxy in front of your Loki write containers?
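To make points 1 and 2 concrete, here is a minimal sketch of where they live in the Loki config; the values and the WAL directory are illustrative, not a drop-in for your deployment:

limits_config:
  per_stream_rate_limit: 100M
  per_stream_rate_limit_burst: 200M

ingester:
  wal:
    enabled: true
    dir: /loki/wal              # should point at a persistent volume on the write pods
    replay_memory_ceiling: 1GB  # cap memory used when replaying the WAL on restart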

Okay, let me try adjusting those values.
I am using the Loki gateway, but other than the default settings I haven't added anything, so I assume there isn't a reverse proxy.