Promtail always has a 5s delay before finding scrape targets

Hello,

What is the expected delay between starting promtail and it discovering the files to scrape? When starting a promtail container, there seems to always be a 5s delay before any files are scraped. This is also evident in the start-up logs, where there’s always a 5s delay between the msg="Adding target" log and the log entry before it.

This doesn’t appear to be influenced by the values of file_watch_config. Here is a reproducer:

promtail_config.yaml:

global:
  file_watch_config:
    min_poll_frequency: 250ms
    max_poll_frequency: 250ms
client:
  url: "http://localhost/"
server:
  http_listen_port: 3000
positions:
  filename: positions.yaml
scrape_configs:
- job_name: local
  static_configs:
  - labels:
      __path__: "/logs/*"

Running the container:

docker run \
  -v $(pwd)/promtail_config.yaml:/etc/promtail/config.yaml \
  -v /dev/null:/logs/logfile \
  --rm -p 3000:3000 \
  grafana/promtail:3.1.0 \
  -config.file /etc/promtail/config.yaml

Resulting logs (note the timestamp of the msg="Adding target" line):

level=info ts=2024-08-02T14:43:37.170161333Z caller=promtail.go:133 msg="Reloading configuration file" md5sum=aed4c5f0a9f7b415a38a04d653ac7145
level=warn ts=2024-08-02T14:43:37.1702245Z caller=config.go:70 msg="use of CLI client.* and config file Client block are both deprecated in favour of the config file Clients block and will be removed in a future release"
level=info ts=2024-08-02T14:43:37.170886167Z caller=server.go:352 msg="server listening on addresses" http=[::]:3000 grpc=[::]:9095
level=info ts=2024-08-02T14:43:37.170969375Z caller=main.go:173 msg="Starting Promtail" version="(version=3.1.0, branch=HEAD, revision=935aee77ed)"
level=warn ts=2024-08-02T14:43:37.171013208Z caller=promtail.go:263 msg="enable watchConfig"
level=info ts=2024-08-02T14:43:42.171635336Z caller=filetargetmanager.go:372 msg="Adding target" key=/logs/*:{}
level=info ts=2024-08-02T14:43:42.172528711Z caller=filetarget.go:343 msg="watching new directory" directory=/logs
level=info ts=2024-08-02T14:43:42.172995377Z caller=tailer.go:147 component=tailer msg="tail routine: started" path=/logs/logfile
ts=2024-08-02T14:43:42.173121086Z caller=log.go:168 level=info msg="Seeked /logs/logfile - &{Offset:0 Whence:0}"

Is this delay expected, and if so, can it be shortened with configuration? If the delay is not expected, I’m happy to open a bug report in GitHub instead.

Thanks in advance!

I don’t think there is any sort of forced delay; more likely it’s some kind of prep or init step. I’d have to look at the code to be sure.

Could you elaborate on why a 5-second delay is problematic for you? Also, if you were thinking of contributing to promtail, I believe it’s no longer being actively updated, since Grafana is moving to Alloy as the default agent.

Thanks for looking, @tonyswumac! I suspect the 5s delay is driven by some specific timer, because it is consistently close to 5000ms with very little variance, which I wouldn’t expect if the delay had some non-timer-driven cause.

RE motivation: we use HashiCorp Nomad as our orchestration platform, with a promtail container as a sidecar task inside our allocations (Nomad’s equivalent of K8s Pods); this pattern is analogous to using sidecar containers in K8s.

The issue with a delay is that, for short-lived allocations (either because they are crashing on start-up, or because they are periodic jobs (Nomad’s CronJob equivalent) that start and finish in under 5s), the promtail task does not have enough time to scrape/send logs before it receives its shutdown signal.

We are working around this at the moment by building our own promtail image that overrides the ENTRYPOINT with a script enforcing a minimum 5s runtime before passing any shutdown signals along to promtail’s PID. Ideally, though, we would prefer to use the official image rather than create and maintain our own, and the artificial delay means we cannot have allocations that live for less than 5s, which limits the kinds of workflows we can build.
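
For reference, a rough sketch of the kind of wrapper script we use (the exact script, the 5s floor, and the promtail binary path are illustrative assumptions, not anything from the official image):

#!/bin/sh
# Hypothetical entrypoint wrapper: start promtail, but hold back SIGTERM/SIGINT
# until at least MIN_RUNTIME seconds have elapsed since start-up.
MIN_RUNTIME=5
START="$(date +%s)"

# Binary path assumed to match the official grafana/promtail image.
/usr/bin/promtail "$@" &
PID=$!

forward_term() {
  ELAPSED=$(( $(date +%s) - START ))
  if [ "$ELAPSED" -lt "$MIN_RUNTIME" ]; then
    sleep $(( MIN_RUNTIME - ELAPSED ))
  fi
  kill -TERM "$PID"
}
trap forward_term TERM INT

# The first wait is interrupted when a signal arrives and the trap fires;
# in that case, wait again so we collect promtail's real exit code.
wait "$PID"
STATUS=$?
if [ "$STATUS" -gt 128 ]; then
  wait "$PID"
  STATUS=$?
fi
exit "$STATUS"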

Good to know about Alloy! We are also exploring it, but I wanted to know if there was an interim solution we could put in place before doing a full-fledged migration to Alloy (I’m assuming, but haven’t checked, that Alloy does not have a similar delay).

I’d say give Alloy a try and see if that’s better.

This is just my opinion. In general I dislike relying on a log pipeline that could be ephemeral. With a sidecar logging agent you always run the risk of logs being lost once the containers are gone. I personally much prefer to have logs stored on something durable as soon as possible (for example, via Docker’s json-file log driver), and then have a log agent configured separately from the containers (for example, a DaemonSet in K8s) pick those up; a sidecar would be my least preferred solution.

We have some other reasons for using sidecars instead of system jobs (Nomad’s version of DaemonSets), but it’s definitely good to be aware of the trade-offs involved. We’ll be experimenting with Alloy in the short term to see if it has the same issue.

So that I understand correctly: is it fair to say that the cause of the 5s delay in the promtail image is currently unknown and unexpected, and that if we see a similar issue in Alloy, we can treat/escalate it as a bug in Alloy?

I would move from the approach of
“let something discover me and scrape my metrics/logs”
to
“I push my metrics/logs/traces/…”
as that fits short-lived tasks better.

I would recommend the OpenTelemetry approach: deploy the collector as a sidecar (configured to push logs to your Loki) and have the instrumented app send logs to that sidecar container. Of course there are many other ways this can be configured/deployed, so just pick the option that fits your case/needs.
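
For illustration, a minimal sketch of such a sidecar collector config (hostnames, ports, and the Loki OTLP path are placeholders to adapt, not a drop-in config): the app pushes logs over OTLP to the sidecar, and the sidecar forwards them to Loki’s OTLP endpoint.

receivers:
  otlp:
    protocols:
      grpc:
        # the app sends its logs here via OTLP
        endpoint: 0.0.0.0:4317

exporters:
  otlphttp:
    # assumed Loki 3.x OTLP ingest endpoint; adjust host/port for your setup
    endpoint: http://loki:3100/otlp

service:
  pipelines:
    logs:
      receivers: [otlp]
      exporters: [otlphttp]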

I think this is fair; at the very least you’ll get more eyes on Alloy than on promtail.

Thanks for your suggestion @jangaraj! This is similar to the setup we have, except for the part that directly instruments the code that generates logs. We have use-cases where some of the code that is generating logs is third-party, and therefore there isn’t an opportunity to instrument it directly. Having a sidecar that asynchronously watches for logs allows us to keep a loose coupling between code (ours or third-party) and the observability platform.

What about a log driver? E.g. Configure logging drivers | Docker Docs
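
For example, a rough sketch using Grafana’s Loki Docker driver plugin (the Loki URL and the app image are placeholders; the plugin has to be installed on each Docker host first):

# Install the Loki log driver plugin once per Docker host
docker plugin install grafana/loki-docker-driver:latest --alias loki --grant-all-permissions

# Run a container whose stdout/stderr is shipped directly to Loki
docker run --log-driver=loki \
  --log-opt loki-url="http://loki:3100/loki/api/v1/push" \
  your-app-image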

Tony also alluded to this option earlier, and while it would work around the 5s delay issue, we have other reasons for using a sidecar here (e.g. tagging logs with information that would not be available at the logging-driver level, auto-correlation between logs and traces, etc.). Thanks for the suggestion!

OK, so IMHO you need to eliminate that slow FS discovery.
I would try/test logging via a socket (the app writes logs to the socket, and the sidecar reads logs from it), perhaps at the TCP level via the syslog protocol; that may require modifying that third-party app/container.
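
For example, a rough sketch of promtail’s syslog listener for such a push setup (the port and labels are illustrative; promtail expects RFC5424-formatted syslog by default):

scrape_configs:
- job_name: syslog
  syslog:
    # the app/container pushes syslog messages to this TCP address
    listen_address: 0.0.0.0:1514
    labels:
      job: syslog
  relabel_configs:
  - source_labels: ['__syslog_message_app_name']
    target_label: app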
