Docker logs have timestamp too old

Hello all,

For the past few weeks I’ve been having an issue with Grafana Alloy.
I’m running it on a few machines as part of my home lab setup, where I want to collect the logs of some Docker containers. I’ve noticed that, after a few days, my Grafana Cloud instance starts to receive a high volume of logs marked as being too old, which are then discarded.
This in turn contributes to the total log ingestion, which starts to grow completely out of control.
It appears as if Alloy tries to send old logs multiple times, because otherwise these Docker containers would not produce such a high volume of logs.
What seems to bring the situation under control is to stop the containers for a few minutes and restart them. This brings the crazy log ingestion down for a few days, until it slowly starts growing again.
Before using Alloy I was using the Grafana Agent and did not have this issue. As the Agent will reach end of life at some point, I thought I’d start migrating to Alloy sooner rather than later.

I’ve checked the GitHub issue tracker and the forum but couldn’t really find anything similar to this situation. Any idea what might be happening?

Here is my Alloy configuration
          // Prometheus
          prometheus.exporter.self "integrations_alloy" { }

          discovery.relabel "integrations_alloy" {
            targets = prometheus.exporter.self.integrations_alloy.targets
          
            rule {
              target_label = "instance"
              replacement  = constants.hostname
            }
          
            rule {
              target_label = "alloy_hostname"
              replacement  = constants.hostname
            }
          
            rule {
              target_label = "job"
              replacement  = "integrations/alloy-check"
            }
          }
          
          prometheus.scrape "integrations_alloy" {
            targets    = discovery.relabel.integrations_alloy.output
            forward_to = [prometheus.relabel.integrations_alloy.receiver]
          
            scrape_interval = "60s"
          }
          
          prometheus.relabel "integrations_alloy" {
            forward_to = [prometheus.remote_write.metrics_service.receiver]
          
            rule {
              source_labels = ["__name__"]
              regex         = "(prometheus_target_sync_length_seconds_sum|prometheus_target_scrapes_.*|prometheus_target_interval.*|prometheus_sd_discovered_targets|alloy_build.*|prometheus_remote_write_wal_samples_appended_total|process_start_time_seconds)"
              action        = "keep"
            }
          }
          
          prometheus.remote_write "metrics_service" {
            endpoint {
              url = "{{ prometheus_url }}"
              basic_auth {
                username = "{{ prometheus_username }}"
                password = "{{ prometheus_api_key }}"
              }
            }
          }
          
          loki.write "grafana_cloud_loki" {
            endpoint {
              url = "{{ loki_url }}"
              basic_auth {
                username = "{{ loki_username }}"
                password = "{{ loki_api_key }}"
              }
            }
          }
          discovery.relabel "integrations_node_exporter" {
            targets = prometheus.exporter.unix.integrations_node_exporter.targets
          
            rule {
              target_label = "instance"
              replacement  = constants.hostname
            }
          
            rule {
              target_label = "job"
              replacement = "integrations/node_exporter"
            }
          }
          
          prometheus.exporter.unix "integrations_node_exporter" {
            disable_collectors = ["ipvs", "btrfs", "infiniband", "xfs", "zfs"]
          
            filesystem {
              fs_types_exclude     = "^(autofs|binfmt_misc|bpf|cgroup2?|configfs|debugfs|devpts|devtmpfs|tmpfs|fusectl|hugetlbfs|iso9660|mqueue|nsfs|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|selinuxfs|squashfs|sysfs|tracefs)$"
              mount_points_exclude = "^/(dev|proc|run/credentials/.+|sys|var/lib/docker/.+)($|/)"
              mount_timeout        = "5s"
            }
          
            netclass {
              ignored_devices = "^(veth.*|cali.*|[a-f0-9]{15})$"
            }
          
            netdev {
              device_exclude = "^(veth.*|cali.*|[a-f0-9]{15})$"
            }
          }
          
          prometheus.scrape "integrations_node_exporter" {
            targets    = discovery.relabel.integrations_node_exporter.output
            forward_to = [prometheus.relabel.integrations_node_exporter.receiver]
          }
          
          prometheus.relabel "integrations_node_exporter" {
            forward_to = [prometheus.remote_write.metrics_service.receiver]
            
            rule {
              source_labels = ["__name__"]
              regex         = "up|node_arp_entries|node_boot_time_seconds|node_context_switches_total|node_cpu_seconds_total|node_disk_io_time_seconds_total|node_disk_io_time_weighted_seconds_total|node_disk_read_bytes_total|node_disk_read_time_seconds_total|node_disk_reads_completed_total|node_disk_write_time_seconds_total|node_disk_writes_completed_total|node_disk_written_bytes_total|node_filefd_allocated|node_filefd_maximum|node_filesystem_avail_bytes|node_filesystem_device_error|node_filesystem_files|node_filesystem_files_free|node_filesystem_readonly|node_filesystem_size_bytes|node_intr_total|node_load1|node_load15|node_load5|node_md_disks|node_md_disks_required|node_memory_Active_anon_bytes|node_memory_Active_bytes|node_memory_Active_file_bytes|node_memory_AnonHugePages_bytes|node_memory_AnonPages_bytes|node_memory_Bounce_bytes|node_memory_Buffers_bytes|node_memory_Cached_bytes|node_memory_CommitLimit_bytes|node_memory_Committed_AS_bytes|node_memory_DirectMap1G_bytes|node_memory_DirectMap2M_bytes|node_memory_DirectMap4k_bytes|node_memory_Dirty_bytes|node_memory_HugePages_Free|node_memory_HugePages_Rsvd|node_memory_HugePages_Surp|node_memory_HugePages_Total|node_memory_Hugepagesize_bytes|node_memory_Inactive_anon_bytes|node_memory_Inactive_bytes|node_memory_Inactive_file_bytes|node_memory_Mapped_bytes|node_memory_MemAvailable_bytes|node_memory_MemFree_bytes|node_memory_MemTotal_bytes|node_memory_SReclaimable_bytes|node_memory_SUnreclaim_bytes|node_memory_ShmemHugePages_bytes|node_memory_ShmemPmdMapped_bytes|node_memory_Shmem_bytes|node_memory_Slab_bytes|node_memory_SwapTotal_bytes|node_memory_VmallocChunk_bytes|node_memory_VmallocTotal_bytes|node_memory_VmallocUsed_bytes|node_memory_WritebackTmp_bytes|node_memory_Writeback_bytes|node_netstat_Icmp6_InErrors|node_netstat_Icmp6_InMsgs|node_netstat_Icmp6_OutMsgs|node_netstat_Icmp_InErrors|node_netstat_Icmp_InMsgs|node_netstat_Icmp_OutMsgs|node_netstat_IpExt_InOctets|node_netstat_IpExt_OutOctets|node_netstat_TcpExt_ListenDrops|node_netstat_TcpExt_ListenOverflows|node_netstat_TcpExt_TCPSynRetrans|node_netstat_Tcp_InErrs|node_netstat_Tcp_InSegs|node_netstat_Tcp_OutRsts|node_netstat_Tcp_OutSegs|node_netstat_Tcp_RetransSegs|node_netstat_Udp6_InDatagrams|node_netstat_Udp6_InErrors|node_netstat_Udp6_NoPorts|node_netstat_Udp6_OutDatagrams|node_netstat_Udp6_RcvbufErrors|node_netstat_Udp6_SndbufErrors|node_netstat_UdpLite_InErrors|node_netstat_Udp_InDatagrams|node_netstat_Udp_InErrors|node_netstat_Udp_NoPorts|node_netstat_Udp_OutDatagrams|node_netstat_Udp_RcvbufErrors|node_netstat_Udp_SndbufErrors|node_network_carrier|node_network_info|node_network_mtu_bytes|node_network_receive_bytes_total|node_network_receive_compressed_total|node_network_receive_drop_total|node_network_receive_errs_total|node_network_receive_fifo_total|node_network_receive_multicast_total|node_network_receive_packets_total|node_network_speed_bytes|node_network_transmit_bytes_total|node_network_transmit_compressed_total|node_network_transmit_drop_total|node_network_transmit_errs_total|node_network_transmit_fifo_total|node_network_transmit_multicast_total|node_network_transmit_packets_total|node_network_transmit_queue_length|node_network_up|node_nf_conntrack_entries|node_nf_conntrack_entries_limit|node_os_info|node_sockstat_FRAG6_inuse|node_sockstat_FRAG_inuse|node_sockstat_RAW6_inuse|node_sockstat_RAW_inuse|node_sockstat_TCP6_inuse|node_sockstat_TCP_alloc|node_sockstat_TCP_inuse|node_sockstat_TCP_mem|node_sockstat_TCP_mem_bytes|node_sockstat_TCP_orphan|node_sockstat_TCP_tw|node_sockstat_UDP6_inuse|node_sockstat_UDPLITE6_inuse|node_sockstat_UDPLITE_inuse|node_sockstat_UDP_inuse|node_sockstat_UDP_mem|node_sockstat_UDP_mem_bytes|node_sockstat_sockets_used|node_softnet_dropped_total|node_softnet_processed_total|node_softnet_times_squeezed_total|node_systemd_unit_state|node_textfile_scrape_error|node_time_zone_offset_seconds|node_timex_estimated_error_seconds|node_timex_maxerror_seconds|node_timex_offset_seconds|node_timex_sync_status|node_uname_info|node_vmstat_oom_kill|node_vmstat_pgfault|node_vmstat_pgmajfault|node_vmstat_pgpgin|node_vmstat_pgpgout|node_vmstat_pswpin|node_vmstat_pswpout|process_max_fds|process_open_fds"
              action        = "keep"
            }
          }
          
          // Loki
          loki.source.journal "logs_integrations_integrations_node_exporter_journal_scrape" {
            max_age       = "24h0m0s"
            relabel_rules = discovery.relabel.logs_integrations_integrations_node_exporter_journal_scrape.rules
            forward_to    = [loki.write.grafana_cloud_loki.receiver]
          }
          
          local.file_match "logs_integrations_integrations_node_exporter_direct_scrape" {
            path_targets = [{
              __address__ = "localhost",
              __path__    = "/var/log/{syslog,messages,*.log}",
              instance    = constants.hostname,
              job         = "integrations/node_exporter",
            }]
          }
          
          discovery.relabel "logs_integrations_integrations_node_exporter_journal_scrape" {
            targets = []
          
            rule {
              source_labels = ["__journal__systemd_unit"]
              target_label  = "unit"
            }
          
            rule {
              source_labels = ["__journal__boot_id"]
              target_label  = "boot_id"
            }
          
            rule {
              source_labels = ["__journal__transport"]
              target_label  = "transport"
            }
          
            rule {
              source_labels = ["__journal_priority_keyword"]
              target_label  = "level"
            }

            rule {
              source_labels = ["__journal_syslog_identifier"]
              target_label  = "syslog_identifier"
            }

            rule {
              target_label = "instance"
              replacement  = constants.hostname
            }

          }
          
          loki.source.file "logs_integrations_integrations_node_exporter_direct_scrape" {
            targets    = local.file_match.logs_integrations_integrations_node_exporter_direct_scrape.targets
            forward_to = [loki.write.grafana_cloud_loki.receiver]
          }

          // Docker
           prometheus.exporter.cadvisor "integrations_cadvisor" {
             docker_only = true
           }
           discovery.relabel "integrations_cadvisor" {
               targets = prometheus.exporter.cadvisor.integrations_cadvisor.targets
           
               rule {
                   target_label = "job"
                   replacement  = "integrations/docker"
               }
           
               rule {
                   target_label = "instance"
                   replacement  = constants.hostname
               }
           }
           
           prometheus.relabel "integrations_cadvisor" {
           	forward_to = [prometheus.remote_write.metrics_service.receiver]
           
           	rule {
           		source_labels = ["__name__"]
           		regex         = "up|container_cpu_usage_seconds_total|container_fs_inodes_free|container_fs_inodes_total|container_fs_limit_bytes|container_fs_usage_bytes|container_last_seen|container_memory_usage_bytes|container_network_receive_bytes_total|container_network_tcp_usage_total|container_network_transmit_bytes_total|container_spec_memory_reservation_limit_bytes|machine_memory_bytes|machine_scrape_error"
           		action        = "keep"
           	}
           }
           
           prometheus.scrape "integrations_cadvisor" {
               targets    = discovery.relabel.integrations_cadvisor.output
               forward_to = [prometheus.relabel.integrations_cadvisor.receiver]
           }
           
           discovery.docker "logs_integrations_docker" {
               host             = "unix:///var/run/docker.sock"
               refresh_interval = "5s"
           }

           discovery.relabel "logs_integrations_docker" {
               targets = []
           
               rule {
                   target_label = "job"
                   replacement  = "integrations/docker"
               }
           
               rule {
                   target_label = "instance"
                   replacement  = constants.hostname
               }
           
               rule {
                   source_labels = ["__meta_docker_container_name"]
                   regex         = "/(.*)"
                   target_label  = "container"
               }
           
               rule {
                   source_labels = ["__meta_docker_container_log_stream"]
                   target_label  = "stream"
               }
           }

          loki.source.docker "logs_integrations_docker" {
              host             = "unix:///var/run/docker.sock"
              targets          = discovery.docker.logs_integrations_docker.targets
              forward_to       = [loki.write.grafana_cloud_loki.receiver]
              relabel_rules    = discovery.relabel.logs_integrations_docker.rules
              refresh_interval = "5s"
          }

I don’t have the full Agent configuration, but here is the snippet I was using to collect docker logs

Grafana Agent docker logs
    - docker_sd_configs:
      - host: unix:///var/run/docker.sock
        refresh_interval: 5s
      job_name: integrations/docker
      relabel_configs:
      - replacement: yavin4
        target_label: instance
      - replacement: integrations/docker
        target_label: integration
      - source_labels:
        - __meta_docker_container_id
        target_label: containerid
      - regex: /(.*)
        source_labels:
        - __meta_docker_container_name
        target_label: container
      - source_labels:
        - __meta_docker_container_log_stream
        target_label: stream
      - source_labels:
        - __meta_docker_container_label_appname
        target_label: appname
      - source_labels:
        - __meta_docker_container_label_env
        target_label: env
      - source_labels:
        - __meta_docker_container_label_custom
        target_label: custom
    target_config:
      sync_period: 10s

Your agent and alloy config is missing.

You are right, my apologies. I’ve edited my main post and included my config.


Although you didn’t specify how the Alloy container is executed, I suspect your problem is due to state storage.

Alloy, like any other logging agent, needs to keep state of the files it has scraped and their positions. In a bare-metal installation this is not a problem, but if you run Alloy as a container then, unless you use a bind mount, you will lose that state every time the Alloy container restarts, thereby generating a lot of log data due to Alloy re-scraping all log files again.

So set up a host bind path, mount it into the container, and use it for storage. Something like this:

mkdir /opt/alloy/storage
chmod 777 /opt/alloy/storage

docker run \
  -v <CONFIG> \
  -v /opt/alloy/storage:/opt/alloy/storage:rw \
  grafana/alloy:latest \
    run --storage.path=/opt/alloy/storage \
    <OTHER_FLAGS>

Hey @tonyswumac, I didn’t specify how I’m running it as a container because I’m not running it as a container :slight_smile: but I’m using it to collect logs from all the other containers.

I see, then it’s probably not the storage. Couple of things I’d suggest trying:

  1. Take a look at the Alloy storage (default is /var/lib/alloy/data if I remember correctly) and see if the positions file correctly recorded the docker containers being scraped for logs and their positions. If not, this could be a bug. I’ll also do some testing on this when I have some time.
  2. Try changing your Loki configuration to scrape from files directly instead of the docker unix socket. You’ll lose some labeling information from Docker, so I’d suggest testing it out first. We scrape Docker log files directly with Alloy ourselves, and we supply docker labels during deployment to make up for this (see the sketch below).
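
If it helps, scraping the Docker json-file logs directly could look roughly like the following. This is only a sketch: it assumes the default json-file logging driver and the standard /var/lib/docker/containers path, and the component names are made up.

local.file_match "docker_container_files" {
  // Assumes the default json-file logging driver and log location.
  path_targets = [{
    __path__ = "/var/lib/docker/containers/*/*-json.log",
    job      = "integrations/docker",
    instance = constants.hostname,
  }]
}

loki.source.file "docker_container_files" {
  targets    = local.file_match.docker_container_files.targets
  forward_to = [loki.process.docker_container_files.receiver]
}

loki.process "docker_container_files" {
  forward_to = [loki.write.grafana_cloud_loki.receiver]

  // Unwrap the json-file format (log, stream, time) and use its
  // timestamp as the entry timestamp.
  stage.docker {}
}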

Thanks for the suggestions.

  1. If I’m not mistaken the position file is in /etc/alloy, where I can see a folder called loki.source.docker.logs_integrations_docker containing a file named positions.yml. I’m not sure how to read the content of positions.yml, but at first glance it seems like it’s scraping things alright, since I also see logs coming into Grafana Cloud. Anyway, I’ve also checked the content of /var/lib/alloy/data, which is this:
-rw-r--r-- 1 alloy alloy 116 May  4 12:16 alloy_seed.json
drwxr-xr-x 1 alloy alloy   6 May  4 12:16 prometheus.remote_write.metrics_service
drwxr-x--- 1 alloy alloy   0 May  4 12:16 remotecfg

  2. I guess I could try, but I actually followed the configuration suggested by Grafana Cloud when enabling the Docker integration; it’s a simple copy-paste of the Docker integration. The only change I made was to add a custom label so that the instance name comes along with all the other labels.

I’ve stopped Alloy and all containers, removed the position file, and restarted everything. The discarded logs are still slowly increasing until they eventually grow out of control.
For completeness, I have only two containers in this instance:

  • caddy configured as reverse proxy;
  • plex server that I use to stream my personal library.

  1. I did a quick local test and I wasn’t able to reproduce your problem. I’ll do another test some time today with our test Loki cluster and try again.
  2. Have you determined that the discarded logs are indeed Docker logs? If you comment out the Docker-related logs (keep only file and journal in your config), do you still see discarded logs in Grafana Cloud? Also, were you able to determine the discard reason?
  3. You can try adding a loki.process block in your docker logs pipeline to discard older logs. This is of course a work-around (if it works), not something I’m advocating you keep permanently (see the wiring sketch below the snippet):
  stage.drop {
    source = "timestamp"
    older_than = "15m"
    drop_counter_reason = "too old"
  }
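
For that to take effect, the docker source would need to forward through the new loki.process component instead of writing to Loki directly, roughly like this (a sketch; the component name is made up, and this variant relies only on older_than, which is checked against the entry timestamp):

loki.process "docker_drop_old" {
  forward_to = [loki.write.grafana_cloud_loki.receiver]

  // Drop entries whose timestamp is more than 15 minutes behind now.
  stage.drop {
    older_than          = "15m"
    drop_counter_reason = "too old"
  }
}

In loki.source.docker, forward_to would then point at loki.process.docker_drop_old.receiver instead of loki.write.grafana_cloud_loki.receiver.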

Yes, I tried commenting out the Docker part of the configuration and confirmed that it is what’s causing the discarded logs.

The discard reason is that the timestamp is too old. Here is an example of the error, found by querying the grafana-usage-insight data source.

level=error ts=2024-07-03T20:56:46.819721559Z caller=manager.go:49 component=distributor path=write insight=true msg="write operation failed" details="entry for stream '{container=\"caddy\", instance=\"hoth\", job=\"integrations/docker\", service_name=\"caddy\", stream=\"stderr\"}' has timestamp too old: 2024-06-26T17:02:06Z, oldest acceptable timestamp is: 2024-06-26T20:56:46Z" org_id=129557

I did another more thorough test, and I still could not reproduce this. I included some details on what I did below.

One thing that just occurred to me: are your Docker logs in JSON format? Are you parsing your Docker logs at all? If you aren’t, you should, at least for the timestamp part, because this could be your problem. For example, try adding this to your configuration:

discovery.docker "logs_integrations_docker" {
  host             = "unix:///var/run/docker.sock"
  refresh_interval = "5s"
}

discovery.relabel "logs_integrations_docker" {
  targets = []

  rule { ... } // this is your rule section, not going to repeat it.
}

loki.source.docker "logs_integrations_docker" {
  host             = "unix:///var/run/docker.sock"
  targets          = discovery.docker.logs_integrations_docker.targets
  // forward_to       = [loki.write.grafana_cloud_loki.receiver]
  forward_to       = [loki.process.docker_local.receiver]
  relabel_rules    = discovery.relabel.logs_integrations_docker.rules
  refresh_interval = "5s"
}

loki.process "docker_local" {
  forward_to = [loki.write.grafana_cloud_loki.receiver]

  stage.json {
    expressions = {
      time = "",
    }
  }

  stage.timestamp {
    format = "RFC3339Nano"
    source = "time"
  }
}

My test environment:

  1. A docker container outputting about 100 random logs per second.
  2. Copy-pasted your configuration as closely as I could (the docker and loki parts), and sent logs to our Loki cluster via Alloy.
  3. Observe metrics.

Loki log volume:

Discarded log metric:

Logs are actually in JSON format, so I can try your suggestion. Could you also share what container you are using to generate the random logs, so that I can try a similar setup too?

I just used a Python script that essentially does this (from memory, I don’t have it running anymore):

from time import sleep

i = 1

while True:
    print(f"log {i}")
    if i > 10000000:
        i = 1
    else:
        i += 1
    sleep(10 / 1000)

Ok, so I did some more tests and here is some info:
I created a container with a script similar to yours, but it would not trigger any visible issue. This made me think about the different conditions my caddy and plex containers are in, namely that they are on the same docker network (I have absolutely no idea why this should be relevant, but it is nonetheless a difference). So what I did was:

  • enabled logging in my Caddyfile;
  • created a new container in the same network as caddy;
  • looped a curl request towards caddy using Docker DNS, so essentially I ran while true; do curl caddy; done.

This seems to have created a situation where Grafana Cloud is discarding logs as being too far behind. Caddy’s logs are in JSON format, and for this test I did not try your suggestion of parsing the JSON, because I wanted to find a way to replicate the issue first. Also, I’m not sure I fully understand why parsing the timestamp should make a difference. I expect Alloy to send the log the moment it is available to it, which means the timestamp in the JSON has already been generated and is probably a few seconds older? Does that make sense?

Interesting. So essentially having two containers running on the same network, both being scraped for logs, would trigger this.

I’ll have to test this out. Are you able to quickly run a test by running two containers with the simple script on the same network and see what happens?

Also, I suggested parsing the JSON for the timestamp to see if perhaps Alloy is retrying logs from Docker for some reason. If it is, then without parsing for the timestamp it would treat all the logs as having the same timestamp (the time of collection). I’d still encourage you to try parsing for the timestamp and see what happens.

So I did as you suggested.

I created two containers, both on the same network.
One is running Caddy logging all requests, the other one is running Ubuntu where I manually run curl caddy in a loop.

If I don’t parse the JSON log, I get a high volume of discarded logs. If I stop everything and tell Alloy to parse the JSON as per your suggestion, this does not seem to happen anymore.

It seems your hypothesis that Alloy is retrying logs from Docker might be accurate, but I’m still having some difficulty understanding what is actually happening. Without parsing the timestamp from the JSON, I assumed that Alloy would attach a time of collection to a log line once collected, and then attempt to send that log with the time of collection as the timestamp. If this is correct, I would assume the time of collection to be “closer to the current time” than the timestamp in the JSON, because the time of collection happens later. So how can it be that without parsing the timestamp from the JSON, logs get discarded for being too far behind??
This confuses me :thinking:

Let me guess: those unparsed logs have a timestamp in some local (non-UTC) timezone, while the parsed ones have a proper UTC timestamp (or at least one with a proper timezone offset).
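
If that were the case, the timestamp stage can be told which zone to assume when the format carries no offset. A sketch only; the source field, format string, and zone here are assumptions for illustration, not taken from this thread:

stage.timestamp {
  source   = "time"
  format   = "2006-01-02T15:04:05" // assumed: a layout with no zone offset
  location = "Europe/Rome"         // assumed zone, used when the layout has no offset information
}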

I’m afraid I spoke too soon :frowning:

Interesting. I’ll try to put together a test case tomorrow. Meanwhile, can you see if adding a drop stage with a parsed timestamp works or not? (See the sketch after the snippet for how it would plug in.)

  stage.drop {
    source = "timestamp" // or whatever field you parse the timestamp into
    older_than = "15m"
    drop_counter_reason = "too old"
  }
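
Wired in after the timestamp parsing from the earlier snippet, the docker pipeline would then look roughly like this (a sketch combining the pieces above; it still assumes the logs carry an RFC3339Nano timestamp in a time field, and loki.source.docker keeps forwarding to loki.process.docker_local.receiver):

loki.process "docker_local" {
  forward_to = [loki.write.grafana_cloud_loki.receiver]

  // Extract the timestamp from the JSON log line...
  stage.json {
    expressions = {
      time = "",
    }
  }

  // ...promote it to the entry timestamp...
  stage.timestamp {
    format = "RFC3339Nano"
    source = "time"
  }

  // ...and drop anything that is still more than 15 minutes behind now.
  stage.drop {
    older_than          = "15m"
    drop_counter_reason = "too old"
  }
}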

This seems to work. This is almost 1 hour of receiving logs from Caddy in the same way I did before, and there are no discarded logs.