Tempo in Docker Swarm - Disk read activity spikes

I’m running Tempo on our Docker Swarm using the “local” storage backend. We currently receive between 500 and 1000 spans per second.

Every day or two, our host reports a massive spike in disk read activity that saturates the CPU with iowait and eventually brings down the swarm. If we stop sending spans to Tempo, the problem goes away. If we move our Tempo instance to a new swarm instance (and redirect the traffic to it), the issue follows it. So it’s definitely Tempo causing this behaviour. Tempo is limited in vCPU.

Sometimes, but not always, these spikes line up with spikes in Tempo’s process_open_fds.

Is the solution to move to a different storage backend? Has anyone else experienced/solved something similar?

This is my current Tempo config:

target: all

server:
  http_listen_port: 3200
  log_level: info

distributor:
  receivers:
    otlp:
      protocols:
        http:
        grpc:

compactor:
  compaction:
    block_retention: 72h
    compacted_block_retention: 15m

ingester:
  max_block_duration: 15m

storage:
  trace:
    backend: local
    block:
      v2_encoding: zstd
    wal:
      path: /tmp/tempo/wal
      v2_encoding: snappy
    local:
      path: /tmp/tempo/blocks
    pool:
      max_workers: 100
      queue_depth: 10000

Hi :wave:, we don’t run Tempo with the local backend, so we can’t help you much here; we have no experience with it.

My hunch is that it could be compaction or retention: Tempo deletes data older than block_retention, and it might be just that…
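
One way to test that hunch is to make compaction and retention less aggressive and see whether the read spikes shrink or move. This is a rough sketch, not a verified fix: the three added values are illustrative guesses, and it assumes your Tempo version supports these compactor options, so check the configuration reference for your release first.

```yaml
compactor:
  compaction:
    block_retention: 72h             # unchanged from the posted config
    compacted_block_retention: 15m   # unchanged from the posted config
    compaction_window: 30m           # illustrative: a smaller window groups fewer blocks per compaction run
    max_compaction_objects: 1000000  # illustrative: caps the number of traces written into one compacted block
    retention_concurrency: 2         # illustrative: fewer blocks cleaned up in parallel during retention
```

If the spikes change shape or timing after this, compaction or retention is the likely source. If they do not, the hunch is probably wrong.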

We recommend using an object store as the backend if you plan to run Tempo at any sizable scale.

If you are not in the cloud, you can try something like MinIO.
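
A minimal sketch of what the storage section might look like when pointed at MinIO through Tempo’s s3 backend. The bucket name, the endpoint, and the credentials are placeholders rather than values from this thread, and it assumes MinIO is reachable over plain HTTP inside the swarm network.

```yaml
storage:
  trace:
    backend: s3
    s3:
      bucket: tempo                   # hypothetical bucket name; create it in MinIO beforehand
      endpoint: minio:9000            # hypothetical swarm service name and port
      access_key: "<minio-access-key>"
      secret_key: "<minio-secret-key>"
      insecure: true                  # plain HTTP; only sensible on a trusted internal network
    wal:
      path: /tmp/tempo/wal            # the WAL stays on local disk, as in the posted config
```

Blocks then live in the bucket instead of under /tmp/tempo/blocks; the WAL still needs a local path.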
