I’m currently running Tempo on our Docker Swarm using the “local” storage backend. We receive between 500 and 1000 spans per second.
Every day or two, our host reports a massive spike in disk read activity, completely saturating the CPU with iowait and eventually bringing down the swarm. If we stop sending spans to Tempo, the problem goes away. If we move our Tempo instance to a new swarm node (and redirect the traffic to it), the issue follows it. So it’s definitely Tempo causing this behaviour. Tempo is limited in vCPU.
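For context, this is roughly how I’m attributing the disk reads to the Tempo process, using the Linux `/proc/<pid>/io` counters (the process name `tempo` is how it appears on my host; adjust for your container setup — the snippet falls back to the current shell’s PID purely so it runs anywhere for demonstration):

```shell
# Find the Tempo process; fall back to this shell's PID if none is running,
# so the snippet is still demonstrable on any Linux box.
pid=$(pgrep -x tempo | head -n1)
pid=${pid:-$$}

# read_bytes = cumulative bytes this process caused to be fetched from the
# storage layer (actual disk reads, not page-cache hits).
reads=$(awk '/^read_bytes/ {print $2}' /proc/"$pid"/io)
echo "pid=$pid read_bytes=$reads"
```

Sampling this in a loop during a spike shows `read_bytes` climbing in step with the host’s iowait.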
Sometimes, but seemingly not always, these spikes line up with spikes in Tempo’s process_open_fds metric.
Is the solution to move to a different storage backend? Has anyone else experienced/solved something similar?
This is my current Tempo config:
```yaml
target: all

server:
  http_listen_port: 3200
  log_level: info

distributor:
  receivers:
    otlp:
      protocols:
        http:
        grpc:

compactor:
  compaction:
    block_retention: 72h
    compacted_block_retention: 15m

ingester:
  max_block_duration: 15m

storage:
  trace:
    backend: local
    block:
      v2_encoding: zstd
    wal:
      path: /tmp/tempo/wal
      v2_encoding: snappy
    local:
      path: /tmp/tempo/blocks
    pool:
      max_workers: 100
      queue_depth: 10000
```