Tempo performance troubleshooting

Hello,

I’m evaluating Grafana Tempo as our tracing solution. So far I’ve installed it in distributed mode and have been experimenting with full-backend search.
I run the same search query over a 2-hour time range and have tried adjusting a few parameters.
The api_search latency on the query_frontend and querier is quite high:
query_frontend:

querier:

At the same time, backend latency is much lower:

I tried increasing the number of queriers (from 2 to 3), but it didn’t help at all.
I also adjusted the configuration following the recommendations in tempo/backend_search.md at main · grafana/tempo · GitHub, but with zero improvement.

Can someone advise where to look to reduce the search time?

Here is a diff of my config:

GET /status/config
---
compactor:
    compaction:
        block_retention: 168h0m0s
        max_block_bytes: 3221225472
    ring:
        kvstore:
            store: memberlist
distributor:
    receivers:
        kafka:
            auth:
                tls:
                    ca_file: /tmp/ca.crt
                    insecure: true
            brokers: kafka-cluster-kafka-bootstrap.kafka.svc.cluster.local:9093
            client_id: tempo-ingester
            encoding: otlp_proto
            group_id: tempo-ingester
            message_marking:
                after: true
                on_error: false
            protocol_version: 2.8.0
            topic: otlp-tracing
ingester:
    lifecycler:
        readiness_check_ring_health: false
        tokens_file_path: /var/tempo/tokens.json
memberlist:
    abort_if_cluster_join_fails: false
    dead_node_reclaim_time: 10s
    join_members:
        - grafana-tempo-tempo-distributed-gossip-ring
overrides:
    per_tenant_override_config: /conf/overrides.yaml
querier:
    frontend_worker:
        frontend_address: grafana-tempo-tempo-distributed-query-frontend-discovery:9095
        parallelism: 10
    max_concurrent_queries: 10
    search_query_timeout: 1m30s
query_frontend:
    query_shards: 30
    search:
        concurrent_jobs: 300
        max_duration: 2h0m0s
search_enabled: true
server:
    http_listen_port: 3100
    http_server_read_timeout: 2m0s
storage:
    trace:
        backend: s3
        cache: memcached
        local:
            path: /var/tempo/traces
        memcached:
            addresses: dns+memcached:11211
            circuit_breaker_consecutive_failures: 0
            circuit_breaker_interval: 0s
            circuit_breaker_timeout: 0s
            consistent_hash: true
            host: ""
            max_idle_conns: 16
            max_item_size: 0
            service: ""
            timeout: 1s
            ttl: 0s
            update_interval: 1m0s
        pool:
            max_workers: 100
            queue_depth: 10000
        s3:
            bucket: kubernetes-tracing-us-east-1
            endpoint: s3.amazonaws.com
        search:
            prefetch_trace_count: 20000
        wal:
            blocksfilepath: /var/tempo/wal/blocks
            completedfilepath: /var/tempo/wal/completed
target: querier

To make a meaningful impact on search time, I would continue to increase some of the values you have already been adjusting:

  • Scale tempo queriers (try 10, 20, 50)
  • Increase query_frontend.max_outstanding_per_tenant and query_frontend.search.concurrent_jobs (try 1000, 2000, 3000)
  • Increase querier.max_concurrent_queries
    • 10, 20 if not using serverless
    • 100 if using serverless
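Applied to your posted config, the suggestions above would look roughly like this (the exact numbers are starting points to experiment with, not tested recommendations; `max_outstanding_per_tenant` is a query_frontend option you don’t currently have set):

```yaml
# Sketch of the scaled-up settings suggested above; values are
# starting points to experiment with, not tuned recommendations.
querier:
    max_concurrent_queries: 20        # 10-20 without serverless, ~100 with
query_frontend:
    max_outstanding_per_tenant: 2000  # try 1000 / 2000 / 3000
    search:
        concurrent_jobs: 2000         # try 1000 / 2000 / 3000
```

Bump these alongside the querier replica count; raising concurrent_jobs without more queriers (or serverless) just deepens the queue.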

At some point the amount you have to scale will exceed the amount of unused resources you are willing to leave lying around, at which point you should try serverless.

Ultimately, right now, Tempo search is just slow and resource intensive. We are currently working on an improved backend format which will make things faster. Until then, the only options are to scale up massively and use serverless.
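For reference, going serverless means pointing the queriers at externally hosted search functions in the config. In the Tempo versions that still use the `search_enabled` flag, I believe the hook looked roughly like the sketch below; the option name and the URL are from memory / placeholders, so verify against the serverless backend search docs for your exact version:

```yaml
# Rough sketch, not verified against your version: hand backend search
# jobs off to external serverless functions. Option name and URL are
# assumptions; check the Tempo serverless search docs before using.
querier:
    search_external_endpoints:
        - https://<your-serverless-search-endpoint>
```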

Thanks a lot, @joeelliott!