Cannot find trace IDs in S3 blocks

Traces cannot be found in S3 after they have been flushed from the ingesters; queries return a 404 instead.

Our 0.5.0 install seemed to work fine for a while (30+ days) until we noticed we
could not pull traces from S3 about 30 minutes after they were generated. The
querier logs msg="failed to retrieve block meta" and found=false messages
while looking through blocks for the trace IDs.

Some of the blocks in S3 appear to be incomplete: they contain only bloom-# and
index files, but no meta.json or meta.compacted.json.

I shut down all the app components, wiped the S3 bucket contents, and brought
everything back up on version 0.6.0. That seemed to work again for about a week: I was
initially able to pull traces older than 45 minutes, and the queriers logged found=true
messages for queries. After about a week, the 404s started to occur again.

The only outlier I’ve found so far is that the compactor started logging
msg="failed to get ring" and had no live replicas in the ring. I’ve shut down
all the components and restarted the memberlist ring, but that did not help.

tempo configmap (distributor/ingester/compactor)

data:
  overrides.yaml: |
    overrides: {}
  tempo.yaml: |
    auth_enabled: false

    compactor:
      compaction:
          block_retention: 360h               # duration to keep blocks
          compacted_block_retention: 1h       # duration to keep blocks that have been compacted elsewhere
          compaction_window: 1h               # blocks in this time window will be compacted together
          chunk_size_bytes: 10485760          # amount of data to buffer from input blocks
          flush_size_bytes: 31457280          # flush data to backend when buffer is this large
          max_block_bytes: 107374182400       # Maximum size of a compacted block in bytes. 100GB
      ring:
          kvstore:
              store: memberlist

    distributor:
      receivers:
        jaeger:
          protocols:
            thrift_http:
            grpc:
            thrift_binary:
            thrift_compact:
        opencensus: null
        otlp:
          protocols:
            http:
            grpc:
        zipkin:

    ingester:
      lifecycler:
        ring:
          replication_factor: 2
      trace_idle_period: 30s
      max_block_bytes: 1_000_000_000
      max_block_duration: 1h

    memberlist:
      abort_if_cluster_join_fails: false
      bind_port: 7946
      join_members:
        - tempo-gossip-ring.telemetry.svc.cluster.local:7946
      # rejoin_interval: 15m

    overrides:
      max_traces_per_user: 1000000
      per_tenant_override_config: /etc/tempo/overrides.yaml

    server:
      http_listen_port: 3100

    storage:
      trace:
        backend: s3
        blocklist_poll: 5m
        block:
          encoding: zstd
        cache: redis
        pool:
          max_workers: 50
          queue_depth: 2000
        redis:
          endpoint: tempo-redis:6379
          timeout: 500ms
        s3:
          bucket: example-prd-tempo-storage
          endpoint: s3.us-west-2.amazonaws.com
          insecure: true
          # region: us-west-2
        wal:
          # bloom_filter_false_positive: .05 # bloom filter false positive rate.  lower values create larger filters but fewer false positives
          path: /var/tempo/wal

tempo-query configmap

data:
  tempo.yaml: |
    auth_enabled: false
    ingester:
      lifecycler:
        ring:
          replication_factor: 2
    memberlist:
      abort_if_cluster_join_fails: false
      bind_port: 7946
      join_members:
        - tempo-gossip-ring.telemetry.svc.cluster.local:7946
    overrides:
      per_tenant_override_config: /conf/overrides.yaml
    querier:
      frontend_worker:
        frontend_address: tempo-query-frontend-discovery.telemetry.svc.cluster.local:9095
    server:
      http_listen_port: 3100
      log_level: info
    storage:
      trace:
        backend: s3
        block:
          encoding: zstd
        blocklist_poll: 5m
        cache: redis
        redis:
          endpoint: tempo-redis:6379
          timeout: 500ms
        s3:
          bucket: example-prd-tempo-storage
          endpoint: s3.us-west-2.amazonaws.com
          insecure: true
          # region: us-west-2
        pool:
            max_workers: 200
            queue_depth: 2000
        wal:
            path: /var/tempo/wal

tempo-query-frontend configmap

data:
  tempo.yaml: |
    auth_enabled: false
    ingester:
      lifecycler:
        ring:
          replication_factor: 2
    memberlist:
      abort_if_cluster_join_fails: false
      bind_port: 7946
      join_members:
        - tempo-gossip-ring.telemetry.svc.cluster.local:7946
    overrides:
      per_tenant_override_config: /conf/overrides.yaml
    query_frontend:
      query_shards: 10
    server:
      http_listen_port: 3100
      log_level: info
    storage:
      trace:
        backend: s3
        blocklist_poll: 5m
        block:
          encoding: zstd
        cache: redis
        redis:
          endpoint: tempo-redis:6379
          timeout: 500ms
        s3:
          bucket: example-prd-tempo-storage
          endpoint: s3.us-west-2.amazonaws.com
          insecure: true
          # region: us-west-2
        pool:
            max_workers: 200
            queue_depth: 2000
        wal:
            path: /var/tempo/wal

So this is obviously quite strange and not something we’ve seen ourselves. We use GCS rather than S3, though, so perhaps this is S3-related. The compactor log message is concerning, but first I want to focus on the partial blocks.

Can you share the following:

  • Value of the metric tempodb_blocklist_length (only queriers/compactors will have this)
  • How many block guid “subfolders” are in your s3 bucket? s3://example-prd-tempo-storage/single-tenant/*
  • Do your compactors ever restart unexpectedly either due to failure or OOM?
  • List of multipart uploads: aws s3api list-multipart-uploads --bucket example-prd-tempo-storage
  • Can you attach 10 or so minutes of compactor/querier logs? I want to see if anything unexpected is happening with polling.

My working theory is that compactors are for some reason creating partial blocks and overloading the polling job. This information will help us determine if that’s true.

Update: this is apparently no longer true. After the compactor ran over the weekend and caught up, I’m able to locate traces that could not be found last week. I think keeping the compactors healthy and running is the source of my problems.

8989, now that the compactor has started working again.

Not often, but yes, they have restarted before. The transition to 0.6.0 and enabling compression consumed more memory, and they OOMed for a bit while I was resizing them to fit the new demand. I think the failures/OOMs are part of the reason they end up with a broken ring. I might need to move them to a StatefulSet.

Here is the blocklist length over the last 14 days:

There are 59 multipart uploads, all of them single-tenant/<uuid>/data files. Here is a snippet of the multipart uploads:

        {
            "UploadId": "AmaeQMYvFazW1uRJmfz9olCqFd1.HZcgc0dCiMdNNve7rmuPYp3op_ogwVyf3OQwL68cGNh3Ian7jdwhpayi_kSLh95iZBP4cu2iWh.nNPYfbtuBf8iBZlsTIVw42R7QFJa_XNG2bbaDZjp1lpQLCA--",
            "Key": "single-tenant/024f6943-9264-469c-b95c-8d3789d33885/data",
            "Initiated": "2021-03-18T12:25:15.000Z",
            "StorageClass": "STANDARD"
        },
        {
            "UploadId": "u51Ix61fIYTxLlzJWeWkTa9xPUuf24WgLqA91Ph.hFlE26wA3xQ3gqXnrukGdFNWUKTFjwbMSKe8EgFW65b1SYuhLLDUf97rvyeh6WpqcRhQGQ68Fot4T4s9t_A2LUK_KoNPaFNsXIYC9MClsgGD.A--",
            "Key": "single-tenant/03bf6ab5-f496-4ca2-a487-a2fadba8f564/data",
            "Initiated": "2021-03-24T11:37:17.000Z",
            "StorageClass": "STANDARD"
        },
        {
            "UploadId": "c5e86XQk64LpYc7F_HVed29DtSLNHK7jflD1DUmy4vk3DBWjQK5p.UACnWpWUGCebe7xuuIHATNMtqJMuzFKVG0SsfOJVkurJWdRBf.YtjB.68onlqRA0dflAlmIFn78.irdZByxujWAozxLK3nrvQ--",
            "Key": "single-tenant/05204c6a-ecd6-41ed-b695-12c68eadffaf/data",
            "Initiated": "2021-03-25T16:51:28.000Z",
            "StorageClass": "STANDARD"
        },

Glad to hear that things are getting better. It looks like your blocklist is ~900 blocks, but you’re seeing 8989 GUIDs under your tenant path. These values should be roughly the same, with maybe 10-20% more GUIDs than blocklist length. Possibly a Tempo bug related to S3?

Your compacted_block_retention appears to be set to 1h, so a block that is compacted should be cleaned up after 1 hour. This, however, all depends on a functioning compactor ring. Do you have any unhealthy compactors at the moment?

If you have no unhealthy compactors for a while and we’re still seeing nearly 8000 more GUIDs under your tenant than in your blocklist, then we need to investigate the cause. Best guesses:

  • Compactors are not correctly detecting that a compacted block should be deleted and removing it.
  • Compactors are regularly uploading partial blocks. This seems unlikely because you only have 59 incomplete multipart uploads (which are probably all either legitimate in-progress uploads or due to the OOMs you mentioned).
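One way to check for partial blocks directly is to group a recursive bucket listing by block GUID and flag any block that has no meta file at all. A rough sketch (the sample keys below are made up; in practice the listing would come from `aws s3 ls --recursive s3://example-prd-tempo-storage/single-tenant/ | awk '{print $4}'`):

```shell
# Sketch: classify blocks from a recursive key listing (one key per line).
# Sample keys for illustration only -- real keys come from the aws CLI.
listing="single-tenant/11111111-aaaa/data
single-tenant/11111111-aaaa/index
single-tenant/11111111-aaaa/meta.json
single-tenant/22222222-bbbb/bloom-0
single-tenant/22222222-bbbb/index"

incomplete=""
for b in $(printf '%s\n' "$listing" | cut -d/ -f2 | sort -u); do
  files=$(printf '%s\n' "$listing" | grep "^single-tenant/$b/")
  case "$files" in
    *meta*) ;;                          # has meta.json or meta.compacted.json
    *) incomplete="$incomplete $b" ;;   # no meta file at all: a partial block
  esac
done
echo "blocks without any meta file:$incomplete"
```

A large or steadily growing count here would point at the partial-block theory.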

I’d be interested in the following metrics as well as compactor logs to help determine the cause:

sum(increase(tempodb_retention_deleted_total{}[1h]))
sum(increase(tempodb_retention_marked_for_deletion_total{}[1h]))

sum(increase(tempodb_compaction_errors_total{}[1h])) by (job)
sum(increase(tempodb_retention_errors_total{}[1h])) by (job)
sum(increase(tempodb_blocklist_poll_errors_total{}[1h])) by (job)

And, final question: of your ~8000 “ghost blocks” that should not exist, are there any obvious patterns? Maybe the same files are always left over, or something similar?

The mention that index/bloom/meta files exist but the data file doesn’t sounded very suspicious, and indeed, digging into it, there is something I might classify as a “bug”. The general design when Tempo writes a block is to write the large data files first, because they are the most resource intensive and likely to fail. When such a write fails, as was the case with the OOMs, cleanup is easily handled with a bucket policy that aborts multipart uploads quickly, and because none of the other files were written either, nothing else needs to be done.

Here, however, the compactor block is completing the multipart upload last (tempo/compactor_block.go at master · grafana/tempo · GitHub). This should be changed so the data upload completes before the index/bloom/meta are written.

I’m not sure why the last call would lead to OOMs. Would be interested to see if any errors are logged.

To add to this conversation, we have had continuous issues with unhealthy compactors while using memberlist: Unhealthy Compactors Stay in the Ring · Issue #142 · grafana/tempo · GitHub

I would say that things are better than they have ever been. It still happens occasionally, but we have found that repeatedly forgetting the unhealthy compactor from a few different processes in the memberlist cluster seems to work well. Here are our current settings that have helped stabilize us:

memberlist:
    leave_timeout: 10s
    left_ingesters_timeout: 30m
    pull_push_interval: 10s

Sorry for the slow response. Had to juggle a work summit last week.

The number of “ghost blocks” is increasing despite the compactors staying healthy. I’m now up to 12584 when I run aws s3 ls example-prd-tempo-storage/single-tenant/ | wc -l. I’ve also noticed a lot of blocks from just a couple of days ago that are not compacted, even though the compactors have been healthy. So far the only thing I’ve noticed they have in common is that most of them have a meta.json instead of a meta.compacted.json.

Here are the metrics you asked for over the last 2 days with healthy compactors:

Only two compaction errors were reported, and no retention or polling errors. The errors logged these messages:

msg="unable to mark block compacted" blockID=1bfd63e7-605f-4fcf-bc88-1a88be88170f tenantID=single-tenant err="error copying obj meta to compacted obj meta: The specified key does not exist."
msg="unable to mark block compacted" blockID=364cabc0-736b-4651-88c3-8017b7686c32 tenantID=single-tenant err="error copying obj meta to compacted obj meta: The specified key does not exist."

I’ve added the memberlist options you mentioned to all of the Tempo ring components. No stability issues so far.

The presence of meta.json means the block hasn’t been compacted yet, which is true for every live block. After compaction it is renamed to meta.compacted.json, and the block is deleted a little later, after the compacted_block_retention period, which looks to be set to 1h in your config.
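A tiny sketch of that lifecycle check, using a made-up file list for one block directory:

```shell
# Sketch: infer a block's lifecycle state from which meta file it carries.
# "files" is a hypothetical space-separated listing of one block directory.
files="data index bloom-0 meta.compacted.json"

case " $files " in
  *" meta.json "*)           state="live (not yet compacted)" ;;
  *" meta.compacted.json "*) state="compacted (removed after compacted_block_retention)" ;;
  *)                         state="partial or unknown" ;;
esac
echo "state: $state"
```

Blocks stuck in the “live” state past a few compaction windows are the ones worth investigating.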

I think we need more information to tell whether there is still an ongoing problem and, for the past errors, what state the bucket data is in and what corrective actions, if any, will be necessary.

Let’s see if we can try to determine some patterns among the 12584 blocks.

(1) Get counts of the number of blocks with data, index, meta.json, and meta.compacted.json files, with a command like aws s3 ls --recursive example-prd-tempo-storage/single-tenant/ | grep data | wc -l. This will help determine which parts are missing and how frequently.

(2) Try running tempo-cli list compaction-summary. It prints a table with per-compaction-level statistics that will help us see what compaction has achieved so far. I think this will require building from source.

(3) What is the value of sum(rate(tempo_distributor_spans_received_total[1h]))? I’d like to understand the throughput of the system to decide what an appropriate block length should be.

(4) Connect to a compactor and check http://localhost:3100/compactor/ring. This shows the members of the ring. See if there are any unhealthy compactors.

(5) The error error copying obj meta to compacted obj meta: The specified key does not exist. happens when the meta.json is deleted while the block is being compacted, so the rename at the end of compaction finds it missing. There are a couple of explanations: (a) another compactor got confused and grabbed the same block, which should not happen because of ring sharding (there is a small window where this can occur under normal circumstances, but it is rare); (b) the S3 bucket lifecycle policy itself happened to delete it at the same time. Does your S3 bucket have a lifecycle policy or other automation that could touch the Tempo files?

The issue of dropped data objects may not be ongoing, but the number of blocks is definitely still growing.

$ aws s3 ls example-prd-tempo-storage/single-tenant/ | wc -l
13365
$ aws s3 ls --recursive example-prd-tempo-storage/single-tenant/ | grep data | wc -l
13363
$ aws s3 ls --recursive example-prd-tempo-storage/single-tenant/ | grep "meta.json" | wc -l
12859
$ aws s3 ls --recursive example-prd-tempo-storage/single-tenant/ | grep "meta.compacted.json" | wc -l
515
$ tempo-cli list compaction-summary single-tenant  --bucket=example-prd-tempo-storage --backend=s3 --s3-endpoint=s3.us-west-2.amazonaws.com

total blocks:  13383

<snip>

Stats by compaction level:

+-----+-------------+--------------------------------+----------------------------+----------------------------+----------------+----------------+
| LVL |   BLOCKS    |             TOTAL              |       SMALLEST BLOCK       |       LARGEST BLOCK        |    EARLIEST    |     LATEST     |
+-----+-------------+--------------------------------+----------------------------+----------------------------+----------------+----------------+
|   0 | 9150 (71 %) | 3,227,659,390 objects (847 GB) | 211,097 objects (72 MB)    | 518,043 objects (106 MB)   | 399h52m48s ago | 4m54s ago      |
|   1 | 1304 (10 %) | 1,797,054,206 objects (466 GB) | 472,178 objects (159 MB)   | 3,348,177 objects (767 MB) | 499h46m59s ago | 8m20s ago      |
|   2 | 944 (7 %)   | 2,647,356,142 objects (677 GB) | 737,332 objects (245 MB)   | 5,836,468 objects (1.8 GB) | 499h39m18s ago | 12m20s ago     |
|   3 | 727 (5 %)   | 3,056,326,322 objects (788 GB) | 1,051,586 objects (325 MB) | 5,972,932 objects (2.0 GB) | 521h24m6s ago  | 1h2m28s ago    |
|   4 | 337 (2 %)   | 1,331,946,965 objects (358 GB) | 1,373,772 objects (409 MB) | 5,983,879 objects (2.0 GB) | 492h54m22s ago | 47h48m10s ago  |
|   5 | 198 (1 %)   | 814,686,250 objects (228 GB)   | 1,605,529 objects (510 MB) | 5,979,280 objects (2.0 GB) | 483h54m29s ago | 139h11m14s ago |
|   6 | 96 (0 %)    | 414,900,841 objects (116 GB)   | 1,902,632 objects (607 MB) | 5,967,869 objects (1.9 GB) | 481h45m54s ago | 143h57m44s ago |
|   7 | 49 (0 %)    | 234,319,065 objects (64 GB)    | 2,121,448 objects (683 MB) | 5,965,578 objects (1.9 GB) | 476h47m27s ago | 147h54m6s ago  |
|   8 | 26 (0 %)    | 118,150,390 objects (35 GB)    | 2,684,429 objects (824 MB) | 5,980,467 objects (1.9 GB) | 448h54m52s ago | 186h47m52s ago |
|   9 | 16 (0 %)    | 73,982,949 objects (21 GB)     | 3,120,638 objects (933 MB) | 5,890,091 objects (1.9 GB) | 412h54m24s ago | 192h52m40s ago |
|  10 | 8 (0 %)     | 40,428,673 objects (11 GB)     | 4,114,611 objects (1.0 GB) | 5,951,686 objects (1.8 GB) | 440h54m0s ago  | 201h56m52s ago |
|  11 | 3 (0 %)     | 14,820,040 objects (4.8 GB)    | 4,802,694 objects (1.4 GB) | 5,177,412 objects (1.7 GB) | 413h52m57s ago | 404h51m39s ago |
|  12 | 2 (0 %)     | 9,721,570 objects (2.7 GB)     | 4,479,829 objects (1.3 GB) | 5,241,741 objects (1.4 GB) | 392h55m37s ago | 378h52m37s ago |
|  13 | 1 (0 %)     | 5,836,103 objects (1.9 GB)     | 5,836,103 objects (1.9 GB) | 5,836,103 objects (1.9 GB) | 358h48m43s ago | 357h51m10s ago |
|  14 | 1 (0 %)     | 5,722,647 objects (1.4 GB)     | 5,722,647 objects (1.4 GB) | 5,722,647 objects (1.4 GB) | 373h49m21s ago | 372h53m22s ago |
|  17 | 1 (0 %)     | 5,905,412 objects (2.0 GB)     | 5,905,412 objects (2.0 GB) | 5,905,412 objects (2.0 GB) | 409h52m7s ago  | 408h50m1s ago  |
+-----+-------------+--------------------------------+----------------------------+----------------------------+----------------+----------------+
(3) Here is the metric across 2 days, which is a pretty good example of our typical load right now.

(4) No, the compactors have been healthy for the last two weeks now.

(5) There is an S3 lifecycle policy of 16d, while Tempo’s block_retention is set to retain for 15d. I thought the 24-hour gap should be enough, but I can bump it out further if that is an issue.

Thank you for gathering all this data; it helps a lot. Some things look good, others don’t seem right. Mostly it looks like a tuning/scaling issue, though.

(1) The file counts look good: 12859 + 515 = 13374, which is roughly in line with the number of blocks, accounting for churn while counting. Agreed that the dropping of data objects seems resolved.

(2) There are a few things of note in this compaction summary:
(a) There are a lot of blocks stuck in level 0 (71%). When compaction is keeping up well, this should be in the range of 20-30% or less.

(b) Compaction up to level 17 is atypical. Compaction processes up to 8 blocks at once, so it generally does not go beyond level 4 or 5. It looks like compaction quickly hits the limit of 6M traces per block by level 2 and cannot proceed further, except in occasional cases where a small block still fits in.

(c) There are blocks with data well beyond the retention window: 400-500+ hours old, when retention should only be 360h. Specifically, all blocks in levels 11 and 17 should have been deleted, as they contain no data within retention. When a block is deleted, the compactors log a message like level=info ts=... caller=retention.go:68 msg="deleting block" blockID=... tenantID=... Is this being logged correctly and in sufficient numbers? Any errors?

There are 2 metrics that should both be increasing at the same rate which indicate successful deletion:

  • tempodb_retention_marked_for_deletion_total
  • tempodb_retention_deleted_total

(3) 30K spans/s is a decent load, but I wouldn’t expect it to generate this much data. I should have asked for these the first time, but could you also run the following queries?

  • Average span size: sum(rate(tempo_distributor_bytes_received_total[1m])) / sum(rate(tempo_distributor_spans_received_total[1m])) For reference, our average span size is 142 bytes.
  • Number of traces being created: sum(rate(tempo_ingester_traces_created_total[1m])) / 2 The division by 2 is to account for replication factor 2.
  • Average spans per trace: sum(rate(tempo_distributor_spans_received_total[1m])) / (sum(rate(tempo_ingester_traces_created_total[1m])) / 2)
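To make the arithmetic behind these ratios concrete, here is the same calculation in shell, with made-up sample rates standing in for the PromQL results (the real values come from your Prometheus):

```shell
# Back-of-the-envelope version of the three queries above.
# All values below are illustrative sample rates, not real measurements.
bytes_per_s=4500000         # sum(rate(tempo_distributor_bytes_received_total[1m]))
spans_per_s=30000           # sum(rate(tempo_distributor_spans_received_total[1m]))
traces_created_per_s=14000  # sum(rate(tempo_ingester_traces_created_total[1m]))
rf=2                        # replication factor

avg_span_size=$((bytes_per_s / spans_per_s))    # bytes per span
traces_per_s=$((traces_created_per_s / rf))     # unique traces per second
spans_per_trace=$((spans_per_s / traces_per_s))
echo "avg span size: ${avg_span_size}B, traces/s: ${traces_per_s}, spans/trace: ${spans_per_trace}"
```

With these sample numbers the workload comes out to 150-byte spans and small (~4-span) traces, which is the shape of workload the rest of this discussion is probing for.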

(5) A 24-hour buffer should be fine.

Next steps:

  • Run the queries above which help us to further characterize this workload.
  • Check the compactor logs for block deletion messages or any other errors.

I suggest the following tuning recommendations:

  • Increase ingester.max_block_bytes to 2_000_000_000. This will create larger, and therefore fewer, blocks initially. That is uncompressed size, so the resulting blocks should be around ~200MB.
  • Set compactor.compaction.max_compaction_objects to 10_000_000. This setting defaults to 6M, and it seems that limit is being reached with relatively small blocks.
  • Try increasing the number of compactors and checking them for bottlenecks on CPU or network I/O. Even with the small compaction limits, there is still a lot of compaction work to do: 71% of the blocks are level 0 and eligible for compaction, so it seems the compactors are not able to keep up.
  • Try reducing compactor.compaction.compaction_window to 30m. This divides up the block time range further and lets compactors work more cooperatively.
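For reference, here is roughly how those suggestions would look in the tempo configmap above (a sketch only; double-check the option names against the docs for your Tempo version before applying):

```yaml
ingester:
  max_block_bytes: 2_000_000_000        # larger (and therefore fewer) initial blocks

compactor:
  compaction:
    compaction_window: 30m              # smaller windows so compactors share work better
    max_compaction_objects: 10_000_000  # raise the per-block object cap from the 6M default
```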

Sorry for the lengthy list of tasks! But we will figure this out.

Marked vs deleted:

Blocks seem to be getting deleted faster than they are being marked…

Avg span per trace:


Traces created:

Avg span size:

The only compactor error in the past two days has been:

err="error shipping block to backend: unexpected error writing bloom-2 error writing object to s3 backend, object single-tenant/69bc7207-105a-4fcb-9ca6-85e6a5f2d9cd/bloom-2: We encountered an internal error.

which lines up with other S3 internal errors we saw. All of the msg="deleting block" messages were successful.

I’ll deploy the recommended tuning changes. Is there a ballpark formula for right-sizing the number of compactors? While one of them gets CPU-bound from time to time, I haven’t seen a time when both are pegged. I’ll look for other bottlenecks, though.

We used to have a ballpark estimate of 20K spans/s per compactor, but since workloads vary so much, and given our direction of moving Tempo to byte-based limits and metrics, I’m not sure that is still appropriate. Here are some metrics for our internal compactor performance at 1M spans/s and RF 3.

  • tempo_distributor_bytes_received_total * 3 / numCompactors = 12-15 MB/s per compactor
  • tempo_distributor_spans_received_total * 3 / numCompactors = 80-100K spans/s per compactor
  • tempo_ingester_traces_created_total / numCompactors = ~2K traces/s per compactor
  • tempo_ingester_bytes_written_total / numCompactors = 20-26MB/s per compactor

For your workload, if we do some rough math:

  • tempo_distributor_spans_received_total: 30K spans/s * 2 / 80K = 0.75 compactors 🙂
  • tempo_distributor_bytes_received_total: 30K spans/s * 2 * 760 bytes/span = ~45 MB/s input; 45 / 12 = 3.75 compactors
  • tempo_ingester_traces_created_total: 7K traces/s * 2 / 2K = 7 compactors
  • tempo_ingester_bytes_written_total: not seeing a data point for this one
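As a sanity check, the bytes-based estimate above can be reproduced with shell arithmetic (values taken from this thread; the 12 MB/s per-compactor figure is the low end of the reference range, and the result is rounded up):

```shell
# Rough compactor sizing from the reference throughput numbers above.
spans_per_s=30000
rf=2
bytes_per_span=760
input_bytes_per_s=$((spans_per_s * rf * bytes_per_span))   # ~45.6 MB/s replicated input
per_compactor_bytes=12000000                               # ~12 MB/s per compactor (low end)
# Ceiling division: how many compactors this input rate implies.
compactors=$(( (input_bytes_per_s + per_compactor_bytes - 1) / per_compactor_bytes ))
echo "estimated compactors needed: $compactors"
```

This matches the ~3.75 figure above once rounded up, so roughly 4 compactors by the bytes estimate (the traces-created estimate suggests more).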

Comparing the two workloads, it looks like yours consists of many smaller traces. This brings to mind the change in 0.6.0 that swapped index downsampling from trace count to bytes: Index Downsample Bytes by joe-elliott · Pull Request #519 · grafana/tempo · GitHub. Are you still on 0.5.0? If so, can you try upgrading to 0.6.0? It gives better control over the sizes of pages and index entries, which might help. That release also introduced max_block_bytes, which I mentioned previously; I apologize if that led to any errors, as it is not present in 0.5.0.