Distributor/ingester issue consuming traces

Hello! I’ve been banging my head on this all day now. I’ve got a Tempo POC deployment set up in AWS under Fargate, based off my in-production Loki deployment. I have the Tempo distributor configured to receive Jaeger Thrift over HTTP. I’m using Grafana 7.5.0, so I started pulling ‘grafana/tempo:latest’ from Docker Hub after figuring out that the 0.6.0 release didn’t work with Grafana trace queries.

distributor:
  receivers:
    jaeger:
      protocols:
        thrift_http:
          endpoint: 0.0.0.0:14268

And in the ingester is the basic stock config.

ingester:
  lifecycler:
    ring:
      replication_factor: 1
  trace_idle_period: 30s
  max_block_bytes: 1_000_000
  max_block_duration: 1h

Both have the same server block:

server:
  http_listen_address: 0.0.0.0
  grpc_listen_address: 0.0.0.0
  http_listen_port: 3100
  grpc_listen_port: 9095

And they’re using memberlist on port 7649. I’m using the docker-compose example to generate traces and throw them at my distributor, but all I’m getting is this error in the distributor logs:

level=error ts=2021-03-28T00:47:41.97961568Z caller=log.go:27 msg="pusher failed to consume trace data" err="rpc error: code = Unimplemented desc = unknown service tempopb.Pusher"

Following the troubleshooting page in the documentation, I checked the metrics: the distributor records a few hundred traces received, but all of them failed. The ingester shows no sign of any activity. All components run from a common config file, with the same ports exposed and allowed by the security group.

Anything else I can double check?

That’s an odd error. It sounds like the distributors are connecting to a process that doesn’t implement the gRPC service.

Can you share the contents of your ingester ring? You can access it by going to http://<addr>/ingester/ring on the distributors.

Full config (values are templated in at runtime). The gRPC port is set to 9095 and the HTTP port to 3100, the log level is info, and each service gets its own TARGET env value passed in. These match the defaults, but I pull them in from Parameter Store so that the infrastructure (deployed with Terraform) and the container config align. The endpoint discovery URLs are AWS Cloud Map DNS A records that the containers automatically register with. The Jaeger port is hardcoded at the moment because I was flipping between configs trying to figure out why the ingest wasn’t working.

target: {{ env.Getenv "TARGET" "all" }}

auth_enabled: false

server:
  http_listen_address: 0.0.0.0
  grpc_listen_address: 0.0.0.0
  http_listen_port: {{$http_port}}
  grpc_listen_port: {{ $grpc_port }}

  log_level: {{ $log_level }}

distributor:
  receivers:
    jaeger:
      protocols:
        thrift_http:
          endpoint: 0.0.0.0:14268

ingester:
  lifecycler:
    ring:
      replication_factor: 1
  # the length of time after a trace has not received spans to consider it complete and flush it
  trace_idle_period: 30s
  # cut the head block when it hits this size
  max_block_bytes: 1_000_000
  # or after this much time passes
  max_block_duration: 1h

query_frontend:
  query_shards: 10    # number of shards to split the query into

querier:
  frontend_worker:
    frontend_address: {{ $env_query_frontend_discovery }}:{{ $grpc_port }}

compactor:
  ring:
    kvstore:
      store: memberlist
  compaction:
    block_retention: {{ $block_retention }}   # Optional. Duration to keep blocks.  Default is 14 days (336h).
    compacted_block_retention: 1h       # Optional. Duration to keep blocks that have been compacted elsewhere
    compaction_window: 4h               # Optional. Blocks in this time window will be compacted together

storage:
  trace:
    backend: s3
    s3:
      bucket: {{ .Env.S3_BUCKET }}
      endpoint: s3.dualstack.{{ .Env.AWS_REGION }}.amazonaws.com
      region: {{ .Env.AWS_REGION }}
      insecure: false
    cache: redis
    redis:
      endpoint: {{ $env_redis_discovery }}:6379

memberlist:
  # A DNS entry that lists all tempo components
  join_members:
    - dns+{{ $env_compactor_discovery }}:{{ $memberlist_port }}
    - dns+{{ $env_distributor_discovery }}:{{ $memberlist_port }}
    - dns+{{ $env_ingester_discovery }}:{{ $memberlist_port }}
    - dns+{{ $env_querier_discovery }}:{{ $memberlist_port }}
    - dns+{{ $env_query_frontend_discovery }}:{{ $memberlist_port }}

Nothing immediate seems wrong with your config or ingester ring. Some recommendations:

  1. Confirm that the ingesters and distributors are running the same version.
  2. Use a tool like grpcurl (fullstorydev/grpcurl on GitHub, a command-line tool for interacting with gRPC servers) to try to hit the ingesters directly using one of the RPC methods they should support; see tempo.proto in the grafana/tempo repo (master branch).
  3. Check the metric defined in distributor.go in grafana/tempo (at commit 52462ae0d47cdf5818a4daf36e4ac5f47e6bbf60). It should show ingester push failures by ingester, which will help us narrow down which ingester is failing.
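For step 3, a minimal sketch of pulling the per-ingester failure counts out of the distributor's metrics output. It assumes the Prometheus-format text has already been fetched from the distributor and that the metric is named tempo_distributor_ingester_append_failures_total with an ingester label (the name quoted later in this thread); the helper name is hypothetical.

```python
import re

# Metric name as it appears in this thread; the helper below is illustrative only.
METRIC = "tempo_distributor_ingester_append_failures_total"

# Matches lines such as: <metric>{ingester="host:port"} <value>
_LINE = re.compile(
    r"^" + re.escape(METRIC) + r'\{ingester="([^"]+)"\}\s+(\S+)$'
)


def append_failures_by_ingester(metrics_text):
    """Return {ingester address: failure count} from Prometheus text output."""
    failures = {}
    for line in metrics_text.splitlines():
        match = _LINE.match(line.strip())
        if match:
            failures[match.group(1)] = float(match.group(2))
    return failures


# Sample line copied from the thread.
sample = (
    'tempo_distributor_ingester_append_failures_total'
    '{ingester="169.254.172.2:9095"} 38\n'
)
print(append_failures_by_ingester(sample))  # {'169.254.172.2:9095': 38.0}
```

Any ingester address that shows up here with a non-zero count is one the distributor is failing to push to, so it is worth checking whether that address is actually reachable from the distributor.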

Oh man that did it.

tempo_distributor_ingester_append_failures_total{ingester="169.254.172.2:9095"} 38

The ingesters (maybe even the entire system) are registering to the ring with the internal container IP.
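What gives it away is that 169.254.0.0/16 is the IPv4 link-local range, which is not routable between hosts. A small sketch for flagging such ring addresses, assuming IPv4 "host:port" strings like the one in the metric above (the function name is made up for illustration):

```python
import ipaddress


def ring_address_is_link_local(address):
    """Return True if an IPv4 'host:port' ring entry uses a link-local IP."""
    host = address.rsplit(":", 1)[0]  # drop the port
    return ipaddress.ip_address(host).is_link_local


print(ring_address_is_link_local("169.254.172.2:9095"))  # True
```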

I had seen this before with Loki but forgot to copy this bit over to my Tempo config.
For anyone in the future running under Fargate 1.4.0 - you need to specify the ingester lifecycler like so:

ingester:
  lifecycler:
    # for Fargate 1.4.0 use eth1; use the default for other platforms
    interface_names: ["eth1"]

Oh nice!

Yeah, the Cortex ring code searches common interface names (eth0 and en0 by default) to determine which IP to publish in the ring. Good catch!
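Roughly, that lookup behaves like the sketch below. This is a simplified Python illustration, not the actual Cortex code (which is Go): it assumes a hypothetical mapping of interface name to IPv4 address and returns the address of the first configured name that exists. In the example, 169.254.172.2 is the address from this thread, while 10.0.1.23 is a made-up stand-in for the task's routable IP.

```python
# Defaults as described in the thread.
DEFAULT_INTERFACE_NAMES = ["eth0", "en0"]


def first_address_of(interface_names, interfaces):
    """Return the address of the first listed interface that has one.

    `interfaces` is a hypothetical {name: IPv4 address} mapping standing in
    for what the operating system would report.
    """
    for name in interface_names:
        address = interfaces.get(name)
        if address:
            return address
    raise LookupError("no address found for interfaces: %r" % (interface_names,))


# Assumed Fargate 1.4.0 layout: eth0 is the internal address, eth1 the routable one.
fargate = {"eth0": "169.254.172.2", "eth1": "10.0.1.23"}

print(first_address_of(DEFAULT_INTERFACE_NAMES, fargate))  # 169.254.172.2
print(first_address_of(["eth1"], fargate))                 # 10.0.1.23
```

With the defaults, eth0 wins and the internal address is published to the ring; setting interface_names to ["eth1"] makes the lookup skip it.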

Made this issue: “Document Fargate requirement” (Issue #622, grafana/tempo on GitHub)