Distributor/ingester issue consuming traces

zswanson · March 28, 2021, 1:37am

Hello! I’ve been banging my head on this all day now. I’ve got a Tempo POC deployment set up in AWS under Fargate, based off my in-production Loki deployment. I have the Tempo distributor configured to receive Jaeger Thrift over HTTP. I’m using Grafana 7.5.0, so I started pulling ‘grafana/tempo:latest’ from Docker Hub after figuring out that the 0.6.0 release didn’t work with Grafana trace queries.

distributor:
  receivers:
    jaeger:
      protocols:
        thrift_http:
          endpoint: 0.0.0.0:14268

The ingester uses the basic stock config.

ingester:
  lifecycler:
    ring:
      replication_factor: 1
  trace_idle_period: 30s
  max_block_bytes: 1_000_000
  max_block_duration: 1h

Both have the same server block:

server:
  http_listen_address: 0.0.0.0
  grpc_listen_address: 0.0.0.0
  http_listen_port: 3100
  grpc_listen_port: 9095

And they’re using memberlist on port 7649. I’m using the docker-compose example to generate traces and throw them at my distributor, but all I’m getting is this error in the distributor logs:

level=error ts=2021-03-28T00:47:41.97961568Z caller=log.go:27 msg="pusher failed to consume trace data" err="rpc error: code = Unimplemented desc = unknown service tempopb.Pusher"

Following the troubleshooting page in the documentation, I checked the metrics: the distributor records a few hundred traces received, but all of them failed. The ingester shows no indication of any activity. All components run from a common config file, with the same ports exposed and allowed by the security group.

Anything else I can double check?

joeelliott · March 29, 2021, 12:28pm

That’s an odd error. It sounds like the distributors are connecting to a process that doesn’t implement the gRPC service.

Can you share the contents of your ingester ring? You can access it by going to http://<addr>/ingester/ring on the distributors.

zswanson · March 29, 2021, 3:09pm

Full config (values are templated in at runtime). The gRPC port is set to 9095 and the HTTP port to 3100, the log level is info, and each service gets its own TARGET env value passed in. These match the defaults, but I pull them in from Parameter Store so that the infrastructure (Terraform deployed) and the container config align. The endpoint discovery URLs are AWS Cloud Map DNS A records that the containers automatically register with. The Jaeger port is hardcoded at the moment because I was flipping between configs trying to figure out why the ingest wasn’t working.

target: {{ env.Getenv "TARGET" "all" }}

auth_enabled: false

server:
  http_listen_address: 0.0.0.0
  grpc_listen_address: 0.0.0.0
  http_listen_port: {{$http_port}}
  grpc_listen_port: {{ $grpc_port }}

  log_level: {{ $log_level }}

distributor:
  receivers:
    jaeger:
      protocols:
        thrift_http:
          endpoint: 0.0.0.0:14268

ingester:
  lifecycler:
    ring:
      replication_factor: 1
  # the length of time after a trace has not received spans to consider it complete and flush it
  trace_idle_period: 30s
  # cut the head block when it hits this size
  max_block_bytes: 1_000_000
  # or after this much time passes
  max_block_duration: 1h

query_frontend:
  query_shards: 10    # number of shards to split the query into

querier:
  frontend_worker:
    frontend_address: {{ $env_query_frontend_discovery }}:{{ $grpc_port }}

compactor:
  ring:
    kvstore:
      store: memberlist
  compaction:
    block_retention: {{ $block_retention }}   # Optional. Duration to keep blocks.  Default is 14 days (336h).
    compacted_block_retention: 1h       # Optional. Duration to keep blocks that have been compacted elsewhere
    compaction_window: 4h               # Optional. Blocks in this time window will be compacted together

storage:
  trace:
    backend: s3
    s3:
      bucket: {{ .Env.S3_BUCKET }}
      endpoint: s3.dualstack.{{ .Env.AWS_REGION }}.amazonaws.com
      region: {{ .Env.AWS_REGION }}
      insecure: false
    cache: redis
    redis:
      endpoint: {{ $env_redis_discovery }}:6379

memberlist:
  # A DNS entry that lists all tempo components
  join_members:
    - dns+{{ $env_compactor_discovery }}:{{ $memberlist_port }}
    - dns+{{ $env_distributor_discovery }}:{{ $memberlist_port }}
    - dns+{{ $env_ingester_discovery }}:{{ $memberlist_port }}
    - dns+{{ $env_querier_discovery }}:{{ $memberlist_port }}
    - dns+{{ $env_query_frontend_discovery }}:{{ $memberlist_port }}

joeelliott · March 29, 2021, 3:28pm

Nothing immediate seems wrong with your config or ingester ring. Some recommendations:

Confirm that the ingesters and distributors are running the same version.
Use a tool like grpcurl (fullstorydev/grpcurl on GitHub, "Like cURL, but for gRPC") to try to hit the ingesters directly using one of the RPC methods they should support; see tempo.proto on master in grafana/tempo.
Check this metric, defined in distributor.go at commit 52462ae0d47cdf5818a4daf36e4ac5f47e6bbf60 in grafana/tempo. It should show ingester push failures by ingester, which will help us narrow down which ingester is failing.

zswanson · March 29, 2021, 3:33pm

Oh man that did it.

tempo_distributor_ingester_append_failures_total{ingester="169.254.172.2:9095"} 38

The ingesters (maybe even the entire system) are registering to the ring with the internal container IP.
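
For anyone who wants to automate this check, here is a minimal sketch in Python. It assumes the distributor's metrics output has already been saved as text and that the counter carries only the `ingester` label, as in the line above. The function name `suspicious_ingesters` is made up for illustration. It flags ingesters whose ring address is link-local (169.254.0.0/16), which other tasks normally cannot reach.

```python
import ipaddress
import re

# Matches one sample line of the per-ingester failure counter quoted in the thread.
# Assumption: the only label present is `ingester`.
SAMPLE = re.compile(
    r'^tempo_distributor_ingester_append_failures_total'
    r'\{ingester="(?P<addr>[^"]+)"\}\s+(?P<value>\S+)$'
)


def suspicious_ingesters(metrics_text):
    """Return (address, failures) pairs whose IP is link-local (169.254.0.0/16)."""
    hits = []
    for line in metrics_text.splitlines():
        match = SAMPLE.match(line.strip())
        if not match:
            continue
        host = match.group("addr").rsplit(":", 1)[0]  # drop the :port suffix
        try:
            ip = ipaddress.ip_address(host)
        except ValueError:
            continue  # a hostname rather than an IP; nothing to classify
        if ip.is_link_local:
            hits.append((match.group("addr"), float(match.group("value"))))
    return hits


example = 'tempo_distributor_ingester_append_failures_total{ingester="169.254.172.2:9095"} 38'
print(suspicious_ingesters(example))
```

An empty list does not prove the ring is healthy; it only means no ingester is advertising a link-local address.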

I had seen this before with Loki but forgot to copy this bit over to my Tempo config.
For anyone in the future running under Fargate 1.4.0: you need to specify the ingester lifecycler like so:

  lifecycler:
    # for Fargate 1.4.0 use eth1; use the default for other platforms
    interface_names: ["eth1"]

joeelliott · March 29, 2021, 4:01pm

Oh nice!

Yeah the Cortex ring code searches common interface names (eth0 and en0 by default) to determine what IP to publish in the ring. Good catch!
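
To make the effect of `interface_names` concrete, here is a simplified model in Python. It is not the actual Cortex code, only a sketch of the "first preferred interface with an address wins" idea described above. The function `pick_ring_address`, the interface map, and the 10.0.1.23 address are all hypothetical, and it assumes the 169.254.172.2 address from the metric sits on eth0.

```python
def pick_ring_address(interfaces, preferred_names):
    """Return the first address on the first preferred interface that has one."""
    for name in preferred_names:
        addresses = interfaces.get(name, [])
        if addresses:
            return addresses[0]
    return None  # no preferred interface had an address


# Hypothetical Fargate 1.4.0 task: eth0 is assumed to carry the internal address
# seen in the metric; eth1 carries a made-up address that other tasks can reach.
fargate_task = {"eth0": ["169.254.172.2"], "eth1": ["10.0.1.23"]}

print(pick_ring_address(fargate_task, ["eth0", "en0"]))  # default search order
print(pick_ring_address(fargate_task, ["eth1"]))         # with interface_names: ["eth1"]
```

Under these assumptions the default search order publishes the internal address, while restricting the search to eth1 publishes the reachable one, which matches the behavior seen in this thread.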

Made this issue: Document Fargate requirement (issue #622 in grafana/tempo on GitHub)

system · March 29, 2022, 4:01pm

This topic was automatically closed 365 days after the last reply. New replies are no longer allowed.
