Hello! I’ve been banging my head on this all day. I’ve got a Tempo POC deployment set up in AWS under Fargate, based on my in-production Loki deployment. The Tempo distributor is configured to receive Jaeger Thrift over HTTP. I’m using Grafana 7.5.0, so I started pulling ‘grafana/tempo:latest’ from Docker Hub after figuring out that the 0.6.0 release didn’t work with Grafana trace queries.
The components all use memberlist on port 7649. I’m using the docker-compose example to generate traces and throw them at my distributor (the generator setup is sketched just after the error), but all I’m getting is this error in the distributor logs:
level=error ts=2021-03-28T00:47:41.97961568Z caller=log.go:27 msg="pusher failed to consume trace data" err="rpc error: code = Unimplemented desc = unknown service tempopb.Pusher"
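For reference, the generator is basically the one from the Tempo repo’s docker-compose example, just pointed at the distributor’s Cloud Map name over thrift_http - something like this (image tag and topology file may not match the current repo exactly):

synthetic-load-generator:
  image: omnition/synthetic-load-generator:1.0.25
  volumes:
    - ./load-generator.json:/etc/load-generator.json
  environment:
    - TOPOLOGY_FILE=/etc/load-generator.json
    # distributor's Cloud Map DNS name, hitting the Jaeger thrift_http receiver on 14268
    - JAEGER_COLLECTOR_URL=http://<distributor-cloudmap-dns>:14268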
Following the troubleshooting page in the documentation, I checked the metrics: the distributor records a few hundred traces received, but all of them failed. The ingester shows no sign of any activity. All of the components run from a common config file, with the same ports exposed and allowed through the security group.
Full config below (values are templated in at runtime). The gRPC port is set to 9095 and the HTTP port to 3100, the log level is info, and each service gets its own TARGET env value passed in (there’s a stripped-down sketch of that right after the config). These match the defaults, but I pull them from Parameter Store so that the Terraform-deployed infrastructure and the container config stay aligned. The endpoint discovery URLs are AWS Cloud Map DNS A records that the containers automatically register with. The Jaeger port is hardcoded at the moment because I was flipping between configs trying to figure out why ingest wasn’t working.
target: {{ env.Getenv "TARGET" "all" }}
auth_enabled: false

server:
  http_listen_address: 0.0.0.0
  grpc_listen_address: 0.0.0.0
  http_listen_port: {{ $http_port }}
  grpc_listen_port: {{ $grpc_port }}
  log_level: {{ $log_level }}

distributor:
  receivers:
    jaeger:
      protocols:
        thrift_http:
          endpoint: 0.0.0.0:14268

ingester:
  lifecycler:
    ring:
      replication_factor: 1
  # the length of time after a trace has not received spans to consider it complete and flush it
  trace_idle_period: 30s
  # cut the head block when it hits this size
  max_block_bytes: 1_000_000
  # or after this much time passes
  max_block_duration: 1h

query_frontend:
  query_shards: 10 # number of shards to split the query into

querier:
  frontend_worker:
    frontend_address: {{ $env_query_frontend_discovery }}:{{ $grpc_port }}

compactor:
  ring:
    kvstore:
      store: memberlist
  compaction:
    block_retention: {{ $block_retention }} # Optional. Duration to keep blocks. Default is 14 days (336h).
    compacted_block_retention: 1h # Optional. Duration to keep blocks that have been compacted elsewhere
    compaction_window: 4h # Optional. Blocks in this time window will be compacted together

storage:
  trace:
    backend: s3
    s3:
      bucket: {{ .Env.S3_BUCKET }}
      endpoint: s3.dualstack.{{ .Env.AWS_REGION }}.amazonaws.com
      region: {{ .Env.AWS_REGION }}
      insecure: false
    cache: redis
    redis:
      endpoint: {{ $env_redis_discovery }}:6379

memberlist:
  # A DNS entry that lists all tempo components
  join_members:
    - dns+{{ $env_compactor_discovery }}:{{ $memberlist_port }}
    - dns+{{ $env_distributor_discovery }}:{{ $memberlist_port }}
    - dns+{{ $env_ingester_discovery }}:{{ $memberlist_port }}
    - dns+{{ $env_querier_discovery }}:{{ $memberlist_port }}
    - dns+{{ $env_query_frontend_discovery }}:{{ $memberlist_port }}
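For clarity on the TARGET-per-service bit: every component runs the exact same config file, and only TARGET differs per ECS service. Stripped down to a compose-style sketch (the real deployment renders the gomplate template at container start and pulls the values from Parameter Store, so this is illustrative, not my actual task definitions):

services:
  distributor:
    image: grafana/tempo:latest
    command: ["-config.file=/etc/tempo/tempo.yaml"]
    environment:
      - TARGET=distributor
  ingester:
    image: grafana/tempo:latest
    command: ["-config.file=/etc/tempo/tempo.yaml"]
    environment:
      - TARGET=ingester
  # ...and likewise for querier, query-frontend, and compactor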
The ingesters (maybe even the entire system) were registering to the ring with the internal container IP. I had seen this before with Loki but forgot to copy that bit over to my Tempo config.
For anyone in the future running under Fargate 1.4.0 - you need to specify the ingester lifecycler like so:
lifecycler:
  # for Fargate 1.4.0 use eth1; use the default for other platforms
  interface_names: ["eth1"]
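In context, that makes the ingester section from the config above:

ingester:
  lifecycler:
    # for Fargate 1.4.0 use eth1; use the default for other platforms
    interface_names: ["eth1"]
    ring:
      replication_factor: 1
  trace_idle_period: 30s
  max_block_bytes: 1_000_000
  max_block_duration: 1h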