How to create a gossip ring for Tempo

Hi

I am trying to run tempo-distributed in kubernetes but getting the following error messages

"Failed to resolve tempo-distributed-gossip-ring: lookup tempo-distributed-gossip-ring on 10.96.0.10:53: no such host"                                │
"Failed to resolve tempo-distributed-gossip-ring: lookup tempo-distributed-gossip-ring on 10.96.0.10:53: server misbehaving"
msg="pusher failed to consume trace data" err="DoBatch: InstancesCount <= 0"

I have been reading, and the reason is that the distributor and ingester can’t communicate: they need to be in the same gossip ring. There is an issue related to this on this forum, but they are using etcd as the KV store, which is not my case.

Is this gossip ring new to tempo? I don’t remember having this issue in the past.

The question is:

How can I create a gossip ring for Tempo? Is there an article showing the steps, because I am at a loss?

My config file looks like this:

multitenancy_enabled: false
compactor:
  compaction:
    block_retention: 48h
  ring:
    kvstore:
      store: memberlist
distributor:
  ring:
    kvstore:
      store: memberlist
  receivers:
    zipkin:
      endpoint: 0.0.0.0:9411
querier:
  frontend_worker:
    frontend_address: tempo-distributed-query-frontend-discovery:9095
ingester:
  lifecycler:
    ring:
      replication_factor: 1
      kvstore:
        store: memberlist
    tokens_file_path: /var/tempo/tokens.json
memberlist:
  abort_if_cluster_join_fails: false
  join_members:
    - tempo-distributed-gossip-ring
overrides:
  per_tenant_override_config: /conf/overrides.yaml
server:
  http_listen_port: 3100
storage:
  trace:
    azure:
 .....

The gossip ring is not new at all; the implementation hasn’t changed recently AFAIK. It’s indeed used by memberlist to ensure all components can find each other, and it allows them to load-balance requests / shard work between each other.
This page has more information about the different rings: Consistent Hash Ring | Grafana Labs

How are you deploying Tempo? From the generated names I’m guessing you are using the Helm chart and deploying to Kubernetes?
If so, the gossip ring is backed by a headless Kubernetes service (it’s probably called tempo-distributed-gossip-ring). Is this service listed with kubectl get svc?
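If you want to double-check from inside the cluster, something along these lines should work (the service name and the `tempo` namespace are assumptions based on the default chart values):

```shell
# Confirm the headless gossip service exists.
kubectl get svc tempo-distributed-gossip-ring -n tempo

# Resolve it from a throwaway pod; a headless service should return the
# pod IPs of all ring members instead of a single cluster IP.
kubectl run dns-test --rm -it --restart=Never --image=busybox -n tempo -- \
  nslookup tempo-distributed-gossip-ring
```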

I am using the Helm chart for deployment in Kubernetes. The service is listed:

level=info ts=2021-09-28T23:31:21.198907424Z caller=app.go:251 msg="Tempo started"
ts=2021-09-28T23:31:21.207220866Z caller=memberlist_logger.go:74 level=warn msg="Failed to resolve tempo-distributed-gossip-ring: lookup tempo-distributed-gossip-ring on 10.0.0.10:53: no such host"
ts=2021-09-28T23:31:22.814506067Z caller=memberlist_logger.go:74 level=warn msg="Failed to resolve tempo-distributed-gossip-ring: lookup tempo-distributed-gossip-ring on 10.0.0.10:53: no such host"
ts=2021-09-28T23:31:24.924130239Z caller=memberlist_logger.go:74 level=warn msg="Failed to resolve tempo-distributed-gossip-ring: lookup tempo-distributed-gossip-ring on 10.0.0.10:53: no such host"
level=info ts=2021-09-28T23:31:31.894764221Z caller=memberlist_client.go:533 msg="joined memberlist cluster" reached_nodes=4
^C
 % k get svc -n tempo
NAME                                         TYPE           CLUSTER-IP     EXTERNAL-IP    PORT(S)                                 AGE
ingress-nginx-controller                     LoadBalancer   10.0.158.39    20.84.31.169   80:32640/TCP,443:30007/TCP              17h
ingress-nginx-controller-admission           ClusterIP      10.0.102.106   <none>         443/TCP                                 17h
tempo-distributed-compactor                  ClusterIP      10.0.199.180   <none>         3100/TCP                                17h
tempo-distributed-distributor                ClusterIP      10.0.11.222    <none>         3100/TCP,9095/TCP,9411/TCP              17h
tempo-distributed-gossip-ring                ClusterIP      None           <none>         7946/TCP                                17h
tempo-distributed-ingester                   ClusterIP      10.0.72.71     <none>         3100/TCP,9095/TCP                       17h
tempo-distributed-memcached                  ClusterIP      10.0.149.244   <none>         11211/TCP,9150/TCP                      17h
tempo-distributed-querier                    ClusterIP      10.0.59.172    <none>         3100/TCP,9095/TCP                       17h
tempo-distributed-query-frontend             ClusterIP      10.0.186.182   <none>         3100/TCP,9095/TCP,16686/TCP,16687/TCP   17h
tempo-distributed-query-frontend-discovery   ClusterIP      None           <none>         3100/TCP,9095/TCP,16686/TCP,16687/TCP   17h

So your config and the services look alright. The headless tempo-distributed-gossip-ring service is used for the gossip ring.


That last log line:

msg="joined memberlist cluster" reached_nodes=4

seems to indicate that this component eventually managed to join memberlist. Does pusher failed to consume trace data still appear?


To get an overview of memberlist you can check out /memberlist or one of the ring endpoints. /memberlist should list the other members, i.e. all distributors, ingesters and queriers.
See API documentation | Grafana Labs

To visit this page I usually set up a port-forward:

kc port-forward tempo-tempo-distributed-distributor-... 3100:3100

And then visit http://localhost:3100/memberlist
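If you prefer the terminal, the same pages can be fetched with curl (the second path assumes the distributor’s /ingester/ring endpoint from the API docs):

```shell
# With the port-forward active:
curl -s http://localhost:3100/memberlist
curl -s http://localhost:3100/ingester/ring
```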

Yeah, I did what you suggested and the memberlist looks right. It eventually settles and it works now, thanks. We had a network issue that caused problems querying the traces, but the ring is fine.

By the way, I know that Loki has an nginx gateway (ingress controller) for basic auth as part of its deployment. Do we have something similar for Tempo?

Thank you.

Great!

We don’t right now. We have an open issue to document the use of a gateway and this user shared their setup already: Document how to deploy Tempo to ingest traces from multiple clusters · Issue #977 · grafana/tempo · GitHub
If you get a good setup feel free to share your experience as well! In Grafana Cloud we use a custom gateway that is closely tied into our auth infrastructure, so it doesn’t make sense to open-source it.

It is good to have this kind of document. I can see it is at an early stage, as it only covers routing with an ingress, but it still needs basic auth. I started to work on it; at first glance it seemed not that difficult using this nginx ingress controller setup, but it turns out that Grafana has trouble reaching the tempo-distributed-query-frontend service.
In Kubernetes I deployed the following Ingress resource behind an nginx controller:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    kubernetes.io/ingress.class: nginx
    nginx.ingress.kubernetes.io/auth-secret: grafana-tempo-auth
    nginx.ingress.kubernetes.io/auth-type: basic
  name: tempo-ingress
  namespace: tempo
spec:
  rules:
  - host: tempo.mycompany.dev
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: tempo-distributed-query-frontend
            port: 
              number: 3100
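For completeness, the grafana-tempo-auth secret referenced by the auth-secret annotation can be created from an htpasswd-style file; `tempo-user` / `s3cret` below are placeholder credentials:

```shell
# Generate an htpasswd-style entry; apr1 (MD5) is a scheme nginx's
# basic auth module understands.
printf 'tempo-user:%s\n' "$(openssl passwd -apr1 's3cret')" > auth
cat auth
```

Then store it with `kubectl create secret generic grafana-tempo-auth --from-file=auth -n tempo` (the key must be named `auth` for the nginx ingress controller).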

The nginx ingress controller is doing its job:

curl -v tempo.mycompany.dev                                                                                             
*   Trying 52.##.##.##...
* TCP_NODELAY set
* Connected to tempo.mycompany.dev (52.##.##.##) port 80 (#0)
> GET / HTTP/1.1
> Host: tempo.mycompany.dev
> User-Agent: curl/7.64.1
> Accept: */*
> 
< HTTP/1.1 401 Unauthorized
< Date: Sun, 03 Oct 2021 14:31:30 GMT
< Content-Type: text/html
< Content-Length: 172
< Connection: keep-alive
< WWW-Authenticate: Basic realm=""
< 
<html>
<head><title>401 Authorization Required</title></head>
<body>
<center><h1>401 Authorization Required</h1></center>
<hr><center>nginx</center>
</body>
</html>
* Connection #0 to host tempo.mycompany.dev left intact
* Closing connection 0

Once you enter credentials it lets you in; I’m not sure if getting a 404 from tempo-distributed-query-frontend is okay though.

curl -v tempo.mycompany.dev -u "user:password"
*   Trying 52.##.##.##...
* TCP_NODELAY set
* Connected to tempo.mycompany.dev (52.##.##.##) port 80 (#0)
* Server auth using Basic with user 'user'
> GET / HTTP/1.1
> Host: tempo.mycompany.dev
> Authorization: Basic MDUyZTdkODQtYjA2ZC00OGFjLWJhMzctZGE4YTM0MmQ4NGM3OmVReTZmNjR0Z==
> User-Agent: curl/7.64.1
> Accept: */*
> 
< HTTP/1.1 404 Not Found
< Date: Sun, 03 Oct 2021 19:43:58 GMT
< Content-Type: text/plain; charset=utf-8
< Content-Length: 19
< Connection: keep-alive
< X-Content-Type-Options: nosniff
< 
404 page not found
* Connection #0 to host tempo.mycompany.dev left intact
* Closing connection 0

However, when I create a datasource in Grafana and point it to the nginx ingress controller, the test passes, but when I query for a trace I get the same 404 as with the curl command.

You said you are using a custom gateway for Tempo. Are you using nginx? Is it possible to share part of the configuration, especially how to grant access to the tempo-distributed-query-frontend service?

Yeah, that’s expected: there is nothing at /. I recommend using /api/echo to verify the query-frontend is reachable. To query a trace use /api/traces/&lt;traceID&gt;.
Unless you have configured http_api_prefix; in that case it will be /&lt;prefix&gt;/api/echo.
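For example, through the ingress from earlier (credentials, host and trace ID are placeholders):

```shell
# Verify the query-frontend is reachable through the ingress:
curl -u "user:password" http://tempo.mycompany.dev/api/echo

# Look up a trace by ID (the hex ID here is a placeholder):
curl -u "user:password" http://tempo.mycompany.dev/api/traces/2f3e0cee77ae5dc9c17ade3689eb2e54
```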

This 404 page not found is weird though: Tempo usually answers with 404 trace not found. I think something else is returning this 404.

It’s a custom Go app, not based upon nginx or anything else. It accepts requests on a limited amount of paths, verifies authentication, sets the X-Scope-OrgID header (only needed if you run Tempo multitenant) and passes the request to either the distributor or the query-frontend (depending on the path).

Our gateway only allows:

  • GET /api/echo
  • GET /api/traces/{traceID}
  • POST /opentelemetry.proto.collector.trace.v1.TraceService/Export (OTLP gRPC)

We also have Envoy running between our gateway and the distributors to ensure load balancing is gRPC-aware. The default Go load balancer does round robin per connection, which isn’t great for gRPC streams. This is explained in a bit more detail here: gRPC Load Balancing | gRPC
This is only necessary if you are working with gRPC, of course.
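To give an idea of what gRPC-aware proxying can look like, here is a minimal Envoy config sketch, not our actual gateway config; the listener port, the distributor service name and the OTLP port 4317 are assumptions you would adjust to your deployment:

```yaml
static_resources:
  listeners:
    - name: otlp_grpc
      address:
        socket_address: { address: 0.0.0.0, port_value: 4317 }
      filter_chains:
        - filters:
            - name: envoy.filters.network.http_connection_manager
              typed_config:
                "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
                stat_prefix: otlp
                codec_type: AUTO
                route_config:
                  name: local_route
                  virtual_hosts:
                    - name: distributors
                      domains: ["*"]
                      routes:
                        - match: { prefix: "/" }
                          route: { cluster: distributors }
                http_filters:
                  - name: envoy.filters.http.router
                    typed_config:
                      "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
  clusters:
    - name: distributors
      type: STRICT_DNS          # resolves every pod IP behind a headless service
      lb_policy: ROUND_ROBIN    # Envoy balances per request, not per connection
      typed_extension_protocol_options:
        envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
          "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
          explicit_http_config:
            http2_protocol_options: {}   # gRPC requires HTTP/2 upstream
      load_assignment:
        cluster_name: distributors
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address:
                      address: tempo-distributed-distributor
                      port_value: 4317
```

The point is the combination of STRICT_DNS (so Envoy sees all distributor pods) with HTTP/2-aware, per-request load balancing, which avoids pinning all gRPC streams to a single pod.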