Tempo distributed helm-chart configuration

Hi there,

We are trying to use Tempo, deployed with the tempo-distributed Helm chart. Although the Helm release installs successfully, I can see errors in the logs and we are not able to ingest traces.

distributor logs:

ts=2023-10-25T11:44:27.123569974Z caller=memberlist_logger.go:74 level=info msg="Marking grafana-tempo-distributor-58cc77f749-gfs2t-e678e20c as failed, suspect timeout reached (2 peer confirmations)"
level=warn ts=2023-10-25T11:44:30.786848372Z caller=tcp_transport.go:254 component="memberlist TCPTransport" msg="failed to read message type" err=EOF remote=10.68.3.57:38880
level=warn ts=2023-10-25T11:44:33.121271085Z caller=tcp_transport.go:438 component="memberlist TCPTransport" msg="WriteTo failed" addr=10.68.3.59:7946 err="dial tcp 10.68.3.59:7946: i/o timeout"
level=warn ts=2023-10-25T11:44:38.122348155Z caller=tcp_transport.go:438 component="memberlist TCPTransport" msg="WriteTo failed" addr=10.68.3.57:7946 err="dial tcp 10.68.3.57:7946: i/o timeout"
level=warn ts=2023-10-25T11:44:43.123416271Z caller=tcp_transport.go:438 component="memberlist TCPTransport" msg="WriteTo failed" addr=10.68.3.59:7946 err="dial tcp 10.68.3.59:7946: i/o timeout"
level=warn ts=2023-10-25T11:44:56.122244891Z caller=tcp_transport.go:438 component="memberlist TCPTransport" msg="WriteTo failed" addr=10.68.3.59:7946 err="dial tcp 10.68.3.59:7946: i/o timeout"

Can someone please advise on what might be wrong in the configuration?

apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: grafana-tempo
  namespace: tempo
spec:
  interval: 30m
  chart:
    spec:
      chart: tempo-distributed
      version: "~1"
      sourceRef:
        kind: HelmRepository
        name: grafana-charts
        namespace: tempo
  values:
    serviceAccount:
      name: sample-service
    multitenancy_enabled: false
    compactor:
      compaction:
        block_retention: 48h
      ring:
        kvstore:
          store: memberlist
    distributor:
      receivers:
        otlp:
          protocols:
            grpc:
    ingester:
      lifecycler:
        ring:
          replication_factor: 1
      persistence:
        size: 25Gi
        storageClass:
          storageClassName: regional-storage
    traces:
      otlp:
        grpc:
          enabled: true
    memberlist:
      abort_if_cluster_join_fails: false
    server:
      http_listen_port: 3100
    storage:
      trace:
        backend: gcs
        gcs:
          bucket_name: xxxxxxx
      pool:
        queue_depth: 2000
      wal:
        path: /var/tempo/wal
      memcached:
        consistent_hash: true
        host: xxx
        service: memcached-client
        timeout: 500ms

This error indicates that memberlist gossip is failing. The pods gossip ring state to each other on this port (7946), which is how the distributors become aware of the ingesters so they can route traffic to them.
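
For context, the memberlist section of the rendered Tempo config (the ConfigMap the chart generates) usually looks something like the sketch below. The gossip-ring service name is a guess based on the chart's conventions, so compare it against your actual ConfigMap:

memberlist:
  bind_port: 7946                   # the port that is failing in your logs
  abort_if_cluster_join_fails: false
  join_members:
  - grafana-tempo-gossip-ring:7946  # headless service the chart creates; exact name may differ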

Here are a few ideas:

  1. Check the cluster networking layer: is there a rule (a NetworkPolicy, firewall, or similar) that blocks this traffic between the pods? (A sample allow rule is sketched after this list.)
  2. Browse to localhost:3100/memberlist (the server.http_listen_port you configured; the Tempo default is 3200) on a few distributor and ingester pods - does each pod see any of the others? That helps determine whether the issue is isolated to a specific set of pods or affects all of them.
  3. [Purely for testing!] You can hard-code the ring members, which will tell you whether the problem is with connection or with discovery. The Tempo config YAML looks like this (I'm not sure exactly how this is exposed in the Helm chart):
memberlist:
  join_members:
  - dns+<pod>:7946       # repeat for each pod
  - dns+ingester-0:7946  # example
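
On idea 1: if your cluster runs a CNI that enforces NetworkPolicies and something is denying pod-to-pod traffic, an allow rule for the gossip port would look roughly like the sketch below. The policy name and pod labels are placeholders, so adjust them to match the labels on your Tempo pods:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-tempo-gossip           # placeholder name
  namespace: tempo
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: tempo  # placeholder; match your Tempo pods' labels
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app.kubernetes.io/name: tempo
    ports:
    - protocol: TCP
      port: 7946                     # memberlist gossip port seen in the logs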
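
It can also help to confirm that the headless gossip-ring Service exists and actually selects your pods, since that is what memberlist discovery resolves against. The manifest below is only a sketch of what to look for (the name and labels are guesses; compare with the Services and Endpoints in your namespace):

apiVersion: v1
kind: Service
metadata:
  name: grafana-tempo-gossip-ring   # guess; use the name that actually exists in your namespace
  namespace: tempo
spec:
  clusterIP: None                   # headless, so DNS returns the member pod IPs directly
  ports:
  - name: gossip-ring
    port: 7946
    protocol: TCP
    targetPort: 7946
  selector:
    app.kubernetes.io/name: tempo   # guess; must match the labels on your Tempo pods
    app.kubernetes.io/instance: grafana-tempo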