Can't get tracing with Grafana Agent to work

Hello.

As there does not seem to be a dedicated Grafana Agent section, I’m posting here.

I am evaluating the Grafana Stack for observability and I’m stuck trying to get traces shipped through Grafana Agent to work. Sending my traces directly to Tempo works.

My setup:
Everything is deployed on Kubernetes, mostly using the Grafana helm chart. Single node Tempo and Loki. Grafana Agent is deployed “manually” using some examples I found on GitHub. Grafana Agent works as expected for metrics and logs. I am using this demo app from Jaeger to produce traces.

I can go into more detail if needed. The very short version: when I set the env vars for the demo app to point to the Tempo service object, tracing works fine:

      JAEGER_AGENT_HOST:  tempo.tempo
      JAEGER_AGENT_PORT:  6831

With Grafana Agent, I assumed the same port would support the same format by default, so I tried that port. I also tried other Jaeger ports.
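For reference, this is roughly how I point the demo app at the agent instead (a sketch; the service/namespace names come from my manifests further down, adjust as needed):

```yaml
# HotROD demo app env, pointed at the Grafana Agent Service instead of Tempo.
# The "grafana-agent.grafana-agent" address is from my setup, not a default.
env:
  - name: JAEGER_AGENT_HOST
    value: grafana-agent.grafana-agent   # <service>.<namespace>
  - name: JAEGER_AGENT_PORT
    value: "6831"                        # Jaeger thrift_compact over UDP
```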

Both services listen on similar ports:

$ k get service -n tempo
NAME    TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)                                                                                                    AGE
tempo   ClusterIP   172.20.45.89   <none>        3100/TCP,16686/TCP,6831/UDP,6832/UDP,14268/TCP,14250/TCP,9411/TCP,55680/TCP,55681/TCP,4317/TCP,55678/TCP   16d
$ k get service -n grafana-agent
NAME                   TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                                                                       AGE
grafana-agent          ClusterIP   172.20.188.115   <none>        8080/TCP,6831/UDP,6832/UDP,14268/TCP,14250/TCP,9411/TCP,55680/TCP,55678/TCP   2d18h

Yes, they are in different namespaces. I have checked network policies, and I have run the demo app in different namespaces as well. Sending to Tempo always works; shipping to Grafana Agent is like shipping to a black hole.
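For completeness, the kind of NetworkPolicy I was double-checking looks roughly like this (a sketch only; the pod labels and policy name are assumptions, not my actual company policy):

```yaml
# Sketch: allow ingress to the agent's Jaeger UDP ports.
# podSelector labels and policy name are illustrative.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-jaeger-ingress
  namespace: grafana-agent
spec:
  podSelector:
    matchLabels:
      app: grafana-agent
  ingress:
    - ports:
        - protocol: UDP
          port: 6831    # thrift_compact
        - protocol: UDP
          port: 6832    # thrift_binary
```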

This is my config:

$ k describe configmap grafana-agent -n grafana-agent
Name:         grafana-agent
Namespace:    grafana-agent
Labels:       <none>
Annotations:
Data
====
agent.yaml:
----
server:
    http_listen_port: 8080
    log_level: debug

metrics:
  wal_directory: /tmp/wal
  global:
    scrape_interval: 1m
    remote_write:
      - url: http://victoriametrics.victoriametrics:8428/api/v1/write
  configs:
    - name: default
      scrape_configs:
      - job_name: kubernetes_pods
        kubernetes_sd_configs:
          - role: pod
            selectors:
            - role: "pod"
              label: "metrics=grafana-agent"
logs:
  configs:
  - name: default
    positions:
      filename: /tmp/positions_traces.yaml
    clients:
      - url: http://loki.loki:3100/loki/api/v1/push

traces:
  configs:
  - name: default
    receivers:
      jaeger:
        protocols:
          grpc:
          thrift_compact:
    remote_write:
      - endpoint: tempo.tempo:55680
        insecure: true
    batch:
      timeout: 5s
      send_batch_size: 100
    automatic_logging:
      backend: logs_instance
      logs_instance_name: default
      spans: true
      processes: true
      roots: true

Events:  <none>

I have tried to follow the troubleshooting guide (which is out of date, but still).

Based on the metrics I can see, the agent is not receiving any traces.

I’m running out of ideas for what to try next…

All advice is welcome :slight_smile:

Hi!

I’ve run the demo with docker-compose using the provided config and it worked for me. Any further details on how you’ve deployed the Agent in Kubernetes would help in understanding the issue.

Are you getting any error logs from the Agent or the HotROD app? The HotROD app has a metrics endpoint; have you checked the group of metrics under the route.jaeger.tracer.reporter_spans namespace?

There is a quickstart guide for the Agent in k8s for a tracing setup, in case you want to check it out.

From what you’ve shared, the Agent seems to be correctly configured. It may also be worth checking whether the needed ports are open on the Agent’s container.
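For example, for the receiver protocols enabled in your agent.yaml, the container spec would need something like this (a sketch; the port names are illustrative):

```yaml
# Sketch: container ports matching the jaeger receiver protocols
# (grpc and thrift_compact) from the traces config.
ports:
  - containerPort: 14250      # jaeger grpc
    name: jaeger-grpc
    protocol: TCP
  - containerPort: 6831       # jaeger thrift_compact
    name: jaeger-compact
    protocol: UDP
```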

Also, thanks for letting us know about the docs being out-of-date. I’ve just seen another couple of places where some links haven’t been updated either.

Hi @mariorodriguez ,

thank you very much for your reply :slight_smile:

These are the HotROD app metrics after redeploying (to zero the counters) and making a few requests. To me it all looks right?

"route.jaeger.tracer.baggage_restrictions_updates.result_err": 0,
"route.jaeger.tracer.baggage_restrictions_updates.result_ok": 0,
"route.jaeger.tracer.baggage_truncations": 0,
"route.jaeger.tracer.baggage_updates.result_err": 0,
"route.jaeger.tracer.baggage_updates.result_ok": 0,
"route.jaeger.tracer.finished_spans.sampled_delayed": 0,
"route.jaeger.tracer.finished_spans.sampled_n": 0,
"route.jaeger.tracer.finished_spans.sampled_y": 352,
"route.jaeger.tracer.reporter_queue_length": 0,
"route.jaeger.tracer.reporter_spans.result_dropped": 0,
"route.jaeger.tracer.reporter_spans.result_err": 0,
"route.jaeger.tracer.reporter_spans.result_ok": 352,
"route.jaeger.tracer.sampler_queries.result_err": 0,
"route.jaeger.tracer.sampler_queries.result_ok": 0,
"route.jaeger.tracer.sampler_updates.result_err": 0,
"route.jaeger.tracer.sampler_updates.result_ok": 0,
"route.jaeger.tracer.span_context_decoding_errors": 0,
"route.jaeger.tracer.started_spans.sampled_delayed": 0,
"route.jaeger.tracer.started_spans.sampled_n": 0,
"route.jaeger.tracer.started_spans.sampled_y": 353,
"route.jaeger.tracer.throttled_debug_spans": 0,
"route.jaeger.tracer.throttler_updates.result_err": 0,
"route.jaeger.tracer.throttler_updates.result_ok": 0,
"route.jaeger.tracer.traces.sampled_n.state_joined": 0,
"route.jaeger.tracer.traces.sampled_n.state_started": 0,
"route.jaeger.tracer.traces.sampled_y.state_joined": 350,
"route.jaeger.tracer.traces.sampled_y.state_started": 3,

My Grafana Agent deployment is done like this:

---
apiVersion: v1
data:
  agent.yaml: |
    server:
        http_listen_port: 8080
        log_level: debug

    metrics:
      wal_directory: /tmp/wal
      global:
        scrape_interval: 1m
        remote_write:
          - url: http://victoriametrics.victoriametrics:8428/api/v1/write
      configs:
        - name: default
          scrape_configs:
          - job_name: kubernetes_pods
            kubernetes_sd_configs:
              - role: pod
                selectors:
                - role: "pod"
                  label: "metrics=grafana-agent"
    logs:
      configs:
      - name: default
        positions:
          filename: /tmp/positions_traces.yaml
        clients:
          - url: http://loki.loki:3100/loki/api/v1/push

    traces:
      configs:
      - name: default
        receivers:
          jaeger:
            protocols:
              grpc:
              thrift_compact:
        remote_write:
          - endpoint: tempo.tempo:55680
            insecure: true
        batch:
          timeout: 5s
          send_batch_size: 100
        automatic_logging:
          backend: logs_instance
          logs_instance_name: default
          spans: true
          processes: true
          roots: true

kind: ConfigMap
metadata:
  name: grafana-agent-full
  namespace: grafana-agent
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: grafana-agent-full
  namespace: grafana-agent
  labels:
    io.kompose.service: hotrod
    metrics: "grafana-agent"
    logs: "grafana-agent"
    traces: "grafana-agent"
spec:
  replicas: 1
  selector:
    matchLabels:
      app: grafana-agent
  template:
    metadata:
      labels:
        app: grafana-agent
        metrics: "grafana-agent"
        logs: "grafana-agent"
        traces: "grafana-agent"
    spec:
      containers:
      - args:
        - -config.file=/etc/agent/agent.yaml
        env:
        - name: HOSTNAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        image: grafana/agent:v0.19.0
        imagePullPolicy: IfNotPresent
        name: agent
        ports:
        - containerPort: 8080
          name: http-metrics
        - containerPort: 6831
          name: t-j-t-compact
          protocol: UDP
        - containerPort: 6832
          name: t-j-t-binary
          protocol: UDP
        - containerPort: 14268
          name: t-j-t-http
          protocol: TCP
        - containerPort: 14250
          name: t-j-grpc
          protocol: TCP
        - containerPort: 9411
          name: tempo-zipkin
          protocol: TCP
        - containerPort: 55680
          name: tempo-otlp
          protocol: TCP
        - containerPort: 55678
          name: t-opencensus
          protocol: TCP
        volumeMounts:
        - mountPath: /etc/agent
          name: grafana-agent-full
      serviceAccount: grafana-agent-logs
      tolerations:
      - effect: NoSchedule
        operator: Exists
      volumes:
      - configMap:
          name: grafana-agent-full
        name: grafana-agent-full
---
apiVersion: v1
kind: Service
metadata:
  name: grafana-agent-full
  namespace: grafana-agent
  labels:
    name: grafana-agent-full
spec:
  ports:
  - name: agent-http-metrics
    port: 8080
    targetPort: 8080
  - name: agent-t-j-t-compact
    port: 6831
    protocol: UDP
    targetPort: 6831
  - name: agent-t-j-t-binary
    port: 6832
    protocol: UDP
    targetPort: 6832
  - name: agent-t-j-t-http
    port: 14268
    protocol: TCP
    targetPort: 14268
  - name: agent-t-j-grpc
    port: 14250
    protocol: TCP
    targetPort: 14250
  - name: agent-tempo-zipkin
    port: 9411
    protocol: TCP
    targetPort: 9411
  - name: agent-tempo-otlp
    port: 55680
    protocol: TCP
    targetPort: 55680
  - name: agent-t-opencensus
    port: 55678
    protocol: TCP
    targetPort: 55678
  selector:
    name: grafana-agent-full

The name I actually use for the service and deployment is grafana-agent-full. In my original post I had changed that to grafana-agent.

This is the startup log of the Grafana Agent pod:

$ k logs grafana-agent-full-79f78594bf-2n9fr -n grafana-agent
ts=2021-10-22T11:02:04.385735436Z caller=node.go:85 level=info agent=prometheus component=cluster msg="applying config"
ts=2021-10-22T11:02:04.386045145Z caller=remote.go:180 level=info agent=prometheus component=cluster msg="not watching the KV, none set"
ts=2021-10-22T11:02:04.386639052Z caller=config_watcher.go:135 level=debug agent=prometheus component=cluster msg="waiting for next reshard interval" last_reshard=2021-10-22T11:02:04.386619244Z next_reshard=2021-10-22T11:03:04.386619244Z remaining=59.999998779s
ts=2021-10-22T11:02:04Z level=info caller=traces/traces.go:120 msg="Traces Logger Initialized" component=traces
ts=2021-10-22T11:02:04Z level=info caller=traces/instance.go:122 msg="shutting down receiver" component=traces traces_config=default
ts=2021-10-22T11:02:04Z level=info caller=traces/instance.go:122 msg="shutting down processors" component=traces traces_config=default
ts=2021-10-22T11:02:04Z level=info caller=traces/instance.go:122 msg="shutting down exporters" component=traces traces_config=default
ts=2021-10-22T11:02:04.389982673Z caller=instance.go:301 level=debug agent=prometheus instance=9b6cec8990db03140ef4948dfc33097f msg="initializing instance" name=9b6cec8990db03140ef4948dfc33097f
ts=2021-10-22T11:02:04Z level=info caller=builder/exporters_builder.go:266 msg="Exporter was built." component=traces traces_config=default kind=exporter name=otlp/0
ts=2021-10-22T11:02:04Z level=info caller=builder/exporters_builder.go:93 msg="Exporter is starting..." component=traces traces_config=default kind=exporter name=otlp/0
ts=2021-10-22T11:02:04Z level=info caller=builder/exporters_builder.go:98 msg="Exporter started." component=traces traces_config=default kind=exporter name=otlp/0
ts=2021-10-22T11:02:04Z level=info caller=builder/pipelines_builder.go:207 msg="Pipeline was built." component=traces traces_config=default pipeline_name=traces pipeline_datatype=traces
ts=2021-10-22T11:02:04Z level=info caller=builder/pipelines_builder.go:52 msg="Pipeline is starting..." component=traces traces_config=default pipeline_name=traces pipeline_datatype=traces
ts=2021-10-22T11:02:04Z level=info caller=builder/pipelines_builder.go:63 msg="Pipeline is started." component=traces traces_config=default pipeline_name=traces pipeline_datatype=traces
ts=2021-10-22T11:02:04Z level=info caller=builder/receivers_builder.go:231 msg="Receiver was built." component=traces traces_config=default kind=receiver name=jaeger datatype=traces
ts=2021-10-22T11:02:04Z level=info caller=builder/receivers_builder.go:71 msg="Receiver is starting..." component=traces traces_config=default kind=receiver name=jaeger
ts=2021-10-22T11:02:04Z level=info caller=static/strategy_store.go:201 msg="No sampling strategies provided or URL is unavailable, using defaults" component=traces traces_config=default kind=receiver name=jaeger
ts=2021-10-22T11:02:04Z level=info caller=builder/receivers_builder.go:76 msg="Receiver started." component=traces traces_config=default kind=receiver name=jaeger
ts=2021-10-22T11:02:04.395934568Z caller=manager.go:208 level=debug msg="Applying integrations config changes"
ts=2021-10-22T11:02:04.397510327Z caller=server.go:77 level=info msg="server configuration changed, restarting server"
ts=2021-10-22T11:02:04.399925501Z caller=gokit.go:47 level=info http=[::]:8080 grpc=[::]:9095 msg="server listening on addresses"

Using netcat, I have tried to verify that the port is open (spinning up a temporary troubleshooting container):

$ kubectl run tmp-shell -n grafana-agent --rm -i --tty --image nicolaka/netshoot -- /bin/bash
If you don't see a command prompt, try pressing enter.
bash-5.1# nc -v -z -u grafana-agent-full.grafana-agent 6831
Connection to grafana-agent-full.grafana-agent 6831 port [udp/*] succeeded!

I’m new to Kubernetes so I would not be surprised if this is some simple n00b thing that I have missed…

Cheers!

This is very possibly a network policy issue :grimacing:

I can’t curl the Grafana Agent metrics endpoint from within the same namespace. But I can from another namespace :man_facepalming:

I’ll report back once I have this working (curling the metrics endpoint…)

After allowing network traffic within my namespace and actually getting the Service to select the correct pods for endpoints, Grafana Agent is now at least receiving the traces :slight_smile:
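For anyone hitting the same thing: my Service selector (name: grafana-agent-full) did not match the labels on the Deployment’s pod template (app: grafana-agent), so the Service had no endpoints. A sketch of the fix, making the selector agree with the pod labels:

```yaml
# The Service selector must match the Deployment's pod template labels,
# otherwise the Service gets no endpoints and traffic goes nowhere.
spec:
  selector:
    app: grafana-agent
```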

Based on the exposed Prometheus metrics:

# curl -s http://grafana-agent-full.grafana-agent:8080/metrics | grep traces_ | grep -v "#"
traces_exporter_queue_size{exporter="otlp/0",traces_config="default"} 0
traces_exporter_send_failed_spans{exporter="otlp/0",traces_config="default"} 721
traces_exporter_sent_spans{exporter="otlp/0",traces_config="default"} 0
traces_receiver_accepted_spans{receiver="jaeger",traces_config="default",transport="udp_thrift_compact"} 721
traces_receiver_refused_spans{receiver="jaeger",traces_config="default",transport="udp_thrift_compact"} 0

Now I just have to find out what the problem is with sending the traces to Tempo… And it looks like I’m having similar issues sending the automatic_logging messages to Loki…

Probably more network policy issues. I will report back.

1 Like

Looks like I have most of it working now :slight_smile:

@mariorodriguez (or anyone else…)

Have you seen an error like this before? This is from the Grafana Agent logs:

"ts=2021-10-22T13:38:48Z level=error caller=exporterhelper/queued_retry.go:288 msg=\"Exporting failed. The error is not retryable. Dropping data.\" component=traces traces_config=default kind=exporter name=otlp/0 error=\"failed to push trace data via OTLP exporter: Permanent error: rpc error: code = Unimplemented desc = unknown service opentelemetry.proto.collector.trace.v1.TraceService\" dropped_items=62\n"

I have now changed the traces config to:

    traces:
      configs:
      - name: default
        receivers:
          jaeger:
            protocols:
              grpc:
              thrift_compact:
        remote_write:
          - endpoint: tempo.tempo:14250
            insecure: true
        batch:
          timeout: 5s
          send_batch_size: 100
        automatic_logging:
          backend: logs_instance
          logs_instance_name: default
          spans: true
          processes: true
          roots: true

According to this page (which I think is the most recent documentation):

The endpoint config host:port “must be the port of gRPC receiver”

This should be tempo.tempo:14250 in my case (Tempo deployed using the Grafana helm chart with defaults, except for the namespace).

Now that I read what I’m posting, I see I have used the port named tempo-jaeger-grpc, although maybe I’m supposed to use the port named tempo-otlp-grpc?

Name:              tempo
Namespace:         tempo
Labels:            app.kubernetes.io/instance=tempo
                   app.kubernetes.io/managed-by=Helm
                   app.kubernetes.io/name=tempo
                   app.kubernetes.io/version=1.1.0
                   helm.sh/chart=tempo-0.7.7
Annotations:       meta.helm.sh/release-name: tempo
                   meta.helm.sh/release-namespace: tempo
Selector:          app.kubernetes.io/instance=tempo,app.kubernetes.io/name=tempo
Type:              ClusterIP
IP:                172.20.45.89
Port:              tempo-prom-metrics  3100/TCP
TargetPort:        3100/TCP
Endpoints:         10.236.113.35:3100
Port:              tempo-query-jaeger-ui  16686/TCP
TargetPort:        16686/TCP
Endpoints:         10.236.113.35:16686
Port:              tempo-jaeger-thrift-compact  6831/UDP
TargetPort:        6831/UDP
Endpoints:         10.236.113.35:6831
Port:              tempo-jaeger-thrift-binary  6832/UDP
TargetPort:        6832/UDP
Endpoints:         10.236.113.35:6832
Port:              tempo-jaeger-thrift-http  14268/TCP
TargetPort:        14268/TCP
Endpoints:         10.236.113.35:14268
Port:              tempo-jaeger-grpc  14250/TCP
TargetPort:        14250/TCP
Endpoints:         10.236.113.35:14250
Port:              tempo-zipkin  9411/TCP
TargetPort:        9411/TCP
Endpoints:         10.236.113.35:9411
Port:              tempo-otlp-legacy  55680/TCP
TargetPort:        55680/TCP
Endpoints:         10.236.113.35:55680
Port:              tempo-otlp-http  55681/TCP
TargetPort:        55681/TCP
Endpoints:         10.236.113.35:55681
Port:              tempo-otlp-grpc  4317/TCP
TargetPort:        4317/TCP
Endpoints:         10.236.113.35:4317
Port:              tempo-opencensus  55678/TCP
TargetPort:        55678/TCP
Endpoints:         10.236.113.35:55678
Session Affinity:  None
Events:            <none>

Hi! Great, I’m happy it’s working for you now!

Yes, you should point to the OTLP gRPC receiver’s port. While Tempo and the Agent can both ingest multiple formats, the Agent only exports OTLP over gRPC and HTTP.

This error is telling you that the Agent wasn’t able to push to that endpoint, since the receiving service opentelemetry.proto.collector.trace.v1.TraceService is unimplemented (the Agent was pointed at a Jaeger port, as you mentioned).

Changing the remote_write to tempo.tempo:55680 should do the trick.

Note: OTLP default ports were changed to 4317 for gRPC and 4318 for HTTP. You may find those referenced in the documentation, but the old ones 55680 and 55681 are still supported.

Great! Thank you @mariorodriguez

I will try that out and report back on Monday.

Here is a brief description of the problems and the solution.

Task: Run Grafana Agent and Tempo in Kubernetes and route traces through the agent to Tempo

Problems faced

  • network policies - my company uses some default network policies that I had not fully understood

I was able to verify that the network policies were allowing the traffic by spinning up a “troubleshooting container” in the namespace where my Grafana Agent is running and using curl to fetch the Tempo /metrics endpoint.

  • I assumed that the needed OTLP gRPC port was open by default on Tempo

The default helm chart values.yaml does say “this configuration will listen on all ports and protocols that tempo is capable of”, but the OTLP receivers are not listed. I have added this to my values.yaml:

tempo:
  receivers:
    otlp:
      protocols:
        grpc:

This will add an OTLP gRPC receiver on the new default port 4317.
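To match, the agent’s remote_write then points at that port. A sketch of the relevant fragment of my agent.yaml (other fields omitted):

```yaml
# Sketch: export traces to the OTLP gRPC receiver enabled in Tempo above.
traces:
  configs:
    - name: default
      remote_write:
        - endpoint: tempo.tempo:4317   # new default OTLP gRPC port
          insecure: true
```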

Thank you @mariorodriguez for the help :slight_smile:

1 Like