You may remember me from Data model advice for GitHub Actions workflows - #6 by msarahan. I’m parsing GitHub Actions job metadata and sending the resultant spans to Tempo. I’m using the opentelemetry-python SDK with the opentelemetry-exporter-otlp-proto-http exporter.
My issue is that traces sometimes take several hours to show up in Tempo, and sometimes they don't show up at all. I don't see any log options that would help diagnose this. I have turned on logging for received spans on the distributor, but nothing shows up in the logs.
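For reference, the distributor logging I turned on is roughly the following (config sketch from memory, so double-check the option name against your Tempo version; older versions call it log_received_traces):

distributor:
  log_received_spans:
    enabled: true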
What I have done to try to isolate the problem:
- Sent the same traces to Jaeger. They show up fine.
- Tried both the Python OpenTelemetry SDK and the Go-based OpenTelemetry implementation (via GitHub - equinix-labs/otel-cli: OpenTelemetry command-line tool for sending events from shell scripts & similar environments) to rule out the Python SDK. Same result: Jaeger works fine, but in Tempo the traces are delayed or missing.
A simple script that sometimes reproduces the issue is:
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace.export import BatchSpanProcessor, SimpleSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from time import time_ns, sleep
import logging

logging.basicConfig(level=logging.DEBUG)

# SimpleSpanProcessor exports each span synchronously when it ends;
# swap in BatchSpanProcessor here to test batched export instead.
SpanProcessor = SimpleSpanProcessor

# Global provider: emits a simple standalone span for "service1".
trace.set_tracer_provider(
    TracerProvider(resource=Resource.create({"service.name": "service1"}))
)
trace.get_tracer_provider().add_span_processor(SpanProcessor(OTLPSpanExporter()))
tracer = trace.get_tracer("tracer.one")

with tracer.start_as_current_span("some-name") as span:
    span.set_attribute("key", "value")

# Separate provider for the consolidated workflow trace.
third_provider = TracerProvider()
third_provider.add_span_processor(span_processor=SpanProcessor(OTLPSpanExporter()))
third_tracer = trace.get_tracer("GitHub Actions parser", "0.0.1", tracer_provider=third_provider)

# Root span for the workflow, with a child span parented via explicit context.
root_span = third_tracer.start_span("workflow root", start_time=time_ns())
root_context = trace.set_span_in_context(root_span)
sleep(0.5)
step_span = third_tracer.start_span(name="steppy", start_time=time_ns(), context=root_context)
sleep(0.5)
step_span.end()
root_span.end(time_ns())
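One thing I am not sure about is whether the lack of an explicit flush/shutdown matters here. With SimpleSpanProcessor the export should happen synchronously when each span ends, but for completeness, a sketch of what an explicit shutdown would look like at the end of the script above (in case the processor is suspected):

# Sketch only: flush and shut down both providers before the process exits.
# SimpleSpanProcessor exports on end(), so this should be a no-op here, but
# BatchSpanProcessor would drop any queued spans without it.
third_provider.force_flush()
third_provider.shutdown()
trace.get_tracer_provider().shutdown()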
When the issue shows up, it manifests as the latter trace either not appearing at all, or appearing without the "steppy" child span. I can't reliably reproduce it this way, but since it happens sometimes, I thought it was worth showing.
I have posted a more complicated example that always reproduces the issue in a gist: README.md · GitHub
If I can provide any diagnostic logging that would be helpful, please let me know.
EDIT: I should mention that Tempo was working fine when each job was sending its own trace data, rather than consolidating it into one step at the end. We needed to consolidate it because we mix self-hosted and GitHub-provided runners, and the GitHub-provided runners can't reach our Tempo server.
The total number of spans is about 900, and filtering out data points shorter than 2 seconds before creating spans cuts that down to about 330.
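The duration filter itself is trivial; roughly the following (a hypothetical sketch, the field names are not the real parser's):

# Hypothetical sketch of the pre-span filter, not the actual parser code:
# drop any step that ran for less than 2 seconds before creating spans for it.
MIN_DURATION_NS = 2 * 1_000_000_000

def filter_short_steps(steps):
    # each step is assumed to carry start/end timestamps in nanoseconds
    return [s for s in steps if s["end_ns"] - s["start_ns"] >= MIN_DURATION_NS]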