You may remember me from Data model advice for GitHub Actions workflows - #6 by msarahan. I’m parsing GitHub Actions job metadata and sending the resultant spans to Tempo. I’m using the opentelemetry-python SDK with the opentelemetry-exporter-otlp-proto-http exporter.
My issue is that traces sometimes take several hours to show up in Tempo, and sometimes they don't show up at all. I don't see any log options that would help diagnose this. I have turned on logging for received spans on the distributor, but nothing shows up in the logs.
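For reference, the distributor logging I turned on is roughly the following (config sketch from memory, so double-check the option name against your Tempo version; older versions call it log_received_traces):

distributor:
  log_received_spans:
    enabled: true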
What I have done to try to isolate the problem:
- Sent the same traces to Jaeger. They show up fine.
- Tried both the Python OpenTelemetry SDK and the Go-based OpenTelemetry implementation (via GitHub - equinix-labs/otel-cli: OpenTelemetry command-line tool for sending events from shell scripts & similar environments) to rule out the Python SDK. Same result: Jaeger works fine, but in Tempo the traces are delayed or missing.
A simple script that sometimes reproduces the issue is:
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace.export import BatchSpanProcessor, SimpleSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from time import time_ns, sleep
import logging

logging.basicConfig(level=logging.DEBUG)

# SimpleSpanProcessor exports each span synchronously when it ends;
# swap in BatchSpanProcessor here to test batched export instead.
SpanProcessor = SimpleSpanProcessor

# Global provider: emits a simple standalone span for "service1".
trace.set_tracer_provider(
    TracerProvider(resource=Resource.create({"service.name": "service1"}))
)
trace.get_tracer_provider().add_span_processor(SpanProcessor(OTLPSpanExporter()))
tracer = trace.get_tracer("tracer.one")

with tracer.start_as_current_span("some-name") as span:
    span.set_attribute("key", "value")

# Separate provider for the consolidated workflow trace.
third_provider = TracerProvider()
third_provider.add_span_processor(span_processor=SpanProcessor(OTLPSpanExporter()))
third_tracer = trace.get_tracer("GitHub Actions parser", "0.0.1", tracer_provider=third_provider)

# Root span for the workflow, with a child span parented via explicit context.
root_span = third_tracer.start_span("workflow root", start_time=time_ns())
root_context = trace.set_span_in_context(root_span)
sleep(0.5)
step_span = third_tracer.start_span(name="steppy", start_time=time_ns(), context=root_context)
sleep(0.5)
step_span.end()
root_span.end(time_ns())
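One thing I am not sure about is whether the lack of an explicit flush/shutdown matters here. With SimpleSpanProcessor the export should happen synchronously when each span ends, but for completeness, a sketch of what an explicit shutdown would look like at the end of the script above (in case the processor is suspected):

# Sketch only: flush and shut down both providers before the process exits.
# SimpleSpanProcessor exports on end(), so this should be a no-op here, but
# BatchSpanProcessor would drop any queued spans without it.
third_provider.force_flush()
third_provider.shutdown()
trace.get_tracer_provider().shutdown()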
When the issue shows up, it manifests as the latter trace either not appearing at all, or appearing without the "steppy" child span. I can't reliably reproduce it this way, but since it happens sometimes, I thought it was worth showing.
I have posted a more complicated example that always reproduces the issue in a gist: README.md · GitHub
If I can provide any diagnostic logging that would be helpful, please let me know.
EDIT: I should mention that Tempo was working fine when each job was sending its own trace data, rather than consolidating it into one step at the end. We needed to consolidate it because we mix self-hosted and GitHub-provided runners, and the GitHub-provided runners can't reach our Tempo server.
The total number of spans is about 900, and filtering out data points shorter than 2 seconds before creating spans cuts that down to about 330.
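The duration filter itself is trivial; roughly the following (a hypothetical sketch, the field names are not the real parser's):

# Hypothetical sketch of the pre-span filter, not the actual parser code:
# drop any step that ran for less than 2 seconds before creating spans for it.
MIN_DURATION_NS = 2 * 1_000_000_000

def filter_short_steps(steps):
    # each step is assumed to carry start/end timestamps in nanoseconds
    return [s for s in steps if s["end_ns"] - s["start_ns"] >= MIN_DURATION_NS]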