I am looking to instrument a build system based around GitHub Actions. I have an implementation that I’ll describe, but my data model seems wrong for using Tempo’s TraceQL capabilities, and I’m seeking advice on a better design.
How GitHub Actions is set up:
- A top-level “project” repository contains the pull-request workflow.
- Each job in the project repository either calls a reusable workflow or does some work directly. The difference is mostly whether the job has sub-jobs or just steps.
- The reusable workflows may contain matrices, and they define their own jobs.
My current data model is to start a trace and span for the top-level “project” workflow, and to have all other jobs be children of this top-level span.
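A minimal sketch of this current model, written with plain Python dataclasses rather than a real tracing SDK. The `Span` class, the `build_current_model` helper, and the `test-cpp` job name are made up for illustration; only the shape (one trace ID, one root span, every job a direct child) reflects what I described:

```python
from dataclasses import dataclass, field
from typing import Optional
import secrets


@dataclass
class Span:
    """Stand-in for a tracing span; not a real OpenTelemetry type."""
    name: str
    trace_id: str
    span_id: str = field(default_factory=lambda: secrets.token_hex(8))
    parent_span_id: Optional[str] = None
    attributes: dict = field(default_factory=dict)


def build_current_model(workflow: str, jobs: list) -> list:
    """One trace: a root span for the workflow, plus one child span per job."""
    trace_id = secrets.token_hex(16)
    root = Span(name=workflow, trace_id=trace_id)
    children = [
        Span(name=job, trace_id=trace_id, parent_span_id=root.span_id)
        for job in jobs
    ]
    return [root, *children]


# Hypothetical job names under the pull-request workflow.
spans = build_current_model("pull-request", ["build-cpp", "test-cpp"])
```

Every job span here shares the workflow's trace ID, so any per-job statistic has to be pulled out of child spans rather than read off one trace per job.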
I want to be able to create a dashboard that gives insight into things like:
- For a given machine type, what was the average time from job creation to job start time? This is a latency that we can work on with our hosted infrastructure.
- For a given job (a task like build-cpp, plus the matrix variables for its configuration), what does our build time look like over time? Here, “over time” really means “over commit space”, and it would be nice to be able to filter and group by git branch.
- I would also like to attach logs as span events, for things like resource usage and cache hit statistics.
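To make the first item concrete, here is a rough sketch of the statistic I am after, computed in plain Python over made-up job records. The field names (`machine_type`, `created_at`, `started_at`), the machine type labels, and the timestamps are all placeholders, not real data or real attribute names:

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

# Made-up sample records; in practice these would be span attributes.
jobs = [
    {"machine_type": "cpu-small", "created_at": "2024-01-01T10:00:00+00:00",
     "started_at": "2024-01-01T10:00:30+00:00"},
    {"machine_type": "cpu-small", "created_at": "2024-01-01T11:00:00+00:00",
     "started_at": "2024-01-01T11:01:30+00:00"},
    {"machine_type": "gpu-large", "created_at": "2024-01-01T10:00:00+00:00",
     "started_at": "2024-01-01T10:02:00+00:00"},
]


def mean_queue_latency(records):
    """Average seconds between job creation and job start, per machine type."""
    by_machine = defaultdict(list)
    for rec in records:
        created = datetime.fromisoformat(rec["created_at"])
        started = datetime.fromisoformat(rec["started_at"])
        by_machine[rec["machine_type"]].append((started - created).total_seconds())
    return {machine: mean(waits) for machine, waits in by_machine.items()}


latency = mean_queue_latency(jobs)
```

The dashboard would need the equivalent of this grouping done by the query layer, with one row per job rather than one row per workflow run.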
My current model feels wrong. TraceQL queries seem designed to return traces, not child spans within traces. If I want to show a statistic, it seems that the statistic should be one per trace, and that I should capture the connection to the top-level project job with attributes rather than with a parent-child relationship. I believe I should remove the top-level span and give each child job its own unique trace ID.
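A sketch of the alternative I have in mind, again in plain Python rather than a real SDK. The `job_to_root_span` helper, the attribute keys, the `run_id` value, and the `test-cpp` job name are hypothetical; I am not claiming these follow any standard naming convention:

```python
import secrets


def job_to_root_span(job_name, workflow_run):
    """Build a root span record (own trace ID, no parent) for one job.

    The link to the project workflow is carried by attributes only.
    All attribute keys are hypothetical.
    """
    return {
        "name": job_name,
        "trace_id": secrets.token_hex(16),  # unique per job
        "parent_span_id": None,
        "attributes": {
            "workflow.name": workflow_run["name"],
            "workflow.run_id": workflow_run["run_id"],
            "git.branch": workflow_run["branch"],
            "git.sha": workflow_run["sha"],
        },
    }


# Placeholder workflow run; run_id is invented for the example.
run = {"name": "pull-request", "run_id": "12345",
       "branch": "add-telemetry", "sha": "8d5473a"}
job_spans = [job_to_root_span(name, run) for name in ("build-cpp", "test-cpp")]

# Jobs from the same run can still be grouped through the shared attribute.
same_run = [s["name"] for s in job_spans
            if s["attributes"]["workflow.run_id"] == "12345"]
```

The trade-off, as far as I can tell, is that the hierarchy view of a whole workflow run is lost and has to be reconstructed by filtering on the shared attributes.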
Am I seeing things accurately, or have I just misunderstood the query language?
Relevant links (I am limited to 2 links; will post more in follow-up):
- Example shared workflow with matrix: shared-workflows/.github/workflows/conda-cpp-build.yaml at add-telemetry · rapidsai/shared-workflows · GitHub
- Example build log showing project hierarchy: adding telemetry (testing) · rapidsai/rmm@8d5473a · GitHub