Data model advice for GitHub Actions workflows

I am looking to instrument a build system based on GitHub Actions. I have an implementation that I’ll describe below, but my data model seems like a poor fit for Tempo’s TraceQL capabilities, and I’m seeking advice on a better design.

How GitHub Actions is set up:

  • A top-level “project” repository contains the pull-request workflow.
    • Each job in the project repository either calls a reusable workflow or does some work itself. The difference is mostly whether the job has sub-jobs or just steps.
    • The reusable workflows may have matrices in them, and they have their own jobs.

My current data model is to start a trace and span for the top-level “project” workflow, and to have all other jobs be children of this top-level span.
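Roughly, the shape I’m creating looks like this. This is a minimal sketch using the OpenTelemetry Python SDK; the job record, attribute names, and console exporter are illustrative (the real timestamps come from the GitHub Actions API):

import datetime

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("gha-instrumentation")

def to_ns(dt):
    # OpenTelemetry accepts explicit timestamps as Unix epoch nanoseconds.
    return int(dt.timestamp() * 1_000_000_000)

# Hypothetical job record; in practice created/started/completed come from
# the GitHub Actions API for each job in the run.
now = datetime.datetime.now(datetime.timezone.utc)
job = {"name": "build-cpp",
       "created": now,
       "started": now + datetime.timedelta(seconds=40),
       "completed": now + datetime.timedelta(minutes=12)}

# One root span for the workflow run; every job is a child span of it.
root = tracer.start_span("project-workflow", start_time=to_ns(now))
ctx = trace.set_span_in_context(root)
span = tracer.start_span(job["name"], context=ctx, start_time=to_ns(job["created"]))
# Queue latency (creation -> start) recorded as an attribute on the job span.
span.set_attribute("rapids.queue_delay_s",
                   (job["started"] - job["created"]).total_seconds())
span.end(end_time=to_ns(job["completed"]))
root.end()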

I want to be able to create a dashboard that gives insight into things like:

  • For a given machine type, what was the average time from job creation to job start time? This is a latency that we can work on with our hosted infrastructure.
  • For a given job (task, like build-cpp, plus matrix variables for configuration), what does our build time look like over time? Over time really means “over commit space”, and it would be nice to be able to filter and group by git branches.
  • Attach logs as span events for things like resource usage, cache hit statistics, etc.

Under my current model, this feels wrong. The queries seem designed to return traces, not the child spans of traces. If I want to show a statistic, it seems like that statistic should be one-per-trace, and that I should capture the connection to the top-level project job using attributes instead. I believe I should remove the top-level span and give each child job a unique trace ID.
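Concretely, the alternative I’m imagining would give each job its own trace and move the run-level grouping into resource attributes. A rough sketch with the OpenTelemetry Python SDK; the ci.run_id and ci.branch attribute names are ones I made up (e.g. filled from ${{ github.run_id }}), not anything standard:

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider

# Each job process gets its own provider, so each job starts a fresh trace;
# the workflow run and branch ride along as resource attributes instead of
# being encoded in a parent span.
resource = Resource.create({
    "service.name": "gha-jobs",
    "ci.run_id": "1234567890",
    "ci.branch": "main",
})
provider = TracerProvider(resource=resource)
tracer = provider.get_tracer("gha-instrumentation")

with tracer.start_as_current_span("build-cpp") as span:
    span.set_attribute("rapids.operation", "build")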

Am I seeing things accurately, or have I just misunderstood the query language?

Relevant links (I am limited to 2 links; will post more in follow-up):

Not exactly what you asked, but I am sharing a few links that might be helpful for your use case:

Check out this Intro to CI/CD Observability and the Grafana LGTM Stack webinar

CI/CD O11y blog post, the OpenTelemetry Proposal (closed, but there is tons of good discussion in there), and the grafana-ci-otel-collector project.


Thanks for the ideas. It’s good to see that Grafana might be getting some built-in (or plug-in) support for this kind of thing. I think I need more flexibility for our application, but the webinar and blog post show some similar ideas and traces.

The split between how you query and how you display multiple “things” (spans or traces) is either confusing to me or not possible. For example, my initial data model combined many “jobs” under one trace (each job here is collapsed):

I changed it to have just one trace per job:

I believe my real problem is my clumsiness with TraceQL, not my data model. I think I can use resource attributes to select and differentiate the specific spans that make up the many traces. I’m running queries like:

{resource.operation=~"build" && resource.arch=~"amd64" && resource.cuda=~"11.4" && resource.package_type=~"conda" && span:name=~"delay time"}

I think this query will give the same result whether or not the jobs are their own traces. However, with the jobs all nested under one trace, the trace view can show them all in context, and I don’t see a way to select and view multiple traces in the trace viewer at once. Because of this, I think I’m going to revert to my original model: one mega-trace per run, with a child span for each job in that run.

I think I’m stuck on wanting to get the duration of individual spans. Maybe this depends on TraceQL metrics? Configure TraceQL metrics | Grafana Tempo documentation
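If I can get TraceQL metrics enabled (per that doc, they need the metrics-generator’s local-blocks processor), I think a query along these lines would graph span durations directly, grouped by an attribute. This is a sketch based on the documented functions; which functions are available (quantile_over_time, avg_over_time, …) depends on the Tempo version:

{span:name="Start delay time" && resource.rapids.operation="build-cpp"} | quantile_over_time(duration, .5) by (resource.rapids.cuda)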

I have two different spans in one trace that I want to compare. I can isolate them so that each query matches only one span per trace, but the parent trace is the same across the two queries:

{span:name="Start delay time" && resource.rapids.arch="amd64" && resource.rapids.cuda=~"11" && resource.rapids.operation="build-cpp"}

and

{span:name="Start delay time" && resource.rapids.arch="amd64" && resource.rapids.cuda=~"12" && resource.rapids.operation="build-cpp"}

Is there any way to query for the span data itself, instead of using the span data/metadata solely as the way to filter traces?

Does using select() to select the span attributes work for your use case?

For example, I ran this query:

{ resource.service.name="tempo-ingester" } | select(span:duration, trace:duration, name)

and picked span duration and trace duration from the Intrinsic fields. I can then look at the individual spans by setting Table Format to Spans.

I apologize for the delayed response here; I missed the notification. The table information is exactly what I’m after, but I need to be able to plot it. I have a working prototype in Python that uses Tempo’s HTTP API.

import datetime
from pathlib import Path
from urllib.parse import urljoin

import pandas as pd
import requests

# mTLS material for our Tempo endpoint; set these for your environment.
client_cert = Path("client.crt")
client_key = Path("client.key")
ca = Path("ca.crt")


def format_query_string_for_q(params=None, reverse_name_map=None):
    """Build a TraceQL filter like {k="v" && (k2="a" || k2="b")}."""
    values_parts = []
    for k, v in (params or {}).items():
        # Map friendly widget names back to the real attribute names.
        if reverse_name_map and k in reverse_name_map:
            k = reverse_name_map[k]
        if not v:
            continue
        if hasattr(v, 'lower'):  # a single string value
            if k in ('name', 'rapids.labels'):
                values_parts.append(f'{k}=~"{v}"')  # regex match
            else:
                values_parts.append(f'{k}="{v}"')   # exact match
        else:  # an iterable of values becomes an OR group
            values_parts.append("(" + " || ".join(f'{k}="{iv}"' for iv in v) + ")")
    return "{" + " && ".join(values_parts) + "}"


def retrieve_data(base_url, start=-5, end=None, params=None, reverse_name_map=None):
    # Avoid an eagerly evaluated default: datetime.now() in the signature
    # would be frozen at import time.
    if end is None:
        end = datetime.datetime.now()
    # An integer start is interpreted as an offset in days from `end`.
    if isinstance(start, int):
        start = end + datetime.timedelta(days=start)

    values_string = format_query_string_for_q(params=params, reverse_name_map=reverse_name_map)
    query_params = {
        "limit": 100,    # max number of traces returned
        "spss": 1000,    # max spans per spanset (one spanset per trace)
        "start": int(start.timestamp()),
        "end": int(end.timestamp()),
        "q": values_string,
    }
    response = requests.get(urljoin(base_url, "api/search"),
                            params=query_params,
                            cert=(client_cert.as_posix(), client_key.as_posix()),
                            verify=ca.as_posix())
    if response.status_code >= 400:
        print(response)
    df = pd.json_normalize(response.json()['traces'])
    try:
        # Explode each trace's spanset into one row per span, keeping the
        # trace-level columns alongside the span-level ones.
        expanded = pd.concat(
            [df.filter(["traceID", "startTimeUnixNano", "durationMs"]),
             df.explode("spanSet.spans")['spanSet.spans'].apply(pd.Series).rename(
                 columns={"startTimeUnixNano": "spanStartTime",
                          "durationNanos": "spanDurationM"})],
            axis=1)
        # Convert span durations to minutes and epoch nanoseconds to datetimes.
        expanded['spanDurationM'] = (pd.to_numeric(expanded['spanDurationM'], errors='coerce').fillna(0) / 1E9 / 60)
        expanded['spanStartTime'] = pd.to_datetime(pd.to_numeric(expanded['spanStartTime']) / 1E9, unit='s')
        # Cut off jobs with delay times longer than 10 hours (600 minutes).
        expanded = expanded.query('spanDurationM < 600')
    except KeyError:
        print(df)
        raise
    return expanded.filter(["traceID", "startTimeUnixNano", "durationMs",
                            "spanID", "spanStartTime", "spanDurationM"])
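For example, a call like the following (the base URL and attribute values here are placeholders) returns one row per matching span:

df = retrieve_data(
    "https://tempo.example.com/",
    start=-30,  # the last 30 days
    params={
        "resource.rapids.operation": "build-cpp",
        "resource.rapids.arch": "amd64",
        "resource.rapids.cuda": ["11.4", "12.0"],  # list values become an OR group
        "name": "Start delay time",
    },
)
print(df.head())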

This feeds into a plot/dashboard tool that allows us to filter on different resource attributes that we set. Each point in this plot represents one span that matches the filters. One trace may contribute more than one point in this plot. Each row of widgets is a “dimension,” and selecting more than one value for a dimension creates a separate series for each combination of all dimensions.

It would still be desirable to build this kind of plot in Grafana, because having the trace viewer and the dashboard creation tools unified in one place would be nicer than a Python dashboard that we have to host separately.