Tracing Custom Metrics

I’m doing a trade study whose goal is to identify a solution for tracing custom domain metrics, each having either a single correlation ID or a set of correlation IDs acting collectively as a compound correlation key. All metrics are aggregated by a single process scraped by Prometheus and visualized by Grafana 8.1.2.

If we used Tempo, we’d deploy it into our Kubernetes cluster(s) via its Helm chart. We’d configure Tempo to use an S3 backend.

How would you recommend getting our (easily transformable) custom metrics into Tempo? Would it be better to:

  • Use some library (?) to push metrics to Tempo?
  • Dump metrics directly into S3?

Aside from acquiring Tempo, do we need any other OSS?

The end goal would be for someone to use one or more of the different types of correlation IDs to identify traces via Grafana.

Tempo is a tracing backend, so we can only ingest and store trace data. I do not recommend pushing data directly to the S3 bucket, since data must be in the correct format for Tempo to be able to read it.

If you wish to store your metrics as traces, you will have to convert them into a trace format first and send them to Tempo for ingestion. I’d recommend using OpenTelemetry’s OTLP format; there should be an SDK for most languages.
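As a rough sketch of what that looks like on the wire, a single span can be wrapped in an OTLP/JSON export request using only the Python standard library. This assumes Tempo's OTLP/HTTP receiver is enabled; the host name, service name and scope name below are made up, and 4318 with `/v1/traces` is simply the conventional OTLP/HTTP port and path. In practice an OpenTelemetry SDK would build and send this for you.

```python
import json
import secrets
import time
import urllib.request

# Hypothetical endpoint: point this at wherever Tempo's OTLP/HTTP receiver listens.
TEMPO_OTLP_URL = "http://tempo.example.internal:4318/v1/traces"


def build_otlp_request(span_name: str, service_name: str = "metrics-aggregator") -> urllib.request.Request:
    """Build (but do not send) an OTLP/JSON export request containing one span."""
    now_ns = time.time_ns()
    payload = {
        "resourceSpans": [{
            "resource": {
                "attributes": [
                    {"key": "service.name", "value": {"stringValue": service_name}}
                ]
            },
            "scopeSpans": [{
                "scope": {"name": "custom-metrics-bridge"},  # illustrative name
                "spans": [{
                    # OTLP/JSON encodes IDs as hex: 16 bytes for traces, 8 for spans.
                    "traceId": secrets.token_hex(16),
                    "spanId": secrets.token_hex(8),
                    "name": span_name,
                    "kind": 1,  # SPAN_KIND_INTERNAL
                    "startTimeUnixNano": str(now_ns),
                    "endTimeUnixNano": str(now_ns),
                }],
            }],
        }]
    }
    return urllib.request.Request(
        TEMPO_OTLP_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> int:
    """Send the export request and return the HTTP status code."""
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status
```

Calling `send(build_otlp_request("order.processed"))` would post one zero-duration span; nothing is sent until `send` is called.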

That said, is there a particular reason you wish to store your custom metrics as a trace in Tempo? I’d expect a metrics backend to be more suited for this data.

Thanks for the fast response!

We’re already using Prometheus and Grafana for metric trending. The problem we’re trying to solve is how to analyze particular end-to-end chains of events. We can monitor the aggregate trending of events, but we want the ability to do diagnostics for particular event IDs. Prometheus can’t retain the original correlation IDs because they are countably infinite. Distributed tracing (e.g., via Tempo) sounds like it would allow us to continue to use Grafana, augmented with a new Tempo data source through which a user can search for a particular correlation ID and see the end-to-end flow.

I’ve used Zipkin and Sleuth/Brave on another project. I’m assuming the Tempo data model and architecture are similar.

One challenge in converting formats is mapping our correlation IDs to trace and span IDs. We either need to:

  • Use our correlation IDs as trace/span IDs - possibly challenging considering we have multiple types of correlation IDs that all follow different conventions than that required by trace & span IDs
  • Put our correlation IDs into trace baggage and/or span tags, then use Grafana to search for traces by baggage or tags (not the trace IDs themselves)

It is really hard to understand you, because you are mixing traces with metrics. But my guess is that you need logs. Keep in mind that logs != traces != metrics.
Publish everything you need to the logs (structured according to your needs) together with the correlation ID. Process and store those logs with your favorite log tool/storage (e.g. Loki) and visualize them (with a correlation ID filter) in your favorite visualization tool (Grafana).
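For instance, a minimal sketch of such structured logging with Python's standard `logging` module could look like the following. The field names `correlation_id` and `event_type` and the logger name are only placeholders for whatever your domain uses.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "level": record.levelname,
            "message": record.getMessage(),
            # Fields passed via `extra=` become attributes on the record.
            "correlation_id": getattr(record, "correlation_id", None),
            "event_type": getattr(record, "event_type", None),
        }
        return json.dumps(entry)


def make_logger(stream=sys.stdout) -> logging.Logger:
    """Return a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("domain-events")  # placeholder logger name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False
    return logger
```

A call such as `make_logger().info("order processed", extra={"correlation_id": "abc-123", "event_type": "order"})` then emits one line that a log store can filter on by correlation ID.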

Thanks again for the reply. Sorry for not being clear.

  • We have logs, but they are unreliable at scale & load. While I’m confident there’s a technical solution, it is not feasible given other non-technical constraints.
  • We have Prometheus metrics, but they lack the specific (countably infinite) UUIDs.
  • Tracing metrics using our internal domain-specific UUIDs would be ideal. For the sake of my trade study, I’m going to assume I can use a library to emit traces to Tempo backed by S3 and (eventually?) query for trace IDs and domain-specific UUIDs in Grafana. Let me know if this is an invalid assumption.

Why don’t you use a better metric storage for your Prometheus data, one that doesn’t have a problem with high cardinality (e.g. Mimir)?

I bet you will find technical limitations with the trace approach as well. E.g. you can only search a 30-day time range by default (log solutions usually don’t have this limitation). So how do you solve it if you have a correlation ID but it is historic? Will you recommend that your users search the whole stored trace history in 30-day chunks?

I still see a proper logging solution as the best approach (but of course I don’t see all limitations on your side).

I think the main issue you are dealing with is that most metrics backends aggregate data, so you lose the specific data points, right?

You can store these data points as traces in Tempo and look them up individually. Tempo is designed to store traces, so the data has to be ‘trace-shaped’. That said, nothing is stopping you from using it as a more generic event store. You can, for instance, send one span for every data point and store your metadata in the span attributes. We don’t have limits on the number of attributes or their size.
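To make the "one span per data point" idea concrete, here is a small sketch that turns one event into one span in the OTLP/JSON shape, with every remaining field stored as a span attribute. The event layout (`name`, `timestamp_ns`, plus arbitrary metadata) is an assumption for illustration, not something Tempo requires.

```python
import secrets
from typing import Any


def to_any_value(value: Any) -> dict:
    """Wrap a Python value in the OTLP/JSON AnyValue shape."""
    if isinstance(value, bool):  # check bool before int: bool is a subclass of int
        return {"boolValue": value}
    if isinstance(value, int):
        return {"intValue": str(value)}  # 64-bit ints are strings in proto3 JSON
    if isinstance(value, float):
        return {"doubleValue": value}
    return {"stringValue": str(value)}


def event_to_span(event: dict, trace_id_hex: str) -> dict:
    """Turn one data point into one span; every other field becomes an attribute."""
    reserved = {"name", "timestamp_ns"}  # hypothetical event fields
    attributes = [
        {"key": key, "value": to_any_value(value)}
        for key, value in event.items()
        if key not in reserved
    ]
    return {
        "traceId": trace_id_hex,
        "spanId": secrets.token_hex(8),  # span IDs are 8 bytes, hex-encoded
        "name": event["name"],
        # A point-in-time event is stored as a zero-duration span.
        "startTimeUnixNano": str(event["timestamp_ns"]),
        "endTimeUnixNano": str(event["timestamp_ns"]),
        "attributes": attributes,
    }
```

The resulting dict would go into the `spans` list of an OTLP export request; an OpenTelemetry SDK does the equivalent when you set attributes on a span.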

If it is possible to convert each data point into a single log line, Loki might also be a good solution here. It has more advanced query capabilities than Tempo.

Trace IDs are just 16-byte arrays. As long as they are unique, you can store whatever you like in them.
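One possible way to get from a correlation ID to a trace ID, sketched under the assumption that your IDs are either UUIDs or arbitrary strings: use a UUID's 16 bytes directly, and hash anything else down to 16 bytes. The function name is made up, and truncating a SHA-256 digest is just one choice of deterministic mapping.

```python
import hashlib
import uuid


def trace_id_from_correlation_id(correlation_id: str) -> bytes:
    """Map a correlation ID to a deterministic 16-byte trace ID."""
    try:
        # A UUID is already 16 bytes, so it can be used as-is.
        trace_id = uuid.UUID(correlation_id).bytes
    except ValueError:
        # Any other convention: hash it and keep the first 16 bytes.
        trace_id = hashlib.sha256(correlation_id.encode("utf-8")).digest()[:16]
    if trace_id == bytes(16):
        # OpenTelemetry and W3C Trace Context treat an all-zero trace ID as invalid.
        raise ValueError("an all-zero trace ID is invalid")
    return trace_id
```

Because the mapping is deterministic, anyone holding the correlation ID can recompute the trace ID and look the trace up directly. A compound key could be joined with a separator and hashed the same way.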

Yes, this should work. Tempo will be able to search on all attributes on your spans, but depending on your ingest volume, searching over the full range might become too slow. Looking up a trace by its ID will be much more efficient than searching for it by a custom ID stored in the attributes.
We have some work in the pipeline to improve search by switching over to Parquet blocks, but this is still a while out.
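The difference shows up in the two HTTP calls involved. The sketch below only builds the URLs, assuming Tempo's query API with `/api/traces/<id>` for lookups and `/api/search` with a logfmt-style `tags` parameter for attribute search. The host is hypothetical, and 3200 is the HTTP port used in Tempo's example configs, so check both against your deployment and version.

```python
from urllib.parse import quote, urlencode

# Hypothetical base URL for Tempo's HTTP API.
TEMPO_URL = "http://tempo.example.internal:3200"


def trace_by_id_url(trace_id_hex: str) -> str:
    """Direct lookup: fetch exactly one trace by its hex-encoded ID."""
    return f"{TEMPO_URL}/api/traces/{quote(trace_id_hex)}"


def search_by_tag_url(key: str, value: str, start: int, end: int, limit: int = 20) -> str:
    """Attribute search between start and end (Unix seconds).

    Values containing spaces would need logfmt quoting, which this sketch omits.
    """
    params = {"tags": f"{key}={value}", "start": start, "end": end, "limit": limit}
    return f"{TEMPO_URL}/api/search?{urlencode(params)}"
```

The first URL resolves a single trace, while the second asks Tempo to scan the stored data in the given time window, which is why a narrow window matters at high ingest volume.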

Tempo can ingest various formats, including OTLP, Zipkin and Jaeger. So if you use a library that can export to one of those, you should be fine.

The retention period and search range are configurable. So while I think the default is 30 days, you can choose to store traces for longer. But note that this might slow down search again.

Thanks @koenraad and @jangaraj for your responses. I’m excited to see where Tempo goes.

The leading competitor in my trade study is dumping my events to MongoDB and using it as a Grafana data source. My event data already has a well-defined, backwards-compatible JSON format across all types of events (not too dissimilar to traces). This option doesn’t have the breadth of potential that the 3 pillars bring, but it may be more practical given our other constraints and objectives.

Thanks again.