How do I join target_info into OpenTelemetry metrics?

  • What Grafana version and what operating system are you using?
    docker-otel-lgtm v0.6.0

  • What are you trying to achieve?
    I am trying to merge metrics with their corresponding target_info.

  • How are you trying to achieve it?
    given_metric * on(instance) group_left(field_to_merge) target_info

  • What happened?
    I received an error stating that I can’t do a many-to-many join.

  • What did you expect to happen?
    Based on the documentation I’ve read, I should be able to use group_left to merge target_info into a given metric, so I can attach labels that would otherwise result in too high a cardinality if I dumped everything directly into the metric’s own labels.

  • Can you copy/paste the configuration(s) that you are having problems with?
    process_cpu_time_seconds_total * on(instance) group_left(process_pid) target_info

  • Did you receive any errors in the Grafana UI or in related logs? If so, please tell us exactly what they were.
    execution: found duplicate series for the match group {} on the right hand-side of the operation: [{…data…}] many-to-many matching not allowed: matching labels must be unique on one side

  • Did you follow any online instructions? If so, what is the URL?
    Using OpenTelemetry and Prometheus: A practical guide to data collection

Additional notes and troubleshooting:
I am able to get this working if I specify a single PID; however, I don’t want to chart a single PID, I want to chart the metric for every possible PID. The same applies to several other metrics I want to visualize the same way in the future. I need to be able to merge in target_info because putting all of that data into metric labels blows out my cardinality, but when I try to merge the labels I either have to filter down to a single value or I get many-to-many merge errors. I have tried a few different query structures, but everything so far has resulted in either errors or a table that doesn’t contain all the data I need.
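As a sketch of the kind of query I mean (host_name is just a placeholder for whichever resource attribute I actually want to pull in), collapsing target_info to one series per instance before joining does avoid the many-to-many error:

process_cpu_time_seconds_total
  * on (job, instance) group_left (host_name)
  max by (job, instance, host_name) (target_info)

But that only works for attributes that are constant per instance; it can’t pull in process_pid, because there are many target_info series (one per process) for the same instance.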

The OTEL LGTM stack I’m using is docker-otel-lgtm. I didn’t change anything in the config; I just downloaded it, unzipped it, built the Docker image, and ran it with the run script.

I’m using a standard OTEL Collector pointed at the OTEL LGTM stack. All I’m running right now is hostmetrics and resource detection. I am about six months behind on the collector version, but I don’t think that is the issue, since the metrics and target_info show up correctly in Prometheus. Here is my YAML:

receivers:
  hostmetrics:
    collection_interval: 60s
    scrapers:
      cpu:
        metrics:
          system.cpu.utilization:
            enabled: true
      disk:
      filesystem:
        metrics:
          system.filesystem.utilization:
            enabled: true
      load:
      memory:
        metrics:
          system.memory.utilization:
            enabled: true
      network:
      paging:
        metrics:
          system.paging.utilization:
            enabled: true
      process:
        mute_process_exe_error: true
        mute_process_io_error: true
        mute_process_user_error: true
        mute_process_name_error: true
        metrics:
          process.cpu.utilization:
            enabled: true
          process.cpu.time:
            enabled: true
      processes:

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 1000
    spike_limit_mib: 200
  resourcedetection:
    detectors: [env, system]
    system:
      hostname_sources:
      resource_attributes:
        host.id:
          enabled: true
        host.ip:
          enabled: true
        host.mac:
          enabled: true
  batch:
exporters:
  otlphttp:
    endpoint: http://localhost:4318
service:
  telemetry:
    metrics:
      level: none
  pipelines:
    metrics/agent:
      receivers: [hostmetrics]
      processors: [memory_limiter, resourcedetection, batch]
      exporters: [otlphttp]
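
For context, the data this produces in Prometheus looks roughly like the following (label names and values are illustrative; the exact set depends on which resource attributes get emitted):

process_cpu_time_seconds_total{job="...", instance="...", state="user"}  42.5
target_info{job="...", instance="...", process_pid="1234", process_executable_name="foo", host_name="myhost"}  1
target_info{job="...", instance="...", process_pid="5678", process_executable_name="bar", host_name="myhost"}  1

so a join on(instance) alone matches multiple target_info series on the right-hand side.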

I may have figured out the answer. I think the problem is a fundamental misunderstanding of how Prometheus works and how the joins are occurring. I thought there was something linking each metric instance to its target_info, but now I think those are just two separate datasets, and since the join is happening on instance it’s basically just doing a timestamp merge. That means I have to add labels to the metric side until, at a minimum, each metric series matches exactly one target_info series. It also means I can’t just offload data into target_info if that data is actually required for selecting and distinguishing metric values.

Example: since the metric process_cpu_time_seconds_total only makes sense in the context of an individual process, I NEED to put something in the metric labels to distinguish the different processes. That means I’ll end up with a very high-cardinality metric, so if I want that data I have to accept the tradeoff of higher “cost”, since it produces a larger number of unique metric series. The same goes for host information. If I need to distinguish system metrics across hosts, I can’t assume the timestamp will be enough to separate them, so host_hostname or host_id will need to be a label on my metrics instead of living only in target_info.
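
If that reasoning holds, the practical question becomes how to get those resource attributes onto the metric series. One option (a sketch; I haven’t checked whether the Prometheus bundled in docker-otel-lgtm is new enough to support it) is the promote_resource_attributes setting in Prometheus’s OTLP configuration, which copies the listed resource attributes onto every metric series instead of leaving them only on target_info:

otlp:
  promote_resource_attributes:
    - host.name
    - process.pid
    - process.executable.name

The tradeoff is exactly the cardinality cost described above, since each promoted attribute becomes part of the series identity.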

If someone with more experience can check my math on that, I would greatly appreciate it.
