Grafana Prometheus OTEL - Filter By Label

Hello,

I am a big grafana noob. I have used it for the longest time with Loki to store and explore logs from my apps. But that was it. Now I am expanding into OpenTelemetry metrics.

I have my metrics being sent from my dotnet api to opentelemetry-collector, which are then exported to prometheus and jaeger (2.2.0).

This all works great and I was able to add this dashboard which shows data!! Don’t ask me how long that took because it was double digit days. Nevertheless, data is flowing into Grafana nicely.

My issue is I have labels being passed along into my prometheus data. The label is env and it has three possible values,

  1. Production
  2. Stage
  3. Local

Ideally, I would like to filter the data in that dashboard on one or more of these values. This is where my noob part shows. I have tried adding variables, which successfully pull these values. But when I select them in the dashboard dropdown, nothing happens.

If I go into the dashboard's JSON and add them as filters next to the existing job and instance filters, the dashboard shows no results. If I try to edit a single dashboard manually and add that filter, all the data disappears.

Does anyone know how I can do this?

I have other labels I will eventually want to filter on like appName.

Thanks!

Hi,

It seems like it should be working fine (you should be adding the env="$environment" part). What you could do is make sure the data is present at all - in the case of your second screen it seems it is, but is equal to 0. Can you go to the Explore page (or create another visualization) and play around with the queries? See if the following queries return some data:

  • sum(http_server_request_duration_seconds_count{job="otel-collector", instance="otel-collector:8889"}) by (env) - in this case look at the possible envs that you have. If the expression returns something like Value, chances are that this metric does not have an env label - by placing your cursor between 8889" and } and pressing Ctrl + Space you should get the list of possible labels in this metric.
  • http_server_request_duration_seconds_count - this will return all the data from the metric - inspect the available labels and their values.

Is it possible that your Stage environment didn’t have any requests? That would explain the 0 value - if you switch to Production does it show a value?

Also it seems like your http_server_request_duration_seconds_count metric is a counter - by default you should apply some function like rate or increase to view requests over time. Total requests, as the title suggests, might not always be right - e.g. if your application (pod?) restarts, the metrics will be lost (reset to 0), so you never actually know how many requests there have been throughout a period of time (not with that query at least :smile:).
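To make the reset problem concrete, here is a rough Python sketch of the idea behind increase. It is not Prometheus code: the function name and the sample values are made up for illustration, and the real increase() also extrapolates to the edges of the time window, which this ignores.

```python
def increase_with_resets(samples):
    """Sum the growth of a counter, treating any drop as a restart from 0."""
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        if cur >= prev:
            total += cur - prev   # normal growth between two scrapes
        else:
            total += cur          # counter reset: assume it restarted at 0
    return total


# Hypothetical scrapes of a request counter; the app restarts after the third one.
scrapes = [10, 15, 20, 3, 8]

naive = scrapes[-1] - scrapes[0]            # what "last minus first" would report
corrected = increase_with_resets(scrapes)   # what a reset-aware function reports

print(naive, corrected)
```

Reading the raw counter value (or subtracting first from last) goes wrong as soon as the process restarts, while the reset-aware sum keeps counting.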

Hope this helps a little and if not, feel free to reach out!

Thank you so much for replying @dawiddebowski!

I tried adding a new visualization from the explore tab, and for http_server_request_duration_seconds_count , it doesn’t see the env tag or any of its values. It is strange though because if I type {env="Production"} it finds both the label, and all three of its values. And it seems to even have results although I am not quite sure what it is measuring.

Even when I go to the variables screen on my dashboard, it finds all three environments but applying them doesn’t really do anything.

Per your question, I did try selecting production (very active atm), stage, and local and there is no difference. Local has very very little activity now and stage has a small amount. So there definitely should be obvious changes.

I added a gif screen recording of me going through the various screens so you can see. Hopefully it comes through alright.

I also really like your idea of updating total requests to be a more consistent metric. I will definitely do that once I get this part working.

Again, thank you for the time/help :slight_smile:

It doesn’t look like the gif is loading. This one should work

Ok, so:

What’s (probably) happening rn?

expression

{env="Production"}

will return all the series that have the label env set to Production. From your gif, it seems that there’s only one metric with such a label - target_info

Notice that not all the metrics must have the same set of labels, which most probably is your case - metric http_server_request_duration_seconds_count most probably does not have env label at all.

It is strange though because if I type {env="Production"} it finds both the label, and all three of its values.

Yes, it does because the metric target_info does provide this label and the three values but it doesn’t mean http_server_request_duration_seconds_count metric will have such a label.
I’m not sure I explain it well enough, so here’s another example:

Let’s say you have two applications - a Java one and a C one. In Java you might have a metric application_memory_usage. Since Java has quite different memory management than C, Java devs might have put a type label on the application_memory_usage metric. So from the Java application you will be reporting two series: application_memory_usage{application="my_awesome_java_app", type="heap"} and application_memory_usage{application="my_awesome_java_app", type="off-heap"}. Whereas a C app has no JVM-style split between heap and off-heap memory (at least I hope so, I almost failed that subject :smile:), so there’s no reason to add a type label and the C application will only expose the application_memory_usage{application="my_awesome_c_app"} metric.
What would happen when you type application_memory_usage{application="my_awesome_c_app", type="heap"} in the query window? Prometheus would want to find all series named application_memory_usage from my_awesome_c_app with memory of type heap. But there’s no such thing! Hence, empty result.

The same thing is happening to you rn. There’s a metric http_server_request_duration_seconds_count but it doesn’t have env label, so Prometheus won’t find any match. It’s like saying you need a computer that is both Mac and produced by HP. You can ask for one, but there’s none.
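If it helps, the matching behaviour can be imitated in a few lines of Python. This is only a toy model with made-up series (the select helper is hypothetical, not a Prometheus API); the one real rule it copies is that a label missing from a series behaves like an empty string, so it can never equal "Production".

```python
def select(series, name=None, **matchers):
    """Return the series whose name and labels satisfy every equality matcher."""
    result = []
    for s in series:
        if name is not None and s["__name__"] != name:
            continue
        # A label that is absent counts as "", so env="Production" cannot match it.
        if all(s.get(label, "") == value for label, value in matchers.items()):
            result.append(s)
    return result


# Made-up series: only target_info carries the env label.
series = [
    {"__name__": "target_info", "job": "otel-collector", "env": "Production"},
    {"__name__": "target_info", "job": "otel-collector", "env": "Stage"},
    {"__name__": "http_server_request_duration_seconds_count", "job": "otel-collector"},
]

# Like typing {env="Production"}: finds something, but only target_info.
by_env = select(series, env="Production")

# Like http_server_request_duration_seconds_count{env="Production"}: empty.
filtered = select(series, "http_server_request_duration_seconds_count", env="Production")

print([s["__name__"] for s in by_env], filtered)
```

So the label and its values show up in autocomplete because some series has them, while the filtered dashboard query still returns nothing.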

Why is it happening?

AFAIK Otel collectors push relevant resource information in the target_info metric (I’m not using that so I’m not sure) to limit the number of labels in other metrics (in your instance, http_server_request_duration_seconds_count).

How can I fix it?

Can you type in explore both of the following queries (exactly like that, nothing more :smile:):

  • target_info{env="Production"}
  • http_server_request_duration_seconds_count

And copy the result labels?

You can join one query with another to get the desired result but to help I need to know what are available labels on both sides (there should be one that is common for them). You should blur the values that are sensitive (e.g. some IPs etc.) but don’t blur out the names of the labels, they are important (also if the common value is sensitive, you can blur the value but please tell which label has the same value for both results).
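To give an idea of what such a join does, here is a small Python sketch. It assumes job and instance are the labels shared by both sides, which is only a guess until the real labels are known, and all label values below are invented. The PromQL in the comment shows the general shape of the query being imitated.

```python
# Rough PromQL shape this imitates (label names are an assumption):
#   http_server_request_duration_seconds_count
#     * on (job, instance) group_left (env) target_info

def join_env(metric_series, info_series, on=("job", "instance"), copy=("env",)):
    """Copy selected labels from the matching info series onto each metric series."""
    index = {}
    for info in info_series:
        key = tuple(info.get(label, "") for label in on)
        if key in index:
            # group_left needs exactly one info series per key.
            raise ValueError(f"duplicate info series for {key}")
        index[key] = info

    joined = []
    for s in metric_series:
        info = index.get(tuple(s.get(label, "") for label in on))
        if info is None:
            continue  # no partner on the other side: the series is dropped
        merged = dict(s)
        for label in copy:
            merged[label] = info.get(label, "")
        joined.append(merged)
    return joined


# Invented example data.
requests = [
    {"job": "api", "instance": "prod-1:8080", "value": 120},
    {"job": "api", "instance": "stage-1:8080", "value": 7},
]
target_info = [
    {"job": "api", "instance": "prod-1:8080", "env": "Production"},
    {"job": "api", "instance": "stage-1:8080", "env": "Stage"},
]

joined = join_env(requests, target_info)
print(joined)
```

After the join each request series carries an env label, so it can be filtered or grouped by environment even though the original metric never had that label.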

Alternatively, you could probably go into the otel collector config and play around with adding the env label to the metrics - this might be a better idea if you have access to it (unfortunately, I’m not quite sure how to do that myself).


Thank you again for taking the time!

I actually figured it out! Based on what you were saying, it seemed like the metrics just had no info on the labels. I took a step back and looked at the entire pipeline again.

In dotnet, I add the “labels” (really attributes) like this

// configure metrics for grafana
var otel = builder.Services.AddOpenTelemetry();

// Configure OpenTelemetry Resources with the application name
otel.ConfigureResource(resource =>
  {
    resource.AddService(serviceName: $"{appName}");
    var globalOpenTelemetryAttributes = new List<KeyValuePair<string, object>>();
    globalOpenTelemetryAttributes.Add(new KeyValuePair<string, object>("env", env));
    globalOpenTelemetryAttributes.Add(new KeyValuePair<string, object>("appId", appId));
    globalOpenTelemetryAttributes.Add(new KeyValuePair<string, object>("appName", appName));
    resource.AddAttributes(globalOpenTelemetryAttributes);
  });

// Add Metrics for ASP.NET Core and our custom metrics and export to Prometheus
otel.WithMetrics(metrics => metrics
  ...

Then it goes to my open telemetry collector to prometheus. I had a feeling something was getting lost here. And in fact it was. Thanks to this answer I was able to find why they were being dropped and how to fix it.

Basically, converting resource attributes to metric labels is disabled by default in the collector's prometheus exporter. So adding this fixes it

exporters:
    prometheus:
      endpoint: "0.0.0.0:8889"
      resource_to_telemetry_conversion:
        enabled: true
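For anyone wondering what that switch changes, here is a simplified Python model of it. This is not the exporter's code: the function is hypothetical, the attribute values are invented, and letting data point attributes win on a name clash is an assumption of the sketch. The idea is just that resource attributes get merged into each metric's labels, with characters that are invalid in Prometheus label names replaced by underscores.

```python
import re


def to_prometheus_labels(resource_attrs, point_attrs, enabled):
    """Model of label building with and without resource-to-telemetry conversion."""
    attrs = dict(resource_attrs) if enabled else {}
    attrs.update(point_attrs)  # assumption: point attributes win on conflict
    # Label names may only contain letters, digits and underscores.
    return {re.sub(r"[^a-zA-Z0-9_]", "_", k): str(v) for k, v in attrs.items()}


# Invented attribute values, mirroring the dotnet resource setup above.
resource = {"service.name": "my-api", "env": "Production", "appName": "my-api"}
point = {"http.request.method": "GET"}

before = to_prometheus_labels(resource, point, enabled=False)
after = to_prometheus_labels(resource, point, enabled=True)

print(before)
print(after)
```

With the conversion off, env and appName only live on the resource (and so only reach target_info); with it on, they appear as labels on every metric.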

As for the suggestion in your first comment, I updated total requests to be total requests in last n minutes using this

sum(increase(http_server_request_duration_seconds_count{job="$job", instance="$instance", env="$environment", appName="$appName"}[$__range]))
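One caveat if the environment variable is ever switched to multi-value: as far as I know, Grafana then expands it for Prometheus into a pattern like (Production|Stage), which only works with the regex matcher env=~"$environment" and not with env="$environment". The Python sketch below imitates that. It uses Python's re module as a stand-in for Prometheus's regex engine, and the selected values are just an example.

```python
import re


def regex_match(label_value, pattern):
    """Imitate =~ : Prometheus regex matchers must match the whole label value."""
    return re.fullmatch(pattern, label_value) is not None


# Example selection in the dashboard dropdown.
selected = ["Production", "Stage"]

# Assumed shape of the expanded multi-value variable.
expanded = "(" + "|".join(re.escape(v) for v in selected) + ")"

labels = ["Production", "Stage", "Local"]

with_regex = [v for v in labels if regex_match(v, expanded)]   # env=~"$environment"
with_equals = [v for v in labels if v == expanded]             # env="$environment"

print(expanded, with_regex, with_equals)
```

With the equality matcher the whole pattern is compared as a literal string, so nothing matches and the panels go empty.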

Thank you again really! You were incredibly helpful in guiding me on this :slight_smile:
