Difficulty displaying the right information with the rate/increase PromQL functions

Hi,
I’m having some difficulty displaying some values and interpreting them.
My setup is:

sFlow RT and Prometheus Exporter (to export network metrics in bits/sec) => Prometheus (where metrics are created/filtered) => Grafana (to display metrics, in Gbits/sec)

The Prometheus sFlow metrics setup is fairly simple. This exporter retrieves a bits/sec value for every VLAN:

  - job_name: 'sflow-rt-flow-vlan-ingress'
    metrics_path: /app/prometheus/scripts/export.js/flows/ALL/txt
    static_configs:
      - targets: ['my_prometheus:8008']
    params:
      metric: ['vlan-metric-ingress']
      key: ['vlan']
      label: ['vlan']
      value: ['bytes']
      scale: ['8']
      minValue: ['1000']
      maxFlows: ['100']
      filter: ['direction=ingress']

From this data, I built a dashboard that displays per-VLAN throughput on a network uplink (screenshot not included here).

Next, I’d like to get the total amount of data per VLAN in a table, to identify which VLAN consumes the most data in a given time slot (e.g. a VLAN with constant low throughput vs. a VLAN with high peak throughput).

To achieve this, I was considering the rate or increase function, but I seem to get some extravagant values.

  • Using the “rate” query:
sum by(vlan) (rate(vlan_metric_ingress[1h]))

If I understand correctly, this evaluates the average per VLAN over the last hour.

The Grafana table looks like this (screenshot not included here):

  • Then, using the “increase” query:
sum by(vlan) (increase(vlan_metric_ingress[1h]))

If I understand correctly, the increase function calculates the total growth of the counter over the last hour.

My question: I’m not 100% sure I understand the main difference between them, and I don’t know which function is the right one to use (or perhaps there’s another way of doing this?).

Bonus question :
If I use the increase function with a 1h range, how is the result calculated when the Grafana UI time range is the last 3 hours?

Hi,

so:
The rate function calculates the per-second rate of a counter. An example could be:

You take the first and the last point in the desired time frame (5m in the picture, 1h in your case), subtract the first from the last, and then divide by the number of seconds between those points. In the picture, the first point is 1 and the last is 241, so 241 - 1 = 240. There are four intervals between the points and each lasts 60 seconds, so 240 / 240 = 1. That’s the rate result.
In your case, the query with rate will show the average amount of data per second sent through the VLAN over the last hour.

On the other hand, increase does not divide by the number of seconds. It takes the first point and the last point and subtracts one from the other.

(Please note that the pictures are a simplification: PromQL also handles counter resets, which are not shown here.)
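
For what it’s worth, the simplified first-point/last-point description can be sketched in a few lines of Python. This is only an illustration of the idea, not how Prometheus implements it (the real functions also extrapolate to the window boundaries); the function names and sample values are made up to match the numbers in the explanation.

```python
from typing import List, Tuple

# One sample = (timestamp in seconds, counter value). Values are illustrative.
Sample = Tuple[float, float]


def simple_increase(samples: List[Sample]) -> float:
    """Total growth of a counter across the samples.

    A drop between two consecutive samples is treated as a counter reset,
    so the new value is counted as growth since the reset.
    """
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        total += cur - prev if cur >= prev else cur
    return total


def simple_rate(samples: List[Sample]) -> float:
    """Average per-second growth: increase divided by elapsed seconds."""
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("need at least two samples spread over time")
    return simple_increase(samples) / elapsed


# Five points, 60 seconds apart, going from 1 to 241.
samples = [(0, 1), (60, 61), (120, 121), (180, 181), (240, 241)]
print(simple_increase(samples))  # 240.0
print(simple_rate(samples))      # 1.0
```

In this simplified model, increase over a window is just rate multiplied by the number of seconds between the first and last sample.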

Truth be told, increase is syntactic sugar for rate (as pointed out here), so that’s the main difference.

For your case, I would use increase, since you want to know the total amount of data in the last hour. The only question is: is the vlan_metric_ingress metric a counter? To use increase or rate, the metric should be a counter. If not, a different approach might be needed.

Bonus question:
The time range in Grafana does not matter in this case. It only determines how wide in time the plot is; it doesn’t change the increase value, since the [1h] is constant. Every point in the plot shows the preceding hour relative to its position (e.g. a point at 12:00 shows the total traffic from 11:00 to 12:00, and a point at 11:00 shows the total traffic from 10:00 to 11:00).
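
To make the sliding-window idea concrete, here is a small Python sketch. It assumes a dashboard covering three hours with one plotted point per hour; the date, the step, and the function names are arbitrary choices for illustration, not anything Grafana or Prometheus exposes.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def evaluation_points(start: datetime, end: datetime, step: timedelta) -> List[datetime]:
    """Timestamps at which the query is evaluated across the dashboard range."""
    points = []
    t = start
    while t <= end:
        points.append(t)
        t += step
    return points


def lookback_window(eval_time: datetime,
                    window: timedelta = timedelta(hours=1)) -> Tuple[datetime, datetime]:
    """The [1h] range selector always looks back from the evaluation time."""
    return eval_time - window, eval_time


# Hypothetical dashboard: last 3 hours ending at 12:00, one point per hour.
end = datetime(2024, 1, 1, 12, 0)
start = end - timedelta(hours=3)
for t in evaluation_points(start, end, timedelta(hours=1)):
    lo, hi = lookback_window(t)
    print(f"point at {hi:%H:%M} covers {lo:%H:%M} to {hi:%H:%M}")
```

Widening the dashboard range only adds more evaluation points; each point still covers exactly one hour.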

Hi @dawiddebowski
Thanks a lot for reaching out and for the explanation.

So with rate() we can say: here’s the average rate of data over the last hour (the unit is something per second, averaged over the changes within that hour).
And with increase() we can say: here’s the total amount of data in the last hour (the unit is a total number of something).

“The only question is - is the vlan_metric_ingress metric a counter? To use increase or rate the metric should be counter. If not, a different approach might be needed.”

I noticed this weekend that my Prometheus raises a warning about that (screenshot not included here):

I didn’t find anything in the sFlowRT source code that says whether it’s a counter or a gauge.

However, if we agree that a counter is a value that can only increase and a gauge is a value that can increase or decrease, then I’d say my metric seems to be a gauge.

To be precise, a counter can only increase or reset to 0. If the metrics are exposed in the OpenMetrics format, there should be HELP and TYPE comments in the file, giving the metric name together with its description and type.
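
As a rough sketch of how one could check this, the Python snippet below pulls the declared types out of a scrape payload by looking for `# TYPE <name> <type>` lines, which is how the Prometheus text format declares a metric’s type. The sample payload and metric names are invented for illustration; they are not actual sFlow-RT output, and an exporter may omit these lines entirely.

```python
from typing import Dict


def metric_types(exposition_text: str) -> Dict[str, str]:
    """Map metric name -> declared type, based on '# TYPE name type' lines."""
    types: Dict[str, str] = {}
    for line in exposition_text.splitlines():
        parts = line.split()
        if len(parts) == 4 and parts[0] == "#" and parts[1] == "TYPE":
            types[parts[2]] = parts[3]
    return types


# Hypothetical payload, for illustration only.
SAMPLE = """\
# HELP example_bytes_total Hypothetical counter, for illustration only.
# TYPE example_bytes_total counter
example_bytes_total{vlan="10"} 12345
# TYPE example_rate_bps gauge
example_rate_bps{vlan="10"} 987
"""

print(metric_types(SAMPLE))
```

If the metric does not show up in the result, the exporter did not declare a type and Prometheus treats the metric as untyped.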

If the metric turns out to be a gauge, it’s somewhat harder to get what you want. Gauges are tricky to deal with because of the intervals at which the metrics are scraped. Let’s say you scrape your metrics at 15-second intervals. A gauge records the value at each scrape, but you have no information about what happened between scrapes. For example, look at the image below: the yellow points represent when the gauge is scraped, while the green points represent the actual data. You can see that multiple points are missed (the yellow dots are meant only as time indicators, not as actual numerical values).

IMHO a metric like this should be a counter (the name might not reflect that). I wouldn’t really understand what the metric means as a gauge: it would be how many bytes were processed… when? Gauges are usually used for things like current memory usage or the current number of threads spawned (e.g. in Java). So you might need some documentation to understand what this metric means. If you find none, you can also check whether it’s a counter or a gauge by querying a longer period of time: if a single line never drops, chances are it’s a counter. If it drops, it’s a gauge (by “drop” I mean a relatively small drop, not a reset to 0). But this is not 100% foolproof.
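
That heuristic could be written down roughly as follows in Python. The function name and the `reset_fraction` threshold (how far a value must fall to be considered a reset rather than an ordinary drop) are assumptions made up for this sketch, and as said above, the check is not foolproof.

```python
from typing import Sequence


def looks_like_counter(values: Sequence[float], reset_fraction: float = 0.1) -> bool:
    """Return False as soon as a 'small' drop is seen between two samples.

    A fall to at most reset_fraction of the previous value is treated as a
    counter reset and tolerated; any other decrease suggests a gauge.
    """
    for prev, cur in zip(values, values[1:]):
        if cur < prev and cur > reset_fraction * prev:
            return False
    return True


print(looks_like_counter([1, 5, 9, 9, 20]))   # True: never drops
print(looks_like_counter([1, 5, 9, 0, 4]))    # True: only a reset to 0
print(looks_like_counter([10, 12, 11, 13]))   # False: small drop
```

A gauge that happens to rise steadily over the observed period would still pass this check, so a longer query range gives a more trustworthy answer.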

So what if it’s a gauge? You can leverage the sum_over_time function from PromQL and run a query like:
sum_over_time(vlan_metric_ingress[1h:<your scrape interval, e.g. 15s>])
You don’t need rate or increase, as the metric gives you the value as-is. sum_over_time sums the samples over the range given as the first part of the lookbehind window ([]), taking one sample per step given as the second part (more info here and here).
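
One caveat worth sketching: if the gauge is already a per-second rate (the thread describes the exporter as producing bits/sec), then a plain sum of samples is not yet a volume. Under the assumption that samples are evenly spaced and each one holds for a full step, multiplying the sum by the step in seconds approximates the total. The Python below illustrates that arithmetic with made-up numbers; in PromQL the equivalent would be multiplying the sum_over_time result by the step in seconds.

```python
from typing import Sequence


def total_from_rate_gauge(samples_bps: Sequence[int], step_seconds: int) -> int:
    """Approximate total bits from evenly spaced bits/sec samples.

    Each sample is assumed to be representative of one full step.
    """
    return sum(samples_bps) * step_seconds


# Hypothetical: four samples of 8 Gbit/s taken every 15 seconds (one minute).
samples = [8_000_000_000] * 4
total_bits = total_from_rate_gauge(samples, 15)
print(total_bits / 8 / 1e9)  # 60.0 (gigabytes)
```

The shorter the step, the closer this approximation gets to the real volume, but traffic spikes between samples are still invisible.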


Thanks again for all these explanations about gauge vs counter. PromQL functions are not so easy to deal with at first sight :smiley:
I’ll check, but yes, according to the explanations it’s probably a counter rather than a gauge (for a 1d query, nothing drops).

I ran another test yesterday: while querying with increase(vlan_metric_ingress[10m]), I downloaded a 10GB file to check whether the total amount of data on a specific VLAN grows by 10GB, and it seems it does (within a few percent, which looks good).

Funny thing: there’s a sort of time offset in when the value increases. If the download really happened between 01:00 and 01:10 PM, I see the value growing between 01:05 and 01:15/20 PM.

Nevertheless, it seems the query and function show the right information this time :smiley:

EDIT: also, I got less data for the last 6 hours (80GB) than for the last 3 hours (e.g. 170GB). I think it has something to do with Grafana timestamps and/or how PromQL evaluates the query over time, but anyway, it gives a trend.