Very different values for different time ranges

Grafana shows very different CPU values on different time ranges. This is the query:

100 - (avg by (cpu) (irate(node_cpu{mode="idle", instance=~"$server"}[1m])) * 100)

When I choose a 6-hour time range, the percentage over the same period is much higher than with a 7-hour time range, probably because of a spike around 7 hours ago.
Please compare the two screenshots. The 7-hour range:

(screenshot)

And the 6-hour range:

(screenshot)

This looks like a Grafana bug. I’m using Ubuntu 16.04.5 with Grafana 5.2.2.

What’s the frequency of your data points (how many per minute)? Does it help if you increase the range vector selector to "[5m]"?
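
For example, the same query with a 5-minute window (the original expression with only the range changed) would be:

100 - (avg by (cpu) (irate(node_cpu{mode="idle", instance=~"$server"}[5m])) * 100)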

Hi,
Thanks for the answer.
As can be seen in the query, the frequency is 1 minute. I changed it to 5 minutes. It’s exactly the same.
[update]
I changed it to 5 minutes and then back to 1 minute, then changed the time range several times and now it seems to be displaying the correct data. I’m not sure what’s going on. I’ll leave it be for a while and get back to you.
It’s nonetheless rather annoying, because the values are (were) misrepresented.

By the way, the graph’s resolution setting is 1/10. I’m not sure that’s the right one, but other resolutions seem to display the data completely out of proportion.

After playing with it a little, it’s still the same; I’ve got the same problem. If the time range is longer than a certain length (and, essentially, includes certain spikes), the graph doesn’t resolve the differences and doesn’t keep the proportions. It’s as simple as that.

I would like to revive this thread, as I really don’t understand why Grafana still behaves like this a year later, with very different metrics.
I’ve got the following screenshots:
This one shows the last hour:

(screenshot)

Whereas this one shows the last three hours:

(screenshot)
As you can see, I get extremely different values, which makes absolutely no sense to me. Can anyone explain to me what is going on?

These are the queries I’m using:

irate(node_disk_reads_completed_total{instance=~"$node:$port",job=~"$job",device=~"[a-z][a-z]"}[1m])
irate(node_disk_writes_completed_total{instance=~"$node:$port",job=~"$job",device=~"[a-z][a-z]"}[1m])

The unit I’m using is I/O ops/sec.

I’m using Grafana 6.4.4 from the official Docker image.

I had problems like this on our university HPC cluster when we used Prometheus,
but since we switched to InfluxDB/Telegraf the problem is gone.

Did you check if the query graph in Prometheus itself shows the same behavior?


The issue is with the irate() function. It calculates the rate from only the last two adjacent samples in the range instead of from all the samples in each step between points on the graph. When the graph’s time range changes, the step changes, and therefore the set of samples used for the irate() calculation also changes, so you see a completely different graph. The solution is to use rate() instead of irate(). See the following article for details: Why irate from Prometheus doesn't capture spikes | by Aliaksandr Valialkin | Medium
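
For example, the queries from this thread rewritten with rate() (keeping the original label matchers and assuming a window of a few scrape intervals, e.g. 5 minutes) would look like this:

100 - (avg by (cpu) (rate(node_cpu{mode="idle", instance=~"$server"}[5m])) * 100)
rate(node_disk_reads_completed_total{instance=~"$node:$port",job=~"$job",device=~"[a-z][a-z]"}[5m])
rate(node_disk_writes_completed_total{instance=~"$node:$port",job=~"$job",device=~"[a-z][a-z]"}[5m])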


So is irate basically crap?
