fedrax
February 9, 2018, 12:12pm
1
Hi,
I’m having some issues with Grafana / Prometheus.
The graphs have inconsistent data. The first 2 or 3 times I refresh, it shows different values on the same graph for the same timeframe.
Any idea what is causing this?
Could you provide some more details please? What does your query look like, how is the data incorrect, etc.?
fedrax
February 14, 2018, 2:53pm
3
It's a simple query: `mysql_slave_status_seconds_behind_master`
I use Prometheus and their mysqld_exporter.
When I enter the dashboard the first time:
After a refresh:
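One way to take Grafana out of the picture is to run the same range query straight against the Prometheus HTTP API a few seconds apart and compare the results. A minimal sketch, assuming Prometheus is reachable on localhost:9090 and the `requests` package is installed (adjust both to your setup):

```
import time
import requests

PROM = "http://localhost:9090"  # assumed address; point this at your server
QUERY = "mysql_slave_status_seconds_behind_master"

def query_range(window=3600, step=15):
    """Fetch the last `window` seconds of the metric at a fixed step."""
    end = time.time()
    resp = requests.get(f"{PROM}/api/v1/query_range", params={
        "query": QUERY,
        "start": end - window,
        "end": end,
        "step": step,
    })
    resp.raise_for_status()
    return resp.json()["data"]["result"]

# Two "refreshes" a few seconds apart, like reloading the dashboard.
# Comparing these shows whether the same shifting-values effect appears
# without Grafana's rendering in the loop.
first = query_range()
time.sleep(5)
second = query_range()
print(first[0]["values"][:5] if first else "no data")
print(second[0]["values"][:5] if second else "no data")
```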
Looks similar to this issue:
(GitHub issue opened 01:47PM - 24 Jan 17 UTC, closed 02:53PM - 26 Jan 17 UTC)
**What did you do?**
Looking at simple CPU usage metrics for a host, every time Grafana refreshes, the values on the graph change significantly for the same metric reported by the same Prometheus host.
**What did you expect to see?**
The metric to always report the same value for a given timestamp
**What did you see instead? Under which circumstances?**
Here are screenshots of a simple graph panel in Grafana, taken a few seconds apart after refreshing the graphs:
![image](https://cloud.githubusercontent.com/assets/120915/22247364/54e8c6fa-e231-11e6-9314-df6f65845eb4.png)
![image](https://cloud.githubusercontent.com/assets/120915/22247365/5a9b2408-e231-11e6-9bd2-6b1bb692d84d.png)
Notice how not just the shape changes but the raw values are pretty different: one indicates CPU load of about 40%, the other about 80%.
The query above is a relabelling of raw metrics, with Grafana stacking the graphs, but to rule out either of those as issues, `cpu_usage_user{host="$host"}` (where `$host` is a constant defined in the Grafana template) does the same thing with a single line being displayed.
You can see the source in this case is `prometheus.0`, which is a direct source connecting to one of a pair of Prometheus hosts, to rule out anything odd with our load balancing/failover, and I *do* see the same behaviour if I pick the other replica `prometheus.1` too.
Hitting the refresh button results in a radical change in the graph shape, even over past data points, on maybe 30% of refreshes.
It *only* seems to happen when looking at a "last X" time range - picking an absolute time range in the past doesn't seem to cause it, so the issue appears to be caused by interpolating the data to start at a specific time. I understand that will cause some differences in interpolation, which may render sharp spikes differently etc., but the overall difference between 40% usage over most of 24 hours and 80% seems too huge a discrepancy to be expected rounding error. It throws into question how reliable any graph is if the value is so sensitive to which time range is selected (i.e. a few seconds' difference changes the whole graph substantially).
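One way this can happen in principle: a range query is evaluated at start, start+step, start+2*step, and so on, and each evaluation point takes the most recent raw sample at or before it, so sliding the window start by a few seconds changes which samples the displayed points land on. With a coarse step over spiky data, that alone can swing an average dramatically. A toy illustration of the selection idea (not Prometheus's actual code), assuming 15s samples and a 2 minute step:

```
def eval_at(samples, t):
    """Instant selection: the most recent sample at or before time t."""
    eligible = [v for ts, v in samples if ts <= t]
    return eligible[-1] if eligible else None

def query_range(samples, start, end, step):
    """Evaluate at start, start+step, ..., end, like a range query does."""
    values = (eval_at(samples, t) for t in range(start, end + 1, step))
    return [v for v in values if v is not None]

# Fake 15s samples over 20 minutes: a brief spike every 2 minutes.
samples = [(15 * i, 100.0 if i % 8 == 0 else 5.0) for i in range(80)]

a = query_range(samples, start=10, end=1090, step=120)
b = query_range(samples, start=20, end=1100, step=120)  # shifted by just 10s
print(sum(a) / len(a))  # every step lands on a spike  -> 100.0
print(sum(b) / len(b))  # every step just misses them  -> 5.0
```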
I've tried different resolution and step settings in Grafana and they don't seem to make a difference to the fact that refreshes substantially change the graph content (obviously they affect the graph in expected ways).
This *might* be a Grafana bug - I've not been able to trace the actual API calls it is making yet; however, I've seen the Grafana code before and it seems to be doing the sane thing.
As a quick investigation, I've scripted the following (a rough sketch of the approach appears after this list): https://gist.github.com/banks/7d3e1b0f43d88a6fb8ca238d761eb784
- It issues queries to a single prometheus server for a single metric over a 24 hour period
- Each query moves the start and end timestamps forward by 10 seconds (our samples are collected every 15 seconds)
- It does this over a range of 10 mins, i.e. all of these graphs are overlapping by at least 23 hours and 50 mins.
- Step size is 2 minutes (I tried different values and could reproduce the general result with any I tried).
- Each pair of lines in the output shows the start and end timestamps, then, nested below, the mean, max and min sample values seen.
- _Mean_ CPU usage percentage across 24 hours varies from ~23% to ~46%, which is a pretty substantial difference. The jumps seem to follow no pattern - occasionally two steps seem to hit the same samples and get an identical result twice in a row, but mostly every new sample that is included/excluded changes the overall data by far more than one would expect a single non-anomalous sample to skew things.
- Min and max vary quite a bit too
- I added full raw responses to two consecutive requests that had very different mean values (29 and 45) despite being just 10 seconds different in their start and end times across 24 hours...
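A minimal sketch of that kind of sweep (not the gist itself, just the shape of it), assuming the `requests` package and substituting your own Prometheus address and metric name:

```
import time
import requests

PROM = "http://prometheus.0:9090"   # assumed address
QUERY = "cpu_usage_user"            # assumed metric name, for illustration
WINDOW = 24 * 3600                  # 24 hour range
STEP = 120                          # 2 minute step
SHIFT = 10                          # slide the window by 10s each iteration

now = int(time.time())
for offset in range(0, 600, SHIFT):     # sweep the window over 10 minutes
    end = now - 600 + offset            # keep every window fully in the past
    start = end - WINDOW
    result = requests.get(f"{PROM}/api/v1/query_range", params={
        "query": QUERY, "start": start, "end": end, "step": STEP,
    }).json()["data"]["result"]
    values = [float(v) for _, v in result[0]["values"]] if result else []
    if values:
        print(f"{start} -> {end}")
        print(f"  mean={sum(values) / len(values):.1f}  "
              f"max={max(values):.1f}  min={min(values):.1f}")
```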
I can provide more data if it helps, including full API responses to each query etc.
I'm fairly sure I've seen similar effects on other graphs (not just CPU usage percentage which is a calculated value) in the past and assumed it was to do with processing rate/irate and time interpolation, but this is a raw metric value as stored.
If this is somehow expected/explainable behaviour, it would be great to document how to reason about it - the fact that which second the dashboard loads has such a huge impact on how the data actually gets shown seems to be a significant issue for a monitoring tool we rely on for understanding our systems.
**Environment**
* System information:
```
paul@prometheus.0 ~ $uname -srm
Linux 3.13.0-74-generic x86_64
```
* Prometheus version:
```
prometheus, version 1.0.1 (branch: master, revision: be40190)
build user: root@e881b289ce76
build date: 20160722-19:54:46
go version: go1.6.2
```
I realise this is not the latest version; we will upgrade soon. I've not seen issues/changelog entries about this, although it seems pretty significant.
* Prometheus configuration file:
```
global:
  scrape_interval: 15s # By default, scrape targets every 15 seconds.
  evaluation_interval: 15s # By default, scrape targets every 15 seconds.

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets:
          - prometheus.0.dblayer.com:9090
          - prometheus.1.dblayer.com:9090
        labels:
          scrape_host: prometheus.0

  - job_name: 'discovered'
    file_sd_configs:
      - files:
          - /opt/prometheus/conf.d/*.yml
```
You could have a look at the Percona dashboard for MySQL replication to see how they handle the step parameter:
```
{
  "annotations": {
    "list": [
      {
        "builtIn": 1,
        "datasource": "-- Grafana --",
        "enable": true,
        "hide": false,
        "iconColor": "#e0752d",
        "limit": 100,
        "name": "PMM Annotations",
        "showIn": 0,
        "tags": [
          "pmm_annotation"
        ],
        "type": "tags"
      },
      {
        "builtIn": 1,
        "datasource": "-- Grafana --",
```
(file truncated)
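For context, the part that matters in those dashboards is the per-panel Prometheus target, where the minimum interval / step is pinned rather than left to float with the refresh time. Roughly what that looks like in the panel JSON - field names recalled from PMM-style dashboards and the expression/variable names are just placeholders, so treat this as illustrative:

```
"targets": [
  {
    "expr": "mysql_slave_status_seconds_behind_master{instance=\"$host\"}",
    "interval": "$interval",
    "intervalFactor": 1,
    "step": 300,
    "legendFormat": "{{instance}}",
    "refId": "A"
  }
]
```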