Grafana Dashboard results doubled due to 2 Prometheus replicas

  • What Grafana version and what operating system are you using?
    Grafana v10.0.3 on a Kubernetes cluster

  • What are you trying to achieve?
    Build a dashboard with some basic information about the cluster, like node count, pod count…

  • How are you trying to achieve it?
    sum(cluster:master_nodes)

  • What happened?
    The query showed 6 master nodes.

  • What did you expect to happen?
    I expected it to show 3 master nodes.

  • Did you receive any errors in the Grafana UI or in related logs? If so, please tell us exactly what they were.
    I didn't receive any errors.

I looked into the query responses and saw that I get the same data from two Prometheus replicas. Is there an option to consolidate the data if it is the same but comes from two replicas? I did some googling and found how to do it if you have two separate data sources, e.g. Prometheus and InfluxDB, but in my case it counts as one data source in Grafana, since it is a single Prometheus that has two replicas for HA.
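
One thing I am considering: if the two copies of each series differ only in a label that identifies the replica (kube-prometheus-style HA setups usually add an external label such as prometheus_replica; the label name is an assumption on my side, so the labels on the duplicated series would need to be checked first), the duplicates could be collapsed in the query before summing, roughly like this:

    # Sketch: collapse the assumed replica label first, then sum.
    # "prometheus_replica" is an assumed label name; replace it with whatever
    # label actually distinguishes the two replicas.
    sum(max without (prometheus_replica) (cluster:master_nodes))

Is that the right direction, or is there a cleaner way to deduplicate this on the datasource side (for example, a deduplicating query layer such as Thanos Query in front of the two replicas)?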

Cheers
Marcel

Hi everyone,

I’m also encountering this issue after increasing the Prometheus replicas to 2 for high availability (HA). This is causing the following metrics to display incorrect values, effectively doubling them:

  • cluster:namespace:pod_cpu:active:kube_pod_container_resource_limits
  • cluster:namespace:pod_cpu:active:kube_pod_container_resource_requests
  • kube_node_status_capacity
  • kube_node_status_allocatable
  • Any metric with the instance label

Problem:

The main issue is that each metric appears to be logged twice, causing doubled results when summed. For example, metrics like node capacity and allocation display duplicate values for the same nodes.

Example:
The table format shows that there are multiple entries for the same node, each with identical capacity values. See the tables below:

    node                  instance              Capacity
    node-01.example.com   [1234:abcd::1]:9090   128 GiB
    node-01.example.com   [1234:abcd::2]:9090   128 GiB

Here’s an example confirming that different instance values actually correspond to the same node UUID:

    node                  instance              system_uuid
    node-01.example.com   [1234:abcd::1]:9090   uuid-1234-abcd-5678-efgh
    node-01.example.com   [1234:abcd::2]:9090   uuid-1234-abcd-5678-efgh
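
A quick way to double-check that the duplication really is per instance is a query along these lines, which counts how many distinct instance values each node exposes for one of the affected metrics (a sketch; swap in whichever metric you are inspecting):

    # Sketch: count the (node, instance) combinations per node.
    # A result of 2 for every node means each node is reported twice.
    count by (node) (count by (node, instance) (kube_node_status_capacity))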

Cause:

From my investigation, it seems that each node in the cluster reports the same two instance values as every other node. These instance values change over time but remain identical across the cluster at any given time. The metrics are correct when only one instance is present, so this duplication from both Prometheus replicas seems to be at the root of the issue.
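
Until the root cause is clear, one workaround I am considering is collapsing the instance dimension inside each query, so every node only contributes one series (a sketch of the general pattern, not a confirmed fix for this setup):

    # Sketch: keep every label except "instance", so the two duplicate
    # series per node collapse into one before any sum() is applied on top.
    max without (instance) (kube_node_status_capacity)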

Questions:

If anyone has dealt with this before, I would appreciate any guidance on:

  • Understanding the origin of the instance label in these duplicated metrics
  • Identifying why each node is reported with multiple instance values

Thanks in advance for any help!

I have the same problem. Have you found any solution for this issue?