Cluster metrics dashboards are not working with AKS clusters

I am trying out Grafana Cloud and added the Kubernetes integration. I then deployed the Agent and the metrics server as per the instructions on the integration page. I am able to view the kubelet dashboards but can’t see data in any of the cluster dashboards.

The Grafana Cloud version is 8.5.0. I am using the defaults for everything, including the dashboards and the Agent config.

My Grafana Agent config contains the following scrape config:

scrape_configs:
- bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  job_name: integrations/kubernetes/cadvisor
  kubernetes_sd_configs:
    - role: node
  metric_relabel_configs:
    - source_labels: [__name__]
      regex: storage_operation_duration_seconds_bucket|kubelet_pleg_relist_interval_seconds_bucket|kube_horizontalpodautoscaler_spec_min_replicas|kube_daemonset_updated_number_scheduled|kube_pod_owner|kube_node_spec_taint|node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate|namespace_workload_pod|kubelet_volume_stats_inodes|kubelet_running_containers|kubelet_pod_worker_duration_seconds_bucket|kube_statefulset_status_replicas_ready|kube_resourcequota|kubelet_node_name|kube_statefulset_status_observed_generation|node_namespace_pod_container:container_memory_swap|cluster:namespace:pod_memory:active:kube_pod_container_resource_requests|kube_node_status_capacity|container_memory_cache|kube_statefulset_replicas|kube_deployment_status_replicas_available|go_goroutines|kubelet_pleg_relist_duration_seconds_bucket|kube_job_status_succeeded|kube_pod_info|kubelet_pleg_relist_duration_seconds_count|kubelet_cgroup_manager_duration_seconds_bucket|kube_daemonset_status_desired_number_scheduled|node_namespace_pod_container:container_memory_working_set_bytes|kube_node_status_condition|kube_daemonset_status_number_available|kubelet_volume_stats_capacity_bytes|kube_horizontalpodautoscaler_spec_max_replicas|container_memory_rss|container_network_receive_packets_total|node_namespace_pod_container:container_memory_rss|kube_deployment_metadata_generation|container_fs_writes_total|kube_node_status_allocatable|kube_pod_container_status_waiting_reason|kube_pod_status_phase|container_fs_reads_bytes_total|storage_operation_errors_total|kube_statefulset_status_update_revision|container_network_transmit_bytes_total|container_network_transmit_packets_total|up|node_quantile:kubelet_pleg_relist_duration_seconds:histogram_quantile|container_cpu_usage_seconds_total|kube_node_info|container_network_receive_packets_dropped_total|container_memory_working_set_bytes|kubelet_running_pods|kubelet_pod_start_duration_seconds_count|kube_statefulset_metadata_generation|kube_deployment_status_observed_generation|container_network_receive_bytes_total|process_resident_memory_bytes|kubelet_running_container_count|kubelet_volume_stats_available_bytes|kubelet_cgroup_manager_duration_seconds_count|kube_horizontalpodautoscaler_status_current_replicas|kube_daemonset_status_number_misscheduled|kube_pod_container_resource_requests|rest_client_requests_total|kubelet_server_expiration_renew_errors|machine_memory_bytes|kubelet_runtime_operations_duration_seconds_bucket|kube_job_failed|kube_daemonset_status_current_number_scheduled|cluster:namespace:pod_memory:active:kube_pod_container_resource_limits|kubelet_pod_worker_duration_seconds_count|kubernetes_build_info|kubelet_volume_stats_inodes_used|namespace_workload_pod:kube_pod_owner:relabel|namespace_memory:kube_pod_container_resource_requests:sum|storage_operation_duration_seconds_count|kube_statefulset_status_replicas|container_memory_swap|volume_manager_total_volumes|kubelet_node_config_error|kubelet_runtime_operations_total|kubelet_runtime_operations_errors_total|kube_deployment_spec_replicas|container_network_transmit_packets_dropped_total|process_cpu_seconds_total|kube_deployment_status_replicas_updated|node_namespace_pod_container:container_memory_cache|kubelet_running_pod_count|namespace_memory:kube_pod_container_resource_limits:sum|kube_statefulset_status_replicas_updated|kubelet_certificate_manager_client_expiration_renew_errors|kube_replicaset_owner|kube_pod_container_resource_limits|cluster:namespace:pod_cpu:active:kube_pod_container_resource_limits|container_cpu_cfs_throttled_periods_total|namespace_cpu:kube_pod_container_resource_requests:sum|cluster:namespace:pod_cpu:active:kube_pod_container_resource_requests|rest_client_request_duration_seconds_bucket|kube_horizontalpodautoscaler_status_desired_replicas|namespace_cpu:kube_pod_container_resource_limits:sum|container_fs_reads_total|container_fs_writes_bytes_total|container_cpu_cfs_periods_total|kube_job_spec_completions|kube_statefulset_status_current_revision|kubelet_certificate_manager_server_ttl_seconds|kube_namespace_created|kubelet_certificate_manager_client_ttl_seconds
      action: keep
  relabel_configs:
    - replacement: kubernetes.default.svc.cluster.local:443
      target_label: __address__
    - regex: (.+)
      replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
      source_labels:
        - __meta_kubernetes_node_name
      target_label: __metrics_path__
  scheme: https
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    insecure_skip_verify: false
    server_name: kubernetes

One of the cluster dashboards contains the following query:

targets:
- expr: sum(node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate{cluster="$cluster"})

But the metrics from the cAdvisor endpoint don’t seem to include anything named node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate. They do contain container_cpu_usage_seconds_total, though. For example:

# HELP container_cpu_usage_seconds_total Cumulative cpu time consumed in seconds.
# TYPE container_cpu_usage_seconds_total counter
container_cpu_usage_seconds_total{container="",cpu="total",id="/",image="",name="",namespace="",pod=""} 734674.400339535 1650908621828
container_cpu_usage_seconds_total{container="",cpu="total",id="/kubepods",image="",name="",namespace="",pod=""} 143119.498904478 1650908621862
container_cpu_usage_seconds_total{container="",cpu="total",id="/kubepods/besteffort",image="",name="",namespace="",pod=""} 112884.722452293 1650908622310

I believe that updating the dashboard queries to use container_cpu_usage_seconds_total would fix this, but I am not sure that is the correct solution. I don’t know whether the issue lies with cAdvisor itself, the Agent config file, or the dashboards, and I would like to make the correct fix.

Any help is appreciated.

Hello! Are the pre-built panels that rely on kube-state-metrics returning “No Data”? If so, the issue might be related to a service name mismatch.

In the Agent ConfigMap, there is a section that scrapes kube-state-metrics with a relabel_configs regex looking for ksm-kube-state-metrics. Specifically:

relabel_configs:
  - action: keep
    regex: ksm-kube-state-metrics
    source_labels:
      - __meta_kubernetes_service_name

In this case, there are issues when the service is named kube-state-metrics instead of ksm-kube-state-metrics. This is usually because kube-state-metrics was deployed by some means other than Helm.

You can try changing the service name that the regex matches in the Agent ConfigMap and test whether the dashboards return data. If that does not resolve the issue, I recommend opening a ticket with the Support team and referencing this post so they can take a closer look!
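
If it helps, you can first check what the kube-state-metrics Service is actually called. The label selector below assumes the standard app.kubernetes.io/name label that the kube-state-metrics Helm chart applies:

kubectl get svc -A -l app.kubernetes.io/name=kube-state-metrics

Then make the keep regex in the Agent ConfigMap match whatever name kubectl reports (or rename the Service to match the regex).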

To add to Melody’s great answer, the node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate metric is generated by a recording rule. You can find these in Alerting → Alert Rules → k8s.rules. You should see the metric and the recording rule / query that generates it. If you run that query and don’t get any data back, it means one or more of the referenced metrics don’t have data.
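
For context, the upstream kubernetes-mixin defines that recording rule roughly as follows (the exact job and label matchers vary by version, so treat this as illustrative):

record: node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate
expr: |
  sum by (cluster, namespace, pod, container) (
    irate(container_cpu_usage_seconds_total{image!=""}[5m])
  ) * on (cluster, namespace, pod) group_left (node)
    topk by (cluster, namespace, pod) (
      1, max by (cluster, namespace, pod, node) (kube_pod_info{node!=""})
    )

Note that it joins the cAdvisor series against kube_pod_info, which comes from kube-state-metrics.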

Since you can see container_cpu_usage_seconds_total, it’s likely an issue with the kube_pod_info metric, which is generated by kube-state-metrics. As Melody mentioned, I would make sure kube-state-metrics has been deployed correctly and that the corresponding Agent scrape config is set up to scrape it.

When you’re able to query kube_pod_info from Grafana/Explore, everything should begin working correctly.
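
A quick sanity check in Explore is to run a simple query against that metric, for example:

count by (namespace) (kube_pod_info)

If that returns series, kube-state-metrics is being scraped; if it returns nothing, the scrape job still isn’t picking up the service.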

Oh, that was it! And here I was thinking that cAdvisor was misconfigured and not sending the correct metrics. I had used the Helm chart for kube-state-metrics but gave it my own release name (helm install grafana-cloud prometheus-community/kube-state-metrics instead of helm install ksm prometheus-community/kube-state-metrics). I didn’t realise that this could lead to the problem.
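
In case it helps anyone else: the chart derives the Service name from the release name, so my install created a Service with a different name than the Agent config expected. The names below are what the chart’s naming convention generates:

helm install ksm prometheus-community/kube-state-metrics
# creates Service: ksm-kube-state-metrics

helm install grafana-cloud prometheus-community/kube-state-metrics
# creates Service: grafana-cloud-kube-state-metrics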

I believe this should be added as a footnote on the Kubernetes integration configuration details page (the page that comes after you click on the Kubernetes integration in the Grafana Integrations dashboard). I don’t think the docs are open source, otherwise I would have contributed that myself.

EDIT: I am beginning to doubt myself :face_with_spiral_eyes:. I was pretty sure that the documentation in the integration dashboard had regex: ksm-kube-state-metrics, but now that I am rechecking it, it’s regex: kube-state-metrics. Did I misread it earlier? That can’t be, because you folks figured out exactly what was wrong. My brain hurts…

EDIT 2: Ah, now I see: the docs were updated so that a label is used to refer to the service instead of the service name, hence the change. That’s a nice change!

Apologies for the confusion — you nailed it exactly!

We pushed an update to the integration’s setup instructions (likely between the time you initially set things up and when you revisited them) that changed the scrape config to use a different selector, so it no longer looks at the release name (your issue above was quite common).
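
For reference, the updated instructions match on the standard app label rather than the Service name, along these lines (illustrative; the integration page has the authoritative config):

relabel_configs:
  - action: keep
    regex: kube-state-metrics
    source_labels:
      - __meta_kubernetes_service_label_app_kubernetes_io_name

Because the Helm chart always sets app.kubernetes.io/name: kube-state-metrics regardless of the release name, this works no matter what you call the release.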
