I am trying out Grafana Cloud and added the Kubernetes integration. I then deployed the agent and the metrics server as per the instructions on the integration page. I am able to view the kubelet dashboards, but I can't see any data in the cluster dashboards.
The Grafana version on Grafana Cloud is 8.5.0. I am using the defaults for everything, including the dashboards and the agent config.
My agent config contains the following scrape config:
scrape_configs:
  - bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    job_name: integrations/kubernetes/cadvisor
    kubernetes_sd_configs:
      - role: node
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: storage_operation_duration_seconds_bucket|kubelet_pleg_relist_interval_seconds_bucket|kube_horizontalpodautoscaler_spec_min_replicas|kube_daemonset_updated_number_scheduled|kube_pod_owner|kube_node_spec_taint|node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate|namespace_workload_pod|kubelet_volume_stats_inodes|kubelet_running_containers|kubelet_pod_worker_duration_seconds_bucket|kube_statefulset_status_replicas_ready|kube_resourcequota|kubelet_node_name|kube_statefulset_status_observed_generation|node_namespace_pod_container:container_memory_swap|cluster:namespace:pod_memory:active:kube_pod_container_resource_requests|kube_node_status_capacity|container_memory_cache|kube_statefulset_replicas|kube_deployment_status_replicas_available|go_goroutines|kubelet_pleg_relist_duration_seconds_bucket|kube_job_status_succeeded|kube_pod_info|kubelet_pleg_relist_duration_seconds_count|kubelet_cgroup_manager_duration_seconds_bucket|kube_daemonset_status_desired_number_scheduled|node_namespace_pod_container:container_memory_working_set_bytes|kube_node_status_condition|kube_daemonset_status_number_available|kubelet_volume_stats_capacity_bytes|kube_horizontalpodautoscaler_spec_max_replicas|container_memory_rss|container_network_receive_packets_total|node_namespace_pod_container:container_memory_rss|kube_deployment_metadata_generation|container_fs_writes_total|kube_node_status_allocatable|kube_pod_container_status_waiting_reason|kube_pod_status_phase|container_fs_reads_bytes_total|storage_operation_errors_total|kube_statefulset_status_update_revision|container_network_transmit_bytes_total|container_network_transmit_packets_total|up|node_quantile:kubelet_pleg_relist_duration_seconds:histogram_quantile|container_cpu_usage_seconds_total|kube_node_info|container_network_receive_packets_dropped_total|container_memory_working_set_bytes|kubelet_running_pods|kubelet_pod_start_duration_seconds_count|kube_statefulset_metadata_generation|kube_deployment_status_observed_generation|container_network_receive_bytes_total|process_resident_memory_bytes|kubelet_running_container_count|kubelet_volume_stats_available_bytes|kubelet_cgroup_manager_duration_seconds_count|kube_horizontalpodautoscaler_status_current_replicas|kube_daemonset_status_number_misscheduled|kube_pod_container_resource_requests|rest_client_requests_total|kubelet_server_expiration_renew_errors|machine_memory_bytes|kubelet_runtime_operations_duration_seconds_bucket|kube_job_failed|kube_daemonset_status_current_number_scheduled|cluster:namespace:pod_memory:active:kube_pod_container_resource_limits|kubelet_pod_worker_duration_seconds_count|kubernetes_build_info|kubelet_volume_stats_inodes_used|namespace_workload_pod:kube_pod_owner:relabel|namespace_memory:kube_pod_container_resource_requests:sum|storage_operation_duration_seconds_count|kube_statefulset_status_replicas|container_memory_swap|volume_manager_total_volumes|kubelet_node_config_error|kubelet_runtime_operations_total|kubelet_runtime_operations_errors_total|kube_deployment_spec_replicas|container_network_transmit_packets_dropped_total|process_cpu_seconds_total|kube_deployment_status_replicas_updated|node_namespace_pod_container:container_memory_cache|kubelet_running_pod_count|namespace_memory:kube_pod_container_resource_limits:sum|kube_statefulset_status_replicas_updated|kubelet_certificate_manager_client_expiration_renew_errors|kube_replicaset_owner|kube_pod_container_resource_limits|cluster:namespace:pod_cpu:active:kube_pod_container_resource_limits|container_cpu_cfs_throttled_periods_total|namespace_cpu:kube_pod_container_resource_requests:sum|cluster:namespace:pod_cpu:active:kube_pod_container_resource_requests|rest_client_request_duration_seconds_bucket|kube_horizontalpodautoscaler_status_desired_replicas|namespace_cpu:kube_pod_container_resource_limits:sum|container_fs_reads_total|container_fs_writes_bytes_total|container_cpu_cfs_periods_total|kube_job_spec_completions|kube_statefulset_status_current_revision|kubelet_certificate_manager_server_ttl_seconds|kube_namespace_created|kubelet_certificate_manager_client_ttl_seconds
        action: keep
    relabel_configs:
      - replacement: kubernetes.default.svc.cluster.local:443
        target_label: __address__
      - regex: (.+)
        replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
        source_labels:
          - __meta_kubernetes_node_name
        target_label: __metrics_path__
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      insecure_skip_verify: false
      server_name: kubernetes
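As a basic sanity check, I assume a query along these lines in Explore should show whether the cadvisor targets are being scraped at all (the job name is taken from the config above; the check itself is just my guess):

# should return one series per node, with value 1 if the cadvisor scrape succeeded
up{job="integrations/kubernetes/cadvisor"}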
One of the cluster dashboards contains the following query:
targets:
  - expr: sum(node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate{cluster="$cluster"})
However, the metrics exposed by the cadvisor endpoint don't seem to include anything named node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate. They do include container_cpu_usage_seconds_total, though. For example:
# HELP container_cpu_usage_seconds_total Cumulative cpu time consumed in seconds.
# TYPE container_cpu_usage_seconds_total counter
container_cpu_usage_seconds_total{container="",cpu="total",id="/",image="",name="",namespace="",pod=""} 734674.400339535 1650908621828
container_cpu_usage_seconds_total{container="",cpu="total",id="/kubepods",image="",name="",namespace="",pod=""} 143119.498904478 1650908621862
container_cpu_usage_seconds_total{container="",cpu="total",id="/kubepods/besteffort",image="",name="",namespace="",pod=""} 112884.722452293 1650908622310
I believe that updating the dashboard queries to use container_cpu_usage_seconds_total would fix this, but I am not sure that is the correct solution. I don't know whether the issue is with cadvisor itself, the agent config file, or the dashboards, and I would like to make the correct fix.
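If rewriting the dashboard queries is the right approach, I imagine the panel above would end up as something roughly like the following. This is only my guess at an equivalent expression over the raw cadvisor metric; the image!="" matcher, the 5m range, and the assumption that the agent attaches a cluster label to the raw series are mine, not taken from any official dashboard:

# rough equivalent of the original panel query, built from the raw cadvisor metric
sum(
  irate(container_cpu_usage_seconds_total{job="integrations/kubernetes/cadvisor", cluster="$cluster", image!=""}[5m])
)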
Any help is appreciated.