Advanced Use - question

dast04 · August 1, 2024, 3:12pm

I have created a custom exporter that scrapes GPU usage information across the nodes of a Kubernetes cluster.
The catch is that some nodes have 1 GPU, while others have 2 or 4.

Here is an example of the dashboard:

gpu_metrics{cluster='$cluster', pod_name="$pod_name"}

I use the following transforms:

Organize fields by name
Group by

Goal:
I am wondering whether, instead of creating 1 table for all 4 GPUs, I can somehow dynamically provision 4 separate visualizations, one per GPU (each GPU has a unique series that is used as its identifier).

The reason I am looking for dynamic provisioning is so that a node with 1 GPU gets only 1 visualization, and a node with 4 GPUs gets 4.

Any suggestions are welcome

yosiasz · August 1, 2024, 3:41pm

Can you create a variable for the distinct list of GPUs?
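
One way to act on this suggestion is a query variable combined with a repeating panel, so Grafana draws one panel per GPU that the selected pod actually reports. The sketch below only builds the relevant part of a dashboard JSON model in Python. The label name `gpu` is an assumption (substitute whatever label uniquely identifies a GPU in your exporter), the titles are placeholders, and the exact JSON shape can differ between Grafana versions.

```python
import json


def build_dashboard(metric="gpu_metrics", gpu_label="gpu"):
    """Return a dashboard dict with one panel repeated per value of $gpu.

    `gpu_label` is a hypothetical label name; the real exporter may use another.
    """
    selector = 'cluster="$cluster", pod_name="$pod_name"'
    gpu_variable = {
        "name": "gpu",
        "type": "query",
        # Prometheus data source helper: distinct values of the label,
        # restricted to the currently selected cluster and pod.
        "query": "label_values(%s{%s}, %s)" % (metric, selector, gpu_label),
        "multi": True,
        "includeAll": True,
    }
    panel = {
        "type": "timeseries",
        "title": "GPU $gpu",
        # Repeat this panel once per selected value of the `gpu` variable.
        "repeat": "gpu",
        "repeatDirection": "h",
        "targets": [
            {"expr": '%s{%s, %s="$gpu"}' % (metric, selector, gpu_label)}
        ],
    }
    return {
        "title": "GPU usage (sketch)",
        "templating": {"list": [gpu_variable]},
        "panels": [panel],
    }


if __name__ == "__main__":
    print(json.dumps(build_dashboard(), indent=2))
```

With "All" selected for the variable, a pod on a 1 GPU node should produce a single panel and a pod on a 4 GPU node four panels, because the variable's value list is derived from the series that exist for that pod.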

fadjar340 · August 2, 2024, 10:19am

Hi,

I think GPU metrics follow the same principle as process metrics in the OS. This URL may give you some insight:
https://devconnected.com/monitoring-linux-processes-using-prometheus-and-grafana/

That article uses a small bash script to collect the output of the ps aux command on Linux.
I think the GPU can be queried the same way with NVIDIA tools such as nvidia-smi.

I believe that script can be adapted to collect the GPU metrics you need.
I tried nvidia-smi as follows:

nvidia-smi --query-gpu=timestamp,name,index,utilization.gpu,utilization.memory,memory.total,memory.free,memory.used --format=csv

And the result is as follows:
2024/08/02 17:16:39.312, NVIDIA GeForce GTX 1050, 0, 0 %, 0 %, 4096 MiB, 3685 MiB, 355 MiB

You can drop the timestamp, since the Prometheus scrape time replaces it, and you can adjust the nvidia-smi query to return whatever information you need.

With the CSV output, you can parse the result using the script from the URL above, and the bash script can then be run as a systemd service.

You can modify the labels and values in that script as you like.
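
As a rough illustration of the parsing step, here is a minimal sketch in Python (rather than the bash approach from the linked article) that turns a data row like the one above into Prometheus text-format lines. The metric names and the `gpu`/`name` labels are made up for the example, and it assumes the header row is omitted (for instance with `--format=csv,noheader`) and that the fields come in the same order as the query above.

```python
import csv
import io

# Field order matches the --query-gpu list in the command above.
FIELDS = ["timestamp", "name", "index", "utilization_gpu",
          "utilization_memory", "memory_total", "memory_free", "memory_used"]


def parse_value(raw):
    """Strip the unit suffix ("0 %", "4096 MiB") and return a float."""
    return float(raw.split()[0])


def to_prometheus(csv_text, prefix="gpu"):
    """Convert nvidia-smi CSV data rows into Prometheus text-format lines."""
    lines = []
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    for row in reader:
        if len(row) != len(FIELDS):
            continue  # skip blank or malformed rows
        rec = dict(zip(FIELDS, row))
        # Hypothetical label names; the timestamp is dropped on purpose.
        labels = 'gpu="%s",name="%s"' % (rec["index"], rec["name"])
        for field in FIELDS[3:]:
            try:
                value = parse_value(rec[field])
            except (ValueError, IndexError):
                continue  # value not numeric, e.g. an unsupported field
            lines.append("%s_%s{%s} %s" % (prefix, field, labels, value))
    return "\n".join(lines)


SAMPLE = ("2024/08/02 17:16:39.312, NVIDIA GeForce GTX 1050, 0, 0 %, 0 %, "
          "4096 MiB, 3685 MiB, 355 MiB")

if __name__ == "__main__":
    print(to_prometheus(SAMPLE))
```

Because the GPU index ends up as a label, each GPU on a node becomes its own series, which is what a per-GPU variable or panel would key on.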

Regards,
Fadjar
