Advanced Use - question

I have created a custom exporter that scrapes GPU usage information across the nodes of a Kubernetes cluster.
The catch is that some nodes have 1 GPU, while others have 2 or 4.

Here is an example of the dashboard:

gpu_metrics{cluster='$cluster', pod_name="$pod_name"}

I use the following transforms:

  1. Organize fields by name
  2. Group by

Goal:
I am wondering whether, instead of creating 1 table for all 4 GPUs, I can somehow dynamically provision 4 separate visualizations, one per GPU (each GPU has a unique series that serves as its identifier).

The reason I am looking for dynamic provisioning is so that a node with 1 GPU gets only 1 visualization, and a node with 4 GPUs gets 4.

Any suggestions are welcome 🙂

Can you create a variable for the distinct list of GPUs?

Hi,

I think GPU metrics follow the same principles as OS process metrics. Here is a URL that may give you some insight:
https://devconnected.com/monitoring-linux-processes-using-prometheus-and-grafana/

That article just uses a small bash script to collect the output of the ps aux command on Linux.
I think GPUs can be queried in the same way with NVIDIA tools such as nvidia-smi.

I believe the script can be adapted to your needs to collect the GPU metrics.
I tried nvidia-smi as follows:

nvidia-smi --query-gpu=timestamp,name,index,utilization.gpu,utilization.memory,memory.total,memory.free,memory.used --format=csv

And the result was as follows:
2024/08/02 17:16:39.312, NVIDIA GeForce GTX 1050, 0, 0 %, 0 %, 4096 MiB, 3685 MiB, 355 MiB

You can drop the timestamp, since the Prometheus scrape time replaces it, and you can adjust the nvidia-smi query to return whatever information you need.

Using the CSV output, you can parse the result with the script from the URL above, and the bash script can then be run under systemd.

You can modify the labels and values in that script as you like.
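
If you would rather do the parsing in Python than in bash, here is a rough sketch of the idea. It assumes nvidia-smi is run with --format=csv,noheader and with the timestamp removed from the --query-gpu list. The metric names and the gpu / name labels are made up for illustration, so rename them to match your exporter.

```python
import csv
import io

# Column order of the --query-gpu list used above, minus the timestamp.
FIELDS = ["name", "index", "utilization.gpu", "utilization.memory",
          "memory.total", "memory.free", "memory.used"]

# Hypothetical metric names chosen for this sketch.
METRICS = {
    "utilization.gpu": "gpu_utilization_percent",
    "utilization.memory": "gpu_memory_utilization_percent",
    "memory.total": "gpu_memory_total_mib",
    "memory.free": "gpu_memory_free_mib",
    "memory.used": "gpu_memory_used_mib",
}


def csv_to_prometheus(csv_text):
    """Convert nvidia-smi CSV rows into Prometheus text-format lines."""
    lines = []
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    for row in reader:
        if len(row) != len(FIELDS):
            continue  # skip blank lines or rows with an unexpected column count
        record = dict(zip(FIELDS, (cell.strip() for cell in row)))
        # One label set per GPU, so every GPU ends up as its own series.
        labels = 'gpu="{}",name="{}"'.format(record["index"], record["name"])
        for field, metric in METRICS.items():
            try:
                # Values look like "0 %" or "4096 MiB"; keep only the number.
                value = float(record[field].split()[0])
            except (ValueError, IndexError):
                continue  # e.g. "[N/A]", an empty cell, or a header row
            lines.append("{}{{{}}} {}".format(metric, labels, value))
    return "\n".join(lines)


if __name__ == "__main__":
    sample = "NVIDIA GeForce GTX 1050, 0, 0 %, 0 %, 4096 MiB, 3685 MiB, 355 MiB\n"
    print(csv_to_prometheus(sample))
```

Because each GPU carries its own gpu label here, a Grafana query variable such as label_values(gpu_memory_used_mib, gpu) could list the GPUs present on a node, and a panel set to repeat over that variable should then give one visualization per GPU.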

Regards,
Fadjar