Nvidia-smi and Python
前言
在之前的nvidia-smi 详解(一)写的过程中,查资料查到了有对应的python支持方法,就计划写这个了,随后写加查资料就写好代码了,但是就是犯懒一直没写文章,墨迹到了现在。
也是做了一些简单的接口测试以及使用,主要还是查询的接口,没测试控制类接口。
简介
对应的py库介绍主要是来自: nvidia-ml-py。
代码语言:txt复制Provides a Python interface to GPU management and monitoring functions.
This is a wrapper around the NVML library. For information about the NVML library, see the NVML developer page http://developer.nvidia.com/nvidia-management-library-nvml
Download the latest package from: http://pypi.python.org/pypi/nvidia-ml-py/
Note this file can be run with 'python -m doctest -v README.txt' although the results are system dependent
The nvml header file contains function documentation that is relevant to this wrapper. The header file is distributed with. https://developer.nvidia.com/gpu-deployment-kit
The main difference is this library handles allocating structs and passing pointers to the functions, before returning the desired value. Non-success return codes are raised as exceptions as described in the section below.
安装
pip安装
代码语言:shell复制python3 -m pip install nvidia-ml-py
安装包安装
下载地址:nvidia-ml-py 12.555.43
代码语言:shell复制wget https://files.pythonhosted.org/packages/ee/c6/2348fc1fb776ff41a34635fb1f18010a6d6fd7ba6e57184dabd9d98ba9cf/nvidia-ml-py-12.555.43.tar.gz
tar -xzf nvidia-ml-py-$major-$minor-$patch.tar.gz`
cd nvidia-ml-py-$major-$minor-$patch
sudo python setup.py install
注意:$major-$minor-$patch 这三个变量的替换
代码
导入
代码语言:python代码运行次数:0复制from pynvml import *
初始化
代码语言:python代码运行次数:0复制nvmlInit()
版本查询
代码语言:python代码运行次数:0复制print(f"Driver Version: {nvmlSystemGetDriverVersion()}")
输出结果: Driver Version: 552.22
GPU个数获取
代码语言:python代码运行次数:0复制deviceCount = nvmlDeviceGetCount()
获取对应索引GPU句柄
代码语言:python代码运行次数:0复制handle = nvmlDeviceGetHandleByIndex(i)
print(f"Device {i} : {nvmlDeviceGetName(handle)}")
输出结果:
代码语言:shell复制Device 0 : NVIDIA GeForce RTX 4060 Ti
获取显存信息
代码语言:python代码运行次数:0复制info = nvmlDeviceGetMemoryInfo(handle)
print(f"Total memory: {info.total} MiB")
print(f"Free memory: {info.free} MiB")
print(f"Used memory: {info.used} MiB")
输出结果:
代码语言:python代码运行次数:0复制Total memory: 8585740288 MiB
Free memory: 6701080576 MiB
Used memory: 1884659712 MiB
获取显存利用率
代码语言:python代码运行次数:0复制info = nvmlDeviceGetUtilizationRates(handle)
print(f"UtilizationRates Gpu: {info.gpu}%")
print(f"UtilizationRates Memory: {info.memory}%")
输出结果:
代码语言:python代码运行次数:0复制UtilizationRates Gpu: 1%
UtilizationRates Memory: 17%
获取编码利用率信息
代码语言:python代码运行次数:0复制info = nvmlDeviceGetEncoderUtilization(handle)
print(f"EncoderUtilization {info}")
输出结果:
代码语言:python代码运行次数:0复制EncoderUtilization [0, 200000]
获取解码利用率信息
代码语言:python代码运行次数:0复制info = nvmlDeviceGetDecoderUtilization(handle)
print(f"DecoderUtilization {info}")
输出结果:
代码语言:shell复制DecoderUtilization [0, 200000]
获取JPG利用信息
代码语言:python代码运行次数:0复制info = nvmlDeviceGetJpgUtilization(handle)
print(f"JpgUtilization {info}")
输出结果:
代码语言:python代码运行次数:0复制JpgUtilization [0, 200000]
通过Pid获取进程名字
代码语言:python代码运行次数:0复制def get_process_name_by_pid(pid):
try:
return psutil.Process(pid).name()
except:
return None
获取正在运行计算操作的进程列表
代码语言:shell复制info = nvmlDeviceGetComputeRunningProcesses(handle)
for index,item in enumerate(info):
print(f"{index} : {item} ,name: {get_process_name_by_pid(item.pid)}")
输出信息(只放部分进程):
代码语言:shell复制0 : {'pid': 1716, 'usedGpuMemory': None, 'gpuInstanceId': 4294967295, 'computeInstanceId': 4294967295} ,name: dwm.exe
1 : {'pid': 6384, 'usedGpuMemory': None, 'gpuInstanceId': 4294967295, 'computeInstanceId': 4294967295} ,name: explorer.exe
2 : {'pid': 6592, 'usedGpuMemory': None, 'gpuInstanceId': 4294967295, 'computeInstanceId': 4294967295} ,name: SearchHost.exe
3 : {'pid': 7064, 'usedGpuMemory': None, 'gpuInstanceId': 4294967295, 'computeInstanceId': 4294967295} ,name: StartMenuExperienceHost.exe
获取正在运行的图形进程列表
代码语言:python代码运行次数:0复制info = nvmlDeviceGetGraphicsRunningProcesses(handle)
for index,item in enumerate(info):
print(f"{index} : {item} ,name: {get_process_name_by_pid(item.pid)}")
输出信息(只放部分输出结果);
代码语言:shell复制0 : {'pid': 1716, 'usedGpuMemory': None, 'gpuInstanceId': 4294967295, 'computeInstanceId': 4294967295} ,name: dwm.exe
1 : {'pid': 6384, 'usedGpuMemory': None, 'gpuInstanceId': 4294967295, 'computeInstanceId': 4294967295} ,name: explorer.exe
2 : {'pid': 6592, 'usedGpuMemory': None, 'gpuInstanceId': 4294967295, 'computeInstanceId': 4294967295} ,name: SearchHost.exe
3 : {'pid': 7064, 'usedGpuMemory': None, 'gpuInstanceId': 4294967295, 'computeInstanceId': 4294967295} ,name: StartMenuExperienceHost.exe
获取GPU的当前温度
代码语言:python代码运行次数:0复制temperature_gpu = nvmlDeviceGetTemperature(handle, NVML_TEMPERATURE_GPU)
print(f"temperature_gpu {temperature_gpu}")
输出结果:
代码语言:shell复制temperature_gpu 29
获取GPU的上限温度
代码语言:python代码运行次数:0复制temperature_threshold = nvmlDeviceGetTemperatureThreshold(handle, NVML_TEMPERATURE_THRESHOLD_GPU_MAX)
print(f"temperature_threshold {temperature_threshold}")
输出结果:
代码语言:shell复制temperature_threshold 90
获取GPU设备的UUID
代码语言:python代码运行次数:0复制gpu_uuid = nvmlDeviceGetUUID(handle)
print(f"gpu_uuid {gpu_uuid}")
输出结果:
代码语言:shell复制gpu_uuid GPU-3fd9292f-3024-fbdb-4596-5c5560b91654
获取GPU设备的PCI信息
代码语言:python代码运行次数:0复制pci_info = nvmlDeviceGetPciInfo(handle)
print(f"pci info {pci_info}")
获取的其他信息 CLOCK相关
代码语言:python代码运行次数:0复制print(f"GRAPHICS clock info {nvmlDeviceGetClockInfo(handle, NVML_CLOCK_GRAPHICS)}")
print(f"SM clock info {nvmlDeviceGetClockInfo(handle, NVML_CLOCK_SM)}")
print(f"MEM clock info {nvmlDeviceGetClockInfo(handle, NVML_CLOCK_MEM)}")
print(f"VIDEO clock info {nvmlDeviceGetClockInfo(handle, NVML_CLOCK_VIDEO)}")
输出结果:
代码语言:shell复制GRAPHICS clock info 270
SM clock info 270
MEM clock info 405
VIDEO clock info 1185
获取电源相关信息
代码语言:python代码运行次数:0复制print(f"power state {nvmlDeviceGetPowerState(handle)}")
print(f"power usage watts {nvmlDeviceGetPowerUsage(handle) / 1000}")
输出结果:
代码语言:shell复制power state 8
power usage percent 27.649
结束调用
代码语言:python代码运行次数:0复制nvmlShutdown()
全部代码
主要是遍历输出所有Nvidia状态信息,可以根据自己需要来进行编写设置。
代码语言:python代码运行次数:0复制from pynvml import *
import psutil
def get_process_name_by_pid(pid):
try:
return psutil.Process(pid).name()
except:
return None
nvmlInit()
print(f"Driver Version: {nvmlSystemGetDriverVersion()}")
deviceCount = nvmlDeviceGetCount()
try:
for i in range(deviceCount):
handle = nvmlDeviceGetHandleByIndex(i)
print(f"Device {i} : {nvmlDeviceGetName(handle)}")
print(" ")
try:
(current, pending) = nvmlDeviceGetEccMode(handle)
print("current",current,"pending",pending)
except NVMLError as error:
print(error)
print(" ")
info = nvmlDeviceGetMemoryInfo(handle)
print(f"Total memory: {info.total} MiB")
print(f"Free memory: {info.free} MiB")
print(f"Used memory: {info.used} MiB")
print(" ")
info = nvmlDeviceGetUtilizationRates(handle)
print(f"UtilizationRates Gpu: {info.gpu}%")
print(f"UtilizationRates Memory: {info.memory}%")
print(" ")
info = nvmlDeviceGetEncoderUtilization(handle)
print(f"EncoderUtilization {info}")
print(" ")
info = nvmlDeviceGetDecoderUtilization(handle)
print(f"DecoderUtilization {info}")
print(" ")
info = nvmlDeviceGetJpgUtilization(handle)
print(f"JpgUtilization {info}")
print(" ")
info = nvmlDeviceGetOfaUtilization(handle)
print(f"OfaUtilization {info}")
print(" ")
info = nvmlDeviceGetComputeRunningProcesses(handle)
for index,item in enumerate(info):
print(f"{index} : {item} ,name: {get_process_name_by_pid(item.pid)}")
print(" ")
info = nvmlDeviceGetGraphicsRunningProcesses(handle)
for index,item in enumerate(info):
print(f"{index} : {item} ,name: {get_process_name_by_pid(item.pid)}")
print(" ")
try:
# NVML_PROCESS_MODE_COMPUTE | NVML_PROCESS_MODE_GRAPHICS | NVML_PROCESS_MODE_MPS
info = nvmlDeviceGetRunningProcessDetailList(handle, nvmlProcessDetailList_v1, NVML_PROCESS_MODE_COMPUTE)
for item in info:
print(f"ProcessDetailList {item}")
except NVMLError as error:
print(f"nvmlDeviceGetRunningProcessDetailList {error}")
temperature_gpu = nvmlDeviceGetTemperature(handle, NVML_TEMPERATURE_GPU)
print(f"temperature_gpu {temperature_gpu}")
temperature_threshold = nvmlDeviceGetTemperatureThreshold(handle, NVML_TEMPERATURE_THRESHOLD_GPU_MAX)
print(f"temperature_threshold {temperature_threshold}")
print("")
gpu_uuid = nvmlDeviceGetUUID(handle)
print(f"gpu_uuid {gpu_uuid}")
print("")
pci_info = nvmlDeviceGetPciInfo(handle)
print(f"pci info {pci_info}")
print("")
print(f"GRAPHICS clock info {nvmlDeviceGetClockInfo(handle, NVML_CLOCK_GRAPHICS)}")
print(f"SM clock info {nvmlDeviceGetClockInfo(handle, NVML_CLOCK_SM)}")
print(f"MEM clock info {nvmlDeviceGetClockInfo(handle, NVML_CLOCK_MEM)}")
print(f"VIDEO clock info {nvmlDeviceGetClockInfo(handle, NVML_CLOCK_VIDEO)}")
print("")
print(f"power state {nvmlDeviceGetPowerState(handle)}")
print(f"power usage watts {nvmlDeviceGetPowerUsage(handle) / 1000}")
except NVMLError as error:
print(error)
nvmlShutdown()
结束语
其实写的比较简单,只是摘取了我觉得日常会用的一些信息,还有很多其他函数,可以在pynvml.py
文件里边查看获取。或者 NVIDIA GPU等网站查看自己需要的一些信息,样例等。这半年其实一直没怎写,日常工作也比较无聊,开始研究srs,以及一些pyton脚本的编写,希望把文章抓起来,多写一些。