nvidia-smi in Detail (Part 1)

2024-01-17 11:13:26

Preface

nvidia-smi stands for NVIDIA System Management Interface. Most people only use it to check how their NVIDIA GPUs are doing: memory usage, GPU utilization, and the processes running on the card. A typical run looks like this:

nvidia-smi.exe
Tue Jan 16 22:43:00 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 537.70                 Driver Version: 537.70       CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                     TCC/WDDM  | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4060 Ti   WDDM  | 00000000:01:00.0  On |                  N/A |
| 32%   30C    P8               7W / 160W |    990MiB /  8188MiB |     27%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      3200    C+G   ...ekyb3d8bbwe\StoreExperienceHost.exe    N/A      |

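In my own day-to-day work I used it for just two things: checking that the command runs at all, which tells me whether the GPU driver installed correctly, and checking GPU usage as described above. Last week, while helping our monitoring team deploy GPU monitoring, I realized how shallow my knowledge of nvidia-smi was, so I spent some time studying it. I was also surprised to find that most introductions to nvidia-smi on the Chinese-language internet are quite brief, which is what prompted this article.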
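
The "is the command available" check mentioned above can be scripted. The following is a minimal Python sketch, not an official method: it assumes nvidia-smi is on PATH, and the helper name nvidia_smi_available is made up for this example. It treats a zero exit status from nvidia-smi -L as "usable", on the assumption that the tool exits non-zero when it cannot reach the driver.

```python
import shutil
import subprocess


def nvidia_smi_available(executable: str = "nvidia-smi") -> bool:
    """Return True if the executable is on PATH and `-L` exits with status 0."""
    path = shutil.which(executable)
    if path is None:
        return False  # not installed, or not on PATH
    try:
        # -L (--list-gpus) is a cheap call; assumed to fail when the driver is unreachable.
        result = subprocess.run([path, "-L"], capture_output=True, text=True, timeout=10)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


print("nvidia-smi usable:", nvidia_smi_available())
```

A True result only says the tool answered; it does not say anything about the health of individual GPUs.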

Introduction

I won't cover installing nvidia-smi here and will go straight to the tool itself. As usual, start by running nvidia-smi -h in a shell to view the help text.

nvidia-smi -h
NVIDIA System Management Interface -- v546.33

NVSMI provides monitoring information for Tesla and select Quadro devices.
The data is presented in either a plain text or an XML format, via stdout or a file.
NVSMI also provides several management operations for changing the device state.

Note that the functionality of NVSMI is exposed through the NVML C-based
library. See the NVIDIA developer website for more information about NVML.
Python wrappers to NVML are also available.  The output of NVSMI is
not guaranteed to be backwards compatible; NVML and the bindings are backwards
compatible.

http://developer.nvidia.com/nvidia-management-library-nvml/
http://pypi.python.org/pypi/nvidia-ml-py/
Supported products:
- Full Support
    - All Tesla products, starting with the Kepler architecture
    - All Quadro products, starting with the Kepler architecture
    - All GRID products, starting with the Kepler architecture
    - GeForce Titan products, starting with the Kepler architecture
- Limited Support
    - All Geforce products, starting with the Kepler architecture

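In other words:

NVSMI provides monitoring information for Tesla and select Quadro devices. The data can be presented as plain text or as XML, and written to stdout or to a file. It also offers several management operations for changing device state. The rest is a brief note on how NVSMI relates to NVML. The two links are worth noting, especially the second, which points to the Python bindings (nvidia-ml-py). The list of supported products follows.

Everything up to this point describes nvidia-smi itself; what follows is its command-line usage.
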
nvidia-smi [OPTION1 [ARG1]] [OPTION2 [ARG2]] ...

    -h,   --help                Print usage information and exit.

  LIST OPTIONS:

    -L,   --list-gpus           Display a list of GPUs connected to the system.

    -B,   --list-excluded-gpus  Display a list of excluded GPUs in the system.

  SUMMARY OPTIONS:

    <no arguments>              Show a summary of GPUs connected to the system.

    [plus any of]

    -i,   --id=                 Target a specific GPU.
    -f,   --filename=           Log to a specified file, rather than to stdout.
    -l,   --loop=               Probe until Ctrl C at specified second interval.

  QUERY OPTIONS:

    -q,   --query               Display GPU or Unit info.

    [plus any of]

    -u,   --unit                Show unit, rather than GPU, attributes.
    -i,   --id=                 Target a specific GPU or Unit.
    -f,   --filename=           Log to a specified file, rather than to stdout.
    -x,   --xml-format          Produce XML output.
          --dtd                 When showing xml output, embed DTD.
    -d,   --display=            Display only selected information: MEMORY,
                                    UTILIZATION, ECC, TEMPERATURE, POWER, CLOCK,
                                    COMPUTE, PIDS, PERFORMANCE, SUPPORTED_CLOCKS,
                                    PAGE_RETIREMENT, ACCOUNTING, ENCODER_STATS,
                                    SUPPORTED_GPU_TARGET_TEMP, VOLTAGE, FBC_STATS
                                    ROW_REMAPPER, RESET_STATUS
                                Flags can be combined with comma e.g. ECC,POWER.
                                Sampling data with max/min/avg is also returned
                                for POWER, UTILIZATION and CLOCK display types.
                                Doesn't work with -u or -x flags.
    -l,   --loop=               Probe until Ctrl C at specified second interval.

    -lms, --loop-ms=            Probe until Ctrl C at specified millisecond interval.

  SELECTIVE QUERY OPTIONS:

    Allows the caller to pass an explicit list of properties to query.

    [one of]

    --query-gpu                 Information about GPU.
                                Call --help-query-gpu for more info.
    --query-supported-clocks    List of supported clocks.
                                Call --help-query-supported-clocks for more info.
    --query-compute-apps        List of currently active compute processes.
                                Call --help-query-compute-apps for more info.
    --query-accounted-apps      List of accounted compute processes.
                                Call --help-query-accounted-apps for more info.
                                This query is not supported on vGPU host.
    --query-retired-pages       List of device memory pages that have been retired.
                                Call --help-query-retired-pages for more info.
    --query-remapped-rows       Information about remapped rows.
                                Call --help-query-remapped-rows for more info.

    [mandatory]

    --format=                   Comma separated list of format options:
                                  csv - comma separated values (MANDATORY)
                                  noheader - skip the first line with column headers
                                  nounits - don't print units for numerical
                                             values

    [plus any of]

    -i,   --id=                 Target a specific GPU or Unit.
    -f,   --filename=           Log to a specified file, rather than to stdout.
    -l,   --loop=               Probe until Ctrl C at specified second interval.
    -lms, --loop-ms=            Probe until Ctrl C at specified millisecond interval.

  DEVICE MODIFICATION OPTIONS:

    [any one of]

    -e,   --ecc-config=         Toggle ECC support: 0/DISABLED, 1/ENABLED
    -p,   --reset-ecc-errors=   Reset ECC error counts: 0/VOLATILE, 1/AGGREGATE
    -c,   --compute-mode=       Set MODE for compute applications:
                                0/DEFAULT, 1/EXCLUSIVE_THREAD (DEPRECATED),
                                2/PROHIBITED, 3/EXCLUSIVE_PROCESS
    -dm,  --driver-model=       Enable or disable TCC mode: 0/WDDM, 1/TCC
    -fdm, --force-driver-model= Enable or disable TCC mode: 0/WDDM, 1/TCC
                                Ignores the error that display is connected.
          --gom=                Set GPU Operation Mode:
                                    0/ALL_ON, 1/COMPUTE, 2/LOW_DP
    -lgc  --lock-gpu-clocks=    Specifies <minGpuClock,maxGpuClock> clocks as a
                                    pair (e.g. 1500,1500) that defines the range
                                    of desired locked GPU clock speed in MHz.
                                    Setting this will supercede application clocks
                                    and take effect regardless if an app is running.
                                    Input can also be a singular desired clock value
                                    (e.g. <GpuClockValue>). Optionally, --mode can be
                                    specified to indicate a special mode.
    -m    --mode=               Specifies the mode for --locked-gpu-clocks.
                                    Valid modes: 0, 1
    -rgc  --reset-gpu-clocks
                                Resets the Gpu clocks to the default values.
    -lmc  --lock-memory-clocks=  Specifies <minMemClock,maxMemClock> clocks as a
                                    pair (e.g. 5100,5100) that defines the range
                                    of desired locked Memory clock speed in MHz.
                                    Input can also be a singular desired clock value
                                    (e.g. <MemClockValue>).
    -rmc  --reset-memory-clocks
                                Resets the Memory clocks to the default values.
    -lmcd --lock-memory-clocks-deferred=
                                    Specifies memClock clock to lock. This limit is
                                    applied the next time GPU is initialized.
                                    This is guaranteed by unloading and reloading the kernel module.
                                    Requires root.
    -rmcd --reset-memory-clocks-deferred
                                Resets the deferred Memory clocks applied.
    -ac   --applications-clocks= Specifies <memory,graphics> clocks as a
                                    pair (e.g. 2000,800) that defines GPU's
                                    speed in MHz while running applications on a GPU.
    -rac  --reset-applications-clocks
                                Resets the applications clocks to the default values.
    -pl   --power-limit=        Specifies maximum power management limit in watts.
                                Takes an optional argument --scope.
    -sc   --scope=              Specifies the device type for --scope: 0/GPU, 1/TOTAL_MODULE (Grace Hopper Only)
    -cc   --cuda-clocks=        Overrides or restores default CUDA clocks.
                                In override mode, GPU clocks higher frequencies when running CUDA applications.
                                Only on supported devices starting from the Volta series.
                                Requires administrator privileges.
                                0/RESTORE_DEFAULT, 1/OVERRIDE
    -am   --accounting-mode=    Enable or disable Accounting Mode: 0/DISABLED, 1/ENABLED
    -caa  --clear-accounted-apps
                                Clears all the accounted PIDs in the buffer.
          --auto-boost-default= Set the default auto boost policy to 0/DISABLED
                                or 1/ENABLED, enforcing the change only after the
                                last boost client has exited.
          --auto-boost-permission=
                                Allow non-admin/root control over auto boost mode:
                                0/UNRESTRICTED, 1/RESTRICTED
    -mig  --multi-instance-gpu= Enable or disable Multi Instance GPU: 0/DISABLED, 1/ENABLED
                                Requires root.
    -gtt  --gpu-target-temp=    Set GPU Target Temperature for a GPU in degree celsius.
                                Requires administrator privileges

   [plus optional]

    -i,   --id=                 Target a specific GPU.
    -eow, --error-on-warning    Return a non-zero error for warnings.

  UNIT MODIFICATION OPTIONS:

    -t,   --toggle-led=         Set Unit LED state: 0/GREEN, 1/AMBER

   [plus optional]

    -i,   --id=                 Target a specific Unit.

  SHOW DTD OPTIONS:

          --dtd                 Print device DTD and exit.

     [plus optional]

    -f,   --filename=           Log to a specified file, rather than to stdout.
    -u,   --unit                Show unit, rather than device, DTD.

    --debug=                    Log encrypted debug information to a specified file.

 Device Monitoring:
    dmon                        Displays device stats in scrolling format.
                                "nvidia-smi dmon -h" for more information.

    daemon                      Runs in background and monitor devices as a daemon process.
                                This is an experimental feature. Not supported on Windows baremetal
                                "nvidia-smi daemon -h" for more information.

    replay                      Used to replay/extract the persistent stats generated by daemon.
                                This is an experimental feature.
                                "nvidia-smi replay -h" for more information.

 Process Monitoring:
    pmon                        Displays process stats in scrolling format.
                                "nvidia-smi pmon -h" for more information.

 NVLINK:
    nvlink                      Displays device nvlink information. "nvidia-smi nvlink -h" for more information.

 C2C:
    c2c                         Displays device C2C information. "nvidia-smi c2c -h" for more information.

 CLOCKS:
    clocks                      Control and query clock information. "nvidia-smi clocks -h" for more information.

 ENCODER SESSIONS:
    encodersessions             Displays device encoder sessions information. "nvidia-smi encodersessions -h" for more information.

 FBC SESSIONS:
    fbcsessions                 Displays device FBC sessions information. "nvidia-smi fbcsessions -h" for more information.

 MIG:
    mig                         Provides controls for MIG management. "nvidia-smi mig -h" for more information.

 COMPUTE POLICY:
    compute-policy              Control and query compute policies. "nvidia-smi compute-policy -h" for more information.


 BOOST SLIDER:
    boost-slider                Control and query boost sliders. "nvidia-smi boost-slider -h" for more information.

 POWER HINT:    power-hint                  Estimates GPU power usage. "nvidia-smi power-hint -h" for more information.

 BASE CLOCKS:    base-clocks                 Query GPU base clocks. "nvidia-smi base-clocks -h" for more information.

 GPU PERFORMANCE MONITORING:
    gpm                         Control and query GPU performance monitoring unit. "nvidia-smi gpm -h" for more information.

 PCI:
    pci                         Display device PCI information. "nvidia-smi pci -h" for more information.

Please see the nvidia-smi documentation for more detailed information.

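In practice, most users mainly care about getting monitoring information out of the tool. I worked out two ways of doing that, described briefly below.

Query options (QUERY OPTIONS)

Before getting to the query options, a quick word on a few of the other key options listed above.

LIST OPTIONS
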
-L,   --list-gpus           Display a list of GPUs connected to the system.

This lists the GPUs in the current system.

The result:

nvidia-smi -L
GPU 0: NVIDIA GeForce RTX 4060 Ti (UUID: GPU-XXXXX)
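
If the list is needed in a script, the line format shown above is simple enough to match with a regular expression. This is a small Python sketch under that assumption; the function name parse_gpu_list is hypothetical, and other line shapes (for example MIG device lines) are not handled.

```python
import re

# Matches lines such as:
#   GPU 0: NVIDIA GeForce RTX 4060 Ti (UUID: GPU-XXXXX)
GPU_LINE = re.compile(r"^GPU (\d+): (.+) \(UUID: (.+)\)$")


def parse_gpu_list(text: str) -> list:
    """Turn `nvidia-smi -L` output into a list of {index, name, uuid} dicts."""
    gpus = []
    for line in text.splitlines():
        match = GPU_LINE.match(line.strip())
        if match:
            gpus.append(
                {"index": int(match.group(1)), "name": match.group(2), "uuid": match.group(3)}
            )
    return gpus


sample = "GPU 0: NVIDIA GeForce RTX 4060 Ti (UUID: GPU-XXXXX)"
print(parse_gpu_list(sample))
```

The index field is the value that -i expects.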

The query options display the key attributes of the queried GPU, or of all GPUs. Their parameters are:

  QUERY OPTIONS:

    -q,   --query               Display GPU or Unit info.

    [plus any of]

    -u,   --unit                Show unit, rather than GPU, attributes.
    -i,   --id=                 Target a specific GPU or Unit.
    -f,   --filename=           Log to a specified file, rather than to stdout.
    -x,   --xml-format          Produce XML output.
          --dtd                 When showing xml output, embed DTD.
    -d,   --display=            Display only selected information: MEMORY,
                                    UTILIZATION, ECC, TEMPERATURE, POWER, CLOCK,
                                    COMPUTE, PIDS, PERFORMANCE, SUPPORTED_CLOCKS,
                                    PAGE_RETIREMENT, ACCOUNTING, ENCODER_STATS,
                                    SUPPORTED_GPU_TARGET_TEMP, VOLTAGE, FBC_STATS
                                    ROW_REMAPPER, RESET_STATUS
                                Flags can be combined with comma e.g. ECC,POWER.
                                Sampling data with max/min/avg is also returned
                                for POWER, UTILIZATION and CLOCK display types.
                                Doesn't work with -u or -x flags.
    -l,   --loop=               Probe until Ctrl C at specified second interval.

    -lms, --loop-ms=            Probe until Ctrl C at specified millisecond interval.

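In short:

  • -q prints all information.
  • -f writes the output to the given file, creating it if it does not exist.
  • -i selects a specific GPU; the integer index is the one shown as "GPU i" in the -L output above.
  • -x produces XML output.
  • -d restricts the output to the selected sections (18 are listed in this version). POWER, UTILIZATION and CLOCK additionally return max/min/avg sampling data. -d does not work together with -u or -x.
  • -l repeats the query at an interval given in seconds.
  • -lms repeats the query at an interval given in milliseconds.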

To print everything:

nvidia-smi -q

To write the output to a file:

nvidia-smi -q -f "D:\test"

To produce XML output:

nvidia-smi -q -x -f "D:\test.xml"
# To embed the DTD in the XML output
nvidia-smi -q -x --dtd -f "D:\test.xml"

The output looks like this (a short excerpt):

<?xml version="1.0" ?>
<!DOCTYPE nvidia_smi_log SYSTEM "nvsmi_device_v12.dtd">
<nvidia_smi_log>
	<timestamp>Tue Jan 16 21:23:03 2024</timestamp>
	<driver_version>537.70</driver_version>
	<cuda_version>12.2</cuda_version>
	<attached_gpus>1</attached_gpus>
	<gpu id="00000000:01:00.0">
		<product_name>NVIDIA GeForce RTX 4060 Ti</product_name>
		<product_brand>GeForce</product_brand>
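
One reason to prefer the XML form is that it can be read with a standard parser. Below is a hedged Python sketch using the standard library's xml.etree.ElementTree. The embedded document is a shortened, closed-off stand-in built only from the tags in the excerpt above; a real file contains many more elements, and the function name summarize is made up for this example.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed stand-in for a real file; only tags from the excerpt are used.
SAMPLE_XML = """<?xml version="1.0" ?>
<!DOCTYPE nvidia_smi_log SYSTEM "nvsmi_device_v12.dtd">
<nvidia_smi_log>
  <timestamp>Tue Jan 16 21:23:03 2024</timestamp>
  <driver_version>537.70</driver_version>
  <cuda_version>12.2</cuda_version>
  <attached_gpus>1</attached_gpus>
  <gpu id="00000000:01:00.0">
    <product_name>NVIDIA GeForce RTX 4060 Ti</product_name>
    <product_brand>GeForce</product_brand>
  </gpu>
</nvidia_smi_log>
"""


def summarize(xml_text: str) -> dict:
    """Extract the driver/CUDA versions and per-GPU product names from the XML log."""
    root = ET.fromstring(xml_text)
    return {
        "driver_version": root.findtext("driver_version"),
        "cuda_version": root.findtext("cuda_version"),
        # Map each GPU's bus id (the "id" attribute) to its product name.
        "gpus": {gpu.get("id"): gpu.findtext("product_name") for gpu in root.iter("gpu")},
    }


print(summarize(SAMPLE_XML))
```

For a file written with -f, ET.parse(path).getroot() would take the place of ET.fromstring.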

To print information for GPU 0 only:

nvidia-smi -q -i 0

To print every 10 seconds:

nvidia-smi -q -i 0 -l 10

To print every 100 ms:

nvidia-smi -q -i 0 -lms 100

To print only one section, that is, to fetch just one group of attributes, use -d as in the examples below:

  • MEMORY
nvidia-smi -q -i 0 -d MEMORY

The output:

==============NVSMI LOG==============

Timestamp                                 : Tue Jan 16 21:35:49 2024
Driver Version                            : 537.70
CUDA Version                              : 12.2

Attached GPUs                             : 1
GPU 00000000:01:00.0
    FB Memory Usage
        Total                             : 8188 MiB
        Reserved                          : 225 MiB
        Used                              : 790 MiB
        Free                              : 7172 MiB
    BAR1 Memory Usage
        Total                             : 8192 MiB
        Used                              : 1 MiB
        Free                              : 8191 MiB
    Conf Compute Protected Memory Usage
        Total                             : N/A
        Used                              : N/A
        Free                              : N/A
  • UTILIZATION
nvidia-smi -q -i 0 -d UTILIZATION

The output:

==============NVSMI LOG==============

Timestamp                                 : Tue Jan 16 21:37:08 2024
Driver Version                            : 537.70
CUDA Version                              : 12.2

Attached GPUs                             : 1
GPU 00000000:01:00.0
    Utilization
        Gpu                               : 4 %
        Memory                            : 4 %
        Encoder                           : 0 %
        Decoder                           : 0 %
        JPEG                              : 0 %
        OFA                               : 0 %
    GPU Utilization Samples
        Duration                          : 14.27 sec
        Number of Samples                 : 71
        Max                               : 15 %
        Min                               : 3 %
        Avg                               : 5 %
    Memory Utilization Samples
        Duration                          : 14.27 sec
        Number of Samples                 : 71
        Max                               : 10 %
        Min                               : 1 %
        Avg                               : 3 %
    ENC Utilization Samples
        Duration                          : 14.27 sec
        Number of Samples                 : 71
        Max                               : 0 %
        Min                               : 0 %
        Avg                               : 0 %
    DEC Utilization Samples
        Duration                          : 14.27 sec
        Number of Samples                 : 71
        Max                               : 0 %
        Min                               : 0 %
        Avg                               : 0 %
  • PIDS
nvidia-smi -q -i 0 -d PIDS

The output (it is long, so only the first few processes are listed):

==============NVSMI LOG==============

Timestamp                                 : Tue Jan 16 21:38:01 2024
Driver Version                            : 537.70
CUDA Version                              : 12.2

Attached GPUs                             : 1
GPU 00000000:01:00.0
    Processes
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 1724
            Type                          : C+G
            Name                          :
            Used GPU Memory               : Not available in WDDM driver model
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 3200
            Type                          : C+G
            Name                          : C:\Program Files\WindowsApps\Microsoft.StorePurchaseApp_12008.1001.1.0_x64__8wekyb3d8bbwe\StoreExperienceHost.exe
            Used GPU Memory               : Not available in WDDM driver model
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 3292
            Type                          : C+G
            Name                          : E:\Program Files\Typora\Typora.exe
            Used GPU Memory               : Not available in WDDM driver model
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 4616
            Type                          : C+G
            Name                          : C:Program l
  • COMPUTE
nvidia-smi -q -i 0 -d COMPUTE
  • TEMPERATURE
nvidia-smi -q -i 0 -d TEMPERATURE

The output:

==============NVSMI LOG==============

Timestamp                                 : Tue Jan 16 21:39:58 2024
Driver Version                            : 537.70
CUDA Version                              : 12.2

Attached GPUs                             : 1
GPU 00000000:01:00.0
    Temperature
        GPU Current Temp                  : 29 C
        GPU T.Limit Temp                  : 53 C
        GPU Shutdown T.Limit Temp         : -7 C
        GPU Slowdown T.Limit Temp         : -2 C
        GPU Max Operating T.Limit Temp    : 0 C
        GPU Target Temperature            : 83 C
        Memory Current Temp               : N/A
        Memory Max Operating T.Limit Temp : N/A

To select several sections at once, join them with commas, for example MEMORY,TEMPERATURE,UTILIZATION,PIDS:

nvidia-smi -q -i 0 -d MEMORY,UTILIZATION,PIDS,TEMPERATURE

The output (the PIDS part is cut short for length):

==============NVSMI LOG==============

Timestamp                                 : Tue Jan 16 21:43:17 2024
Driver Version                            : 537.70
CUDA Version                              : 12.2

Attached GPUs                             : 1
GPU 00000000:01:00.0
    FB Memory Usage
        Total                             : 8188 MiB
        Reserved                          : 225 MiB
        Used                              : 784 MiB
        Free                              : 7178 MiB
    BAR1 Memory Usage
        Total                             : 8192 MiB
        Used                              : 1 MiB
        Free                              : 8191 MiB
    Conf Compute Protected Memory Usage
        Total                             : N/A
        Used                              : N/A
        Free                              : N/A
    Utilization
        Gpu                               : 7 %
        Memory                            : 6 %
        Encoder                           : 0 %
        Decoder                           : 0 %
        JPEG                              : 0 %
        OFA                               : 0 %
    GPU Utilization Samples
        Duration                          : 14.17 sec
        Number of Samples                 : 71
        Max                               : 12 %
        Min                               : 3 %
        Avg                               : 4 %
    Memory Utilization Samples
        Duration                          : 14.17 sec
        Number of Samples                 : 71
        Max                               : 20 %
        Min                               : 3 %
        Avg                               : 5 %
    ENC Utilization Samples
        Duration                          : 14.17 sec
        Number of Samples                 : 71
        Max                               : 0 %
        Min                               : 0 %
        Avg                               : 0 %
    DEC Utilization Samples
        Duration                          : 14.17 sec
        Number of Samples                 : 71
        Max                               : 0 %
        Min                               : 0 %
        Avg                               : 0 %
    Temperature
        GPU Current Temp                  : 29 C
        GPU T.Limit Temp                  : 53 C
        GPU Shutdown T.Limit Temp         : -7 C
        GPU Slowdown T.Limit Temp         : -2 C
        GPU Max Operating T.Limit Temp    : 0 C
        GPU Target Temperature            : 83 C
        Memory Current Temp               : N/A
        Memory Max Operating T.Limit Temp : N/A
    Processes
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 1724
            Type                          : C+G
            Name                          :
            Used GPU Memory               : Not available in WDDM driver model
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 3200
            Type                          : C+G
            Name                          : C:\Program Files\WindowsApps\Microsoft.StorePurchaseApp_12008.1001.1.0_x64__8wekyb3d8bbwe\StoreExperienceHost.exe
            Used GPU Memory               : Not available in WDDM driver model
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 3292
            Type                          : C+G
            Name                          : E:\Program Files\Typora\Typora.exe
            Used GPU Memory               : Not available in WDDM driver model
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 4616
            Type                          : C+G
            Name                          : C:Program 
            Name                          : 

So, to keep a history for GPU 0 covering MEMORY, TEMPERATURE, UTILIZATION and PIDS at a 5-second interval, the command could be:

nvidia-smi -q -i 0 -d MEMORY,UTILIZATION,TEMPERATURE,PIDS -l 5 -f "D:/monitor.log"

Parsing the resulting file, however, is left to you.
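
As one possible starting point for that parsing, here is a Python sketch that turns the indented "Key : Value" report into nested dictionaries. It rests on a few assumptions that should be checked against a real log: each loop iteration begins with the NVSMI LOG banner, indentation uses spaces, and section headers never contain whitespace followed by a colon. The function names are made up, and the sample is stitched together from the outputs shown earlier.

```python
import re

# "Key   : value" lines. Headers such as "FB Memory Usage" or "GPU 00000000:01:00.0"
# have no whitespace directly before a colon, so they do not match.
KEY_VALUE = re.compile(r"^(.+?)\s+:\s*(.*)$")
BANNER = "==============NVSMI LOG=============="

SAMPLE_LOG = """\
==============NVSMI LOG==============

Timestamp                                 : Tue Jan 16 21:35:49 2024
Attached GPUs                             : 1
GPU 00000000:01:00.0
    FB Memory Usage
        Total                             : 8188 MiB
        Used                              : 790 MiB

==============NVSMI LOG==============

Timestamp                                 : Tue Jan 16 21:43:17 2024
Attached GPUs                             : 1
GPU 00000000:01:00.0
    FB Memory Usage
        Total                             : 8188 MiB
        Used                              : 784 MiB
    Temperature
        GPU Current Temp                  : 29 C
"""


def parse_snapshot(text: str) -> dict:
    """Parse one -q report into nested dicts, using indentation for nesting."""
    root = {}
    stack = [(-1, root)]  # (indent of the section header, dict for that section)
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("="):
            continue
        indent = len(raw) - len(raw.lstrip())
        while stack[-1][0] >= indent:  # leave sections that have ended
            stack.pop()
        parent = stack[-1][1]
        match = KEY_VALUE.match(line)
        if match:
            parent[match.group(1)] = match.group(2)
        else:
            section = {}
            parent[line] = section
            stack.append((indent, section))
    return root


def parse_log(text: str) -> list:
    """Split a looped log on the banner line and parse each snapshot."""
    return [parse_snapshot(chunk) for chunk in text.split(BANNER) if chunk.strip()]


for snap in parse_log(SAMPLE_LOG):
    print(snap["Timestamp"], snap["GPU 00000000:01:00.0"]["FB Memory Usage"]["Used"])
```

This works for sections such as MEMORY, UTILIZATION and TEMPERATURE. The PIDS section repeats the same keys for every process, so with plain dictionaries later entries overwrite earlier ones; that section would need list handling.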

SELECTIVE QUERY OPTIONS (selective queries)

Personally I prefer SELECTIVE QUERY OPTIONS: the queries are finer-grained and more convenient, whereas the -q queries above are coarser. The relevant options are:

    SELECTIVE QUERY OPTIONS:
    
    Allows the caller to pass an explicit list of properties to query.
    
    [one of]
    
    --query-gpu                 Information about GPU.
                                Call --help-query-gpu for more info.
    --query-supported-clocks    List of supported clocks.
                                Call --help-query-supported-clocks for more info.
    --query-compute-apps        List of currently active compute processes.
                                Call --help-query-compute-apps for more info.
    --query-accounted-apps      List of accounted compute processes.
                                Call --help-query-accounted-apps for more info.
                                This query is not supported on vGPU host.
    --query-retired-pages       List of device memory pages that have been retired.
                                Call --help-query-retired-pages for more info.
    --query-remapped-rows       Information about remapped rows.
                                Call --help-query-remapped-rows for more info.
    
    [mandatory]
    
    --format=                   Comma separated list of format options:
                                  csv - comma separated values (MANDATORY)
                                  noheader - skip the first line with column headers
                                  nounits - don't print units for numerical
                                             values
    
    [plus any of]
    
    -i,   --id=                 Target a specific GPU or Unit.
    -f,   --filename=           Log to a specified file, rather than to stdout.
    -l,   --loop=               Probe until Ctrl C at specified second interval.
    -lms, --loop-ms=            Probe until Ctrl C at specified millisecond interval.

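In summary:

  • --query-gpu: information about the GPU.
  • --query-supported-clocks: the clocks the GPU supports.
  • --query-compute-apps: currently active compute processes.
  • --query-accounted-apps / --query-retired-pages / --query-remapped-rows: accounted compute processes / retired device memory pages / remapped rows.
  • --format: a comma-separated list of format options: csv (comma-separated values, mandatory), noheader (skip the header line), nounits (omit units).
  • -i / -f / -l / -lms: target a GPU / write to a file / loop at a second interval / loop at a millisecond interval.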

Given the goals from the opening (GPU memory usage, GPU utilization and process information), I focus only on --query-gpu and --query-compute-apps. As the help text suggests, --help-query-gpu and --help-query-compute-apps give more detail.

--query-gpu

The full list of properties is printed by:

nvidia-smi --help-query-gpu

The output is too long to reproduce here, so a few examples will do.

To print gpu_name (alias name):

nvidia-smi --query-gpu=gpu_name --format=csv,noheader

The output:

NVIDIA GeForce RTX 4060 Ti

For temperature, the relevant properties are:

"temperature.gpu"
 Core GPU temperature. in degrees C.

"temperature.gpu.tlimit"
 GPU T.Limit temperature. in degrees C.

"temperature.memory"
 HBM memory temperature. in degrees C.

The command:

nvidia-smi --query-gpu=temperature.gpu,temperature.gpu.tlimit,temperature.memory --format=csv

The output:

temperature.gpu, temperature.gpu.tlimit, temperature.memory
30, 53, N/A

For GPU memory, the relevant properties are:

"memory.total"
Total installed GPU memory.

"memory.reserved"
Total memory reserved by the NVIDIA driver and firmware.

"memory.used"
Total memory allocated by active contexts.

"memory.free"
Total free memory.

The command:

nvidia-smi --query-gpu=memory.total,memory.reserved,memory.used,memory.free --format=csv

The output:

memory.total [MiB], memory.reserved [MiB], memory.used [MiB], memory.free [MiB]
8188 MiB, 225 MiB, 1032 MiB, 6930 MiB

To drop the header line:

nvidia-smi --query-gpu=memory.total,memory.reserved,memory.used,memory.free --format=csv,noheader

The output:

8188 MiB, 225 MiB, 1031 MiB, 6931 MiB

To drop the units as well:

nvidia-smi --query-gpu=memory.total,memory.reserved,memory.used,memory.free --format=csv,noheader,nounits

The output:

8188, 225, 1044, 6918
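
The csv,noheader,nounits form is the easiest one to consume from a program. As a rough Python sketch (the function name parse_memory_rows is hypothetical, and the field order is assumed to match the --query-gpu list used in the command), each line becomes one dictionary of integers in MiB:

```python
import csv
import io

# Must be in the same order as the properties passed to --query-gpu.
FIELDS = ["memory.total", "memory.reserved", "memory.used", "memory.free"]


def parse_memory_rows(text: str) -> list:
    """Parse --format=csv,noheader,nounits output: one dict of ints (MiB) per GPU."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    return [dict(zip(FIELDS, (int(cell) for cell in row))) for row in reader if row]


rows = parse_memory_rows("8188, 225, 1044, 6918\n")
for gpu in rows:
    used_ratio = gpu["memory.used"] / gpu["memory.total"]
    print(f"used {gpu['memory.used']} of {gpu['memory.total']} MiB ({used_ratio:.1%})")
```

int() raises on placeholder values such as N/A, so properties that a card does not support need separate handling.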

For utilization, the relevant properties are:

Section about utilization properties
Utilization rates report how busy each GPU is over time, and can be used to determine how much an application is using the GPUs in the system.
Note: On MIG-enabled GPUs, querying the utilization of encoder, decoder, jpeg, ofa, gpu, and memory is not currently supported.

"utilization.gpu"
Percent of time over the past sample period during which one or more kernels was executing on the GPU.
The sample period may be between 1 second and 1/6 second depending on the product.

"utilization.memory"
Percent of time over the past sample period during which global (device) memory was being read or written.
The sample period may be between 1 second and 1/6 second depending on the product.

"utilization.encoder"
Percent of time over the past sample period during which one or more kernels was executing on the Encoder Engine.
The sample period may be between 1 second and 1/6 second depending on the product.

"utilization.decoder"
Percent of time over the past sample period during which one or more kernels was executing on the Decoder Engine.
The sample period may be between 1 second and 1/6 second depending on the product.

"utilization.jpeg"
Percent of time over the past sample period during which one or more kernels was executing on the Jpeg Engine.
The sample period may be between 1 second and 1/6 second depending on the product.

"utilization.ofa"
Percent of time over the past sample period during which one or more kernels was executing on the Optical Flow Accelerator Engine.
The sample period may be between 1 second and 1/6 second depending on the product.

The command:

nvidia-smi --query-gpu=utilization.gpu,utilization.memory,utilization.encoder,utilization.decoder,utilization.jpeg,utilization.ofa --format=csv

The output:

utilization.gpu [%], utilization.memory [%], utilization.encoder [%], utilization.decoder [%], utilization.jpeg [%], utilization.ofa [%]
23 %, 10 %, 0 %, 0 %, 0 %, 0 %
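
When the header and units are kept, the header line itself can supply the field names. The Python sketch below is one way to do that for numeric properties only: it strips the "[unit]" suffix from each header and the unit from each value, and maps anything non-numeric (such as N/A) to None. The function name parse_csv_with_units is an assumption of this example.

```python
import csv
import io


def parse_csv_with_units(text: str) -> list:
    """Parse --format=csv output (header and units kept) into dicts of int or None."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    header = next(reader)
    # "utilization.gpu [%]" -> "utilization.gpu"
    names = [column.split(" [")[0].strip() for column in header]
    records = []
    for row in reader:
        if not row:
            continue
        record = {}
        for name, cell in zip(names, row):
            parts = cell.split()
            number = parts[0] if parts else ""  # "23 %" -> "23"
            record[name] = int(number) if number.isdigit() else None
        records.append(record)
    return records


utilization = parse_csv_with_units(
    "utilization.gpu [%], utilization.memory [%], utilization.encoder [%]\n"
    "23 %, 10 %, 0 %\n"
)
temperature = parse_csv_with_units(
    "temperature.gpu, temperature.gpu.tlimit, temperature.memory\n"
    "30, 53, N/A\n"
)
print(utilization)
print(temperature)
```

Text properties such as gpu_name would come back as None here, so this helper is only suited to numeric queries.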

-f, -i, -l and -lms are not repeated here; they work the same way as above.

--query-compute-apps

The help for querying the processes that use the GPU is:

nvidia-smi --help-query-compute-apps
List of valid properties to query for the switch "--query-compute-apps":

Section about Active Compute Processes properties
List of processes having compute context on the device.

"timestamp"
The timestamp of when the query was made in format "YYYY/MM/DD HH:MM:SS.msec".

"gpu_name"
The official product name of the GPU. This is an alphanumeric string. For all products.

"gpu_bus_id"
PCI bus id as "domain:bus:device.function", in hex.

"gpu_serial"
This number matches the serial number physically printed on each board. It is a globally unique immutable alphanumeric value.

"gpu_uuid"
This value is the globally unique immutable alphanumeric identifier of the GPU. It does not correspond to any physical label on the board.

"pid"
Process ID of the compute application

"process_name" or "name"
Process Name

"used_gpu_memory" or "used_memory"
Amount memory used on the device by the context. Not available on Windows when running in WDDM mode because Windows KMD manages all the memory not NVIDIA driver.

A command using all of these properties:

nvidia-smi --query-compute-apps=timestamp,gpu_name,gpu_bus_id,gpu_serial,gpu_uuid,pid,process_name,used_gpu_memory --format=csv

The output (only a few processes are kept):

timestamp, gpu_name, gpu_bus_id, gpu_serial, gpu_uuid, pid, process_name, used_gpu_memory [MiB]
2024/01/16 22:40:15.735, NVIDIA GeForce RTX 4060 Ti, 00000000:01:00.0, [N/A], GPU-3fd9292f-3024-fbdb-4596-5c5560b91654, 1724, [Insufficient Permissions], [N/A]
2024/01/16 22:40:15.735, NVIDIA GeForce RTX 4060 Ti, 00000000:01:00.0, [N/A], GPU-3fd9292f-3024-fbdb-4596-5c5560b91654, 11116, C:\Windows\SystemApps\Microsoft.Windows.StartMenuExperienceHost_cw5n1h2txyewy\StartMenuExperienceHost.exe, [N/A]
2024/01/16 22:40:15.735, NVIDIA GeForce RTX 4060 Ti, 00000000:01:00.0, [N/A], GPU-3fd9292f-3024-fbdb-4596-5c5560b91654, 8584, C:\Windows\explorer.exe, [N/A]
2024/01/16 22:40:15.735, NVIDIA GeForce RTX 4060 Ti, 00000000:01:00.0, [N/A], GPU-3fd9292f-3024-fbdb-4596-5c5560b91654, 11956, C:\Program Files\WindowsApps\Microsoft.YourPhone_1.22022.147.0_x64__8wekyb3d8bbwe\YourPhone.exe, [N/A]
2024/01/16 22:40:15.735, NVIDIA GeForce RTX 4060 Ti, 00000000:01:00.0, [N/A], GPU-3fd9292f-3024-fbdb-4596-5c5560b91654, 15612, [Insufficient Permissions], [N/A]
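
The process list can be read the same way. The Python sketch below uses csv.DictReader on two rows taken from the output above (with the path separators written out) and maps the bracketed placeholders to None. The function name parse_compute_apps and the PLACEHOLDERS set are assumptions of this example, not part of nvidia-smi.

```python
import csv
import io

# Two rows from the sample output; a raw string keeps the backslashes literal.
SAMPLE = r"""timestamp, gpu_name, gpu_bus_id, gpu_serial, gpu_uuid, pid, process_name, used_gpu_memory [MiB]
2024/01/16 22:40:15.735, NVIDIA GeForce RTX 4060 Ti, 00000000:01:00.0, [N/A], GPU-3fd9292f-3024-fbdb-4596-5c5560b91654, 1724, [Insufficient Permissions], [N/A]
2024/01/16 22:40:15.735, NVIDIA GeForce RTX 4060 Ti, 00000000:01:00.0, [N/A], GPU-3fd9292f-3024-fbdb-4596-5c5560b91654, 8584, C:\Windows\explorer.exe, [N/A]
"""

# Values seen in the sample that stand for "no data".
PLACEHOLDERS = {"[N/A]", "[Insufficient Permissions]"}


def parse_compute_apps(text: str) -> list:
    """Parse --query-compute-apps CSV output into one dict per process."""
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    apps = []
    for row in reader:
        clean = {key: (None if value in PLACEHOLDERS else value) for key, value in row.items()}
        clean["pid"] = int(clean["pid"])
        apps.append(clean)
    return apps


for app in parse_compute_apps(SAMPLE):
    print(app["pid"], app["process_name"])
```

The output is not quoted CSV, so a process path that itself contains a comma would not split correctly with this approach.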

This shows the timestamp, GPU name, bus id, serial number, UUID, PID, process name and the GPU memory used by each process.

Note: Only one --query-* switch can be used at a time.
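
To make that restriction hard to violate from a script, the command line can be assembled by a small helper that accepts exactly one query kind. This is a Python sketch with made-up names (build_query_command, QUERY_KINDS); it only uses switches that appear in the help text above and always includes the mandatory csv format option.

```python
# The --query-* switches listed under SELECTIVE QUERY OPTIONS.
QUERY_KINDS = {
    "gpu", "supported-clocks", "compute-apps",
    "accounted-apps", "retired-pages", "remapped-rows",
}


def build_query_command(kind, fields, gpu_id=None, loop_seconds=None, header=True, units=True):
    """Build an argument list for a single selective query."""
    if kind not in QUERY_KINDS:
        raise ValueError(f"unknown query kind: {kind!r}")
    if not fields:
        raise ValueError("at least one property is required")
    fmt = ["csv"]  # csv is mandatory
    if not header:
        fmt.append("noheader")
    if not units:
        fmt.append("nounits")
    cmd = ["nvidia-smi", f"--query-{kind}=" + ",".join(fields), "--format=" + ",".join(fmt)]
    if gpu_id is not None:
        cmd += ["-i", str(gpu_id)]
    if loop_seconds is not None:
        cmd += ["-l", str(loop_seconds)]
    return cmd


cmd = build_query_command(
    "gpu", ["memory.used", "memory.total"], gpu_id=0, loop_seconds=5, header=False, units=False
)
print(" ".join(cmd))
```

The resulting list can be handed to subprocess.run(cmd, capture_output=True, text=True); taking a single kind argument is what keeps two --query-* switches from ending up in one command.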

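Postscript

Keep in mind which version this post was written against: commands may differ between versions, so check -h for the details on your system. This post only covers my own understanding and use of nvidia-smi; corrections and other views are welcome. The next post will cover the Python bindings at http://pypi.python.org/pypi/nvidia-ml-py/ and how to call them.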
