Nvidia GPU information with nvidia-smi (Persistence-M: persistence mode; Volatile Uncorr. ECC: uncorrected ECC memory errors; GPU-Util: GPU utilization; Compute M.: compute mode)

Posted Dontla

C:\Users\SIQI>cd C:\Program Files\NVIDIA Corporation\NVSMI

C:\Program Files\NVIDIA Corporation\NVSMI>nvidia-smi -persistenced --user foo
Invalid combination of input arguments. Please run 'nvidia-smi -h' for help.


C:\Program Files\NVIDIA Corporation\NVSMI>nvidia-smi -persistenced
Invalid combination of input arguments. Please run 'nvidia-smi -h' for help.


C:\Program Files\NVIDIA Corporation\NVSMI>nvidia-smi -h
NVIDIA System Management Interface -- v441.08

NVSMI provides monitoring information for Tesla and select Quadro devices.
The data is presented in either a plain text or an XML format, via stdout or a file.
NVSMI also provides several management operations for changing the device state.

Note that the functionality of NVSMI is exposed through the NVML C-based
library. See the NVIDIA developer website for more information about NVML.
Python wrappers to NVML are also available.  The output of NVSMI is
not guaranteed to be backwards compatible; NVML and the bindings are backwards
compatible.

http://developer.nvidia.com/nvidia-management-library-nvml/
http://pypi.python.org/pypi/nvidia-ml-py/
Supported products:
- Full Support
    - All Tesla products, starting with the Kepler architecture
    - All Quadro products, starting with the Kepler architecture
    - All GRID products, starting with the Kepler architecture
    - GeForce Titan products, starting with the Kepler architecture
- Limited Support
    - All Geforce products, starting with the Kepler architecture
nvidia-smi [OPTION1 [ARG1]] [OPTION2 [ARG2]] ...

    -h,   --help                Print usage information and exit. 
                                Example: nvidia-smi -h

  LIST OPTIONS:

    -L,   --list-gpus           Display a list of GPUs connected to the system.
                                Example: nvidia-smi -L

    -B,   --list-blacklist-gpus Display a list of blacklisted GPUs in the system.
                                Example: nvidia-smi -B

  SUMMARY OPTIONS:

    <no arguments>              Show a summary of GPUs connected to the system.
                                Example: nvidia-smi

    [plus any of]

    -i,   --id=                 Target a specific GPU.
    -f,   --filename=           Log to a specified file, rather than to stdout.
    -l,   --loop=               Probe until Ctrl+C at specified second interval.

  QUERY OPTIONS:

    -q,   --query               Display GPU or Unit info.
                                Example: nvidia-smi -q

    [plus any of]

    -u,   --unit                Show unit, rather than GPU, attributes.
                                Example: nvidia-smi -q -u
    -i,   --id=                 Target a specific GPU or Unit.
                                Example: nvidia-smi -q -i 1
    -f,   --filename=           Log to a specified file, rather than to stdout.
    -x,   --xml-format          Produce XML output.
          --dtd                 When showing xml output, embed DTD.
    -d,   --display=            Display only selected information: MEMORY,
                                    UTILIZATION, ECC, TEMPERATURE, POWER, CLOCK,
                                    COMPUTE, PIDS, PERFORMANCE, SUPPORTED_CLOCKS,
                                    PAGE_RETIREMENT, ACCOUNTING, ENCODER_STATS, FBC_STATS
                                Flags can be combined with comma e.g. ECC,POWER.
                                Sampling data with max/min/avg is also returned
                                for POWER, UTILIZATION and CLOCK display types.
                                Doesn't work with -u or -x flags.
                                Examples:
                                    nvidia-smi -q -d MEMORY
                                    nvidia-smi -q -d MEMORY,UTILIZATION
    -l,   --loop=               Probe until Ctrl+C at specified second interval.
    -lms, --loop-ms=            Probe until Ctrl+C at specified millisecond interval.
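
The -x flag listed above makes the query output easier to consume from a script than the plain-text report. The following is a small Python sketch of reading such output, not a reference implementation. The function name summarize_query_xml is hypothetical, the sample document is a hand-written, heavily abbreviated stand-in for what nvidia-smi -q -x prints, and the element names (nvidia_smi_log, gpu, product_name, fb_memory_usage, utilization) are assumed from typical driver output; check them against the DTD printed by nvidia-smi --dtd on your system.

```python
import xml.etree.ElementTree as ET

# Hand-written, abbreviated stand-in for `nvidia-smi -q -x` output (assumption).
SAMPLE_XML = """<nvidia_smi_log>
  <driver_version>440.118.02</driver_version>
  <attached_gpus>1</attached_gpus>
  <gpu id="00000000:35:00.0">
    <product_name>Tesla T4</product_name>
    <fb_memory_usage>
      <total>15109 MiB</total>
      <used>5268 MiB</used>
    </fb_memory_usage>
    <utilization>
      <gpu_util>0 %</gpu_util>
    </utilization>
  </gpu>
</nvidia_smi_log>
"""


def summarize_query_xml(xml_text):
    """Pull a few per-GPU fields out of an XML query report (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    gpus = []
    for gpu in root.findall("gpu"):
        gpus.append({
            "bus_id": gpu.get("id"),
            "name": gpu.findtext("product_name"),
            "memory_used": gpu.findtext("fb_memory_usage/used"),
            "memory_total": gpu.findtext("fb_memory_usage/total"),
            "gpu_util": gpu.findtext("utilization/gpu_util"),
        })
    return {"driver_version": root.findtext("driver_version"), "gpus": gpus}


summary = summarize_query_xml(SAMPLE_XML)
print(summary["driver_version"], summary["gpus"][0]["name"])
```

Values are kept as the strings found in the report (units included); a missing element simply yields None rather than raising.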

  SELECTIVE QUERY OPTIONS:

    Allows the caller to pass an explicit list of properties to query.

    [one of]

    --query-gpu=                Information about GPU.
                                Call --help-query-gpu for more info.
    --query-supported-clocks=   List of supported clocks.
                                Call --help-query-supported-clocks for more info.
    --query-compute-apps=       List of currently active compute processes.
                                Call --help-query-compute-apps for more info.
    --query-accounted-apps=     List of accounted compute processes.
                                Call --help-query-accounted-apps for more info.
    --query-retired-pages=      List of device memory pages that have been retired.
                                Call --help-query-retired-pages for more info.

    [mandatory]

    --format=                   Comma separated list of format options:
                                  csv - comma separated values (MANDATORY)
                                  noheader - skip the first line with column headers
                                  nounits - don't print units for numerical
                                             values

    [plus any of]

    -i,   --id=                 Target a specific GPU or Unit.
    -f,   --filename=           Log to a specified file, rather than to stdout.
    -l,   --loop=               Probe until Ctrl+C at specified second interval.
    -lms, --loop-ms=            Probe until Ctrl+C at specified millisecond interval.
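
The selective query options above suit scripting, because --format=csv yields machine-readable rows. Below is a minimal Python sketch of that idea under stated assumptions: the helper names build_query_command, parse_query_csv and query_gpus are hypothetical, the chosen field names (index, name, memory.used, memory.total, utilization.gpu) are assumed to be among those printed by nvidia-smi --help-query-gpu for your driver, and the sample text is made up to mirror the Tesla T4 figures in the summary table in this article.

```python
import csv
import io
import subprocess

# Assumed field names; confirm them with `nvidia-smi --help-query-gpu`.
FIELDS = ["index", "name", "memory.used", "memory.total", "utilization.gpu"]


def build_query_command(fields=FIELDS):
    """Build the argument list for a selective GPU query in CSV form."""
    return [
        "nvidia-smi",
        "--query-gpu=" + ",".join(fields),
        "--format=csv,noheader,nounits",
    ]


def parse_query_csv(text, fields=FIELDS):
    """Turn CSV rows (no header, no units) into one dict per GPU."""
    rows = []
    for record in csv.reader(io.StringIO(text), skipinitialspace=True):
        if record:  # skip blank lines
            rows.append(dict(zip(fields, (value.strip() for value in record))))
    return rows


def query_gpus():
    """Run nvidia-smi and parse its output; needs a machine with the driver installed."""
    done = subprocess.run(
        build_query_command(), capture_output=True, text=True, check=True
    )
    return parse_query_csv(done.stdout)


# Made-up sample in the shape the command is expected to print.
SAMPLE = "0, Tesla T4, 5268, 15109, 0\n1, Tesla T4, 0, 15109, 0\n"

for gpu in parse_query_csv(SAMPLE):
    print(gpu["index"], gpu["name"], gpu["memory.used"], "MiB used")
```

Keeping command construction and parsing separate from the subprocess call means the parsing can be checked against saved output on a machine without a GPU.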

  DEVICE MODIFICATION OPTIONS:

    [any one of]

    -e,   --ecc-config=         Toggle ECC support: 0/DISABLED, 1/ENABLED
    -p,   --reset-ecc-errors=   Reset ECC error counts: 0/VOLATILE, 1/AGGREGATE
    -c,   --compute-mode=       Set MODE for compute applications:
                                0/DEFAULT, 1/EXCLUSIVE_PROCESS,
                                2/PROHIBITED
    -dm,  --driver-model=       Enable or disable TCC mode: 0/WDDM, 1/TCC
    -fdm, --force-driver-model= Enable or disable TCC mode: 0/WDDM, 1/TCC
                                Ignores the error that display is connected.
          --gom=                Set GPU Operation Mode:
                                    0/ALL_ON, 1/COMPUTE, 2/LOW_DP
    -lgc  --lock-gpu-clocks=    Specifies <minGpuClock,maxGpuClock> clocks as a
                                    pair (e.g. 1500,1500) that defines the range
                                    of desired locked GPU clock speed in MHz.
                                    Setting this will supercede application clocks
                                    and take effect regardless if an app is running.
                                    Input can also be a singular desired clock value
                                    (e.g. <GpuClockValue>).
    -rgc  --reset-gpu-clocks
                                Resets the Gpu clocks to the default values.
    -ac   --applications-clocks= Specifies <memory,graphics> clocks as a
                                    pair (e.g. 2000,800) that defines GPU's
                                    speed in MHz while running applications on a GPU.
    -rac  --reset-applications-clocks
                                Resets the applications clocks to the default values.
    -acp  --applications-clocks-permission=
                                Toggles permission requirements for -ac and -rac commands:
                                0/UNRESTRICTED, 1/RESTRICTED
    -pl   --power-limit=        Specifies maximum power management limit in watts.
    -cc   --cuda-clocks=        Overrides or restores default CUDA clocks.
                                In override mode, GPU clocks higher frequencies when running CUDA applications.
                                Only on supported devices starting from the Volta series.
                                Requires administrator privileges.
                                0/RESTORE_DEFAULT, 1/OVERRIDE
    -am   --accounting-mode=    Enable or disable Accounting Mode: 0/DISABLED, 1/ENABLED
    -caa  --clear-accounted-apps
                                Clears all the accounted PIDs in the buffer.
          --auto-boost-default= Set the default auto boost policy to 0/DISABLED
                                or 1/ENABLED, enforcing the change only after the
                                last boost client has exited.
          --auto-boost-permission=
                                Allow non-admin/root control over auto boost mode:
                                0/UNRESTRICTED, 1/RESTRICTED

   [plus optional]

    -i,   --id=                 Target a specific GPU.
    -eow, --error-on-warning    Return a non-zero error for warnings.

  UNIT MODIFICATION OPTIONS:

    -t,   --toggle-led=         Set Unit LED state: 0/GREEN, 1/AMBER

   [plus optional]

    -i,   --id=                 Target a specific Unit.

  SHOW DTD OPTIONS:

          --dtd                 Print device DTD and exit.

     [plus optional]

    -f,   --filename=           Log to a specified file, rather than to stdout.
    -u,   --unit                Show unit, rather than device, DTD.

    --debug=                    Log encrypted debug information to a specified file.

 Device Monitoring:
    dmon                        Displays device stats in scrolling format.
                                "nvidia-smi dmon -h" for more information.

    daemon                      Runs in background and monitor devices as a daemon process.
                                This is an experimental feature. Not supported on Windows baremetal
                                "nvidia-smi daemon -h" for more information.

    replay                      Used to replay/extract the persistent stats generated by daemon.
                                This is an experimental feature.
                                "nvidia-smi replay -h" for more information.

 Process Monitoring:
    pmon                        Displays process stats in scrolling format.
                                "nvidia-smi pmon -h" for more information.

 NVLINK:
    nvlink                      Displays device nvlink information. "nvidia-smi nvlink -h" for more information.

 CLOCKS:
    clocks                      Control and query clock information. "nvidia-smi clocks -h" for more information.

 ENCODER SESSIONS:
    encodersessions             Displays device encoder sessions information. "nvidia-smi encodersessions -h" for more information.

 FBC SESSIONS:
    fbcsessions                 Displays device FBC sessions information. "nvidia-smi fbcsessions -h" for more information.

Please see the nvidia-smi documentation for more detailed information.

C:\Program Files\NVIDIA Corporation\NVSMI>
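
The help text above notes that NVSMI is built on NVML, that Python wrappers exist, and that NVSMI's own output is not guaranteed to stay backwards compatible. For scripts, reading values through NVML may therefore be the more robust route. What follows is a rough Python sketch under assumptions: it presumes the nvidia-ml-py package linked above is installed and imports as pynvml, the function name collect_gpu_stats is hypothetical, and the binding module is passed in as a parameter so the logic can be tried with a stand-in object on a machine without a GPU.

```python
def collect_gpu_stats(nvml):
    """Collect name, memory and utilization for every GPU.

    `nvml` is expected to be the pynvml module (assumption), or any object
    offering the same functions.
    """
    nvml.nvmlInit()
    try:
        stats = []
        for index in range(nvml.nvmlDeviceGetCount()):
            handle = nvml.nvmlDeviceGetHandleByIndex(index)
            name = nvml.nvmlDeviceGetName(handle)
            if isinstance(name, bytes):  # older bindings return bytes
                name = name.decode()
            memory = nvml.nvmlDeviceGetMemoryInfo(handle)      # values in bytes
            util = nvml.nvmlDeviceGetUtilizationRates(handle)  # values in percent
            stats.append({
                "index": index,
                "name": name,
                "memory_used_mib": memory.used // 1024 ** 2,
                "memory_total_mib": memory.total // 1024 ** 2,
                "gpu_util_percent": util.gpu,
            })
        return stats
    finally:
        # Always release NVML, even if one of the queries fails.
        nvml.nvmlShutdown()


# Typical use on a machine with the driver and nvidia-ml-py installed:
#     import pynvml
#     for gpu in collect_gpu_stats(pynvml):
#         print(gpu)
```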

nvidia-smi output on Ubuntu 20.04

nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.118.02   Driver Version: 440.118.02   CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:35:00.0 Off |                    0 |
| N/A   62C    P0    29W /  70W |   5268MiB / 15109MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            Off  | 00000000:36:00.0 Off |                    0 |
| N/A   54C    P0    28W /  70W |      0MiB / 15109MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla T4            Off  | 00000000:9C:00.0 Off |                    0 |
| N/A   51C    P0    26W /  70W |      0MiB / 15109MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla T4            Off  | 00000000:9D:00.0 Off |                    0 |
| N/A   52C    P0    27W /  70W |      0MiB / 15109MiB |      5%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      7311      C   /opt/tensorrtserver/bin/trtserver           5225MiB |
+-----------------------------------------------------------------------------+
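
To show how the second line of each GPU block in the table above breaks down, here is a small Python sketch that splits such a line into named fields. It is illustrative only: the name parse_summary_rows and the regular expression are my own, the pattern is fitted to the layout printed by this particular driver version, and it does not handle rows where power or memory is reported as N/A. Since the layout of this table is not guaranteed to stay stable, the CSV query options or NVML are likely the better choice for real scripts.

```python
import re

# Matches rows such as:
# | N/A   62C    P0    29W /  70W |   5268MiB / 15109MiB |      0%      Default |
ROW = re.compile(
    r"^\|\s*(?P<fan>\S+)\s+(?P<temp>\d+)C\s+(?P<perf>P\d+)\s+"
    r"(?P<power>\d+)W\s*/\s*(?P<cap>\d+)W\s*\|\s*"
    r"(?P<used>\d+)MiB\s*/\s*(?P<total>\d+)MiB\s*\|\s*"
    r"(?P<util>\d+)%\s+(?P<mode>.+?)\s*\|$"
)

# Two rows copied from the table above, plus a header line that should be skipped.
SAMPLE = """\
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
| N/A   62C    P0    29W /  70W |   5268MiB / 15109MiB |      0%      Default |
| N/A   52C    P0    27W /  70W |      0MiB / 15109MiB |      5%      Default |
"""


def parse_summary_rows(text):
    """Return one dict per status row found in the summary table text."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match is None:
            continue  # header, separator, or a layout this sketch does not know
        rows.append({
            "fan": match["fan"],
            "temperature_c": int(match["temp"]),
            "perf_state": match["perf"],
            "power_draw_w": int(match["power"]),
            "power_cap_w": int(match["cap"]),
            "memory_used_mib": int(match["used"]),
            "memory_total_mib": int(match["total"]),
            "gpu_util_percent": int(match["util"]),
            "compute_mode": match["mode"],
        })
    return rows


for row in parse_summary_rows(SAMPLE):
    share = row["memory_used_mib"] / row["memory_total_mib"]
    print(f"{row['temperature_c']}C, memory {share:.0%} used, mode {row['compute_mode']}")
```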

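Field descriptions:

  • Driver Version: the version of the installed NVIDIA driver.
  • CUDA Version: the highest CUDA version this driver supports (not necessarily the version of any installed CUDA toolkit).
  • GPU Name: the product name of the GPU.
  • Persistence-M: persistence mode, which is a driver setting and not a kind of memory. When it is on, the NVIDIA driver stays loaded even when no client such as a CUDA application is using the GPU, which shortens the start-up delay of the next job. It keeps driver state only; the contents of GPU memory do not survive a reboot or power loss. The setting applies to Linux only and can be turned on with nvidia-smi -pm 1 or by running the nvidia-persistenced daemon.
  • Bus-Id: the GPU's PCI bus address in domain:bus:device.function form, written in hexadecimal (for example 00000000:35:00.0). It identifies where the card sits on the PCI bus, so every GPU in a system has a distinct value that the system can use to tell the cards apart and manage them.
  • Disp.A: short for Display Active, shown as On or Off. It indicates whether a display is initialized on the GPU, that is, whether memory on the device is allocated for display output. It can be On even when no monitor is physically attached.
  • Volatile Uncorr. ECC: relates to ECC, the error correction of GPU memory. The column reports the number of uncorrected ECC errors detected since the driver was last loaded.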
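
As a small worked example for the Bus-Id field, the Python sketch below splits a value such as 00000000:35:00.0 from the table above into its PCI components, assuming the usual domain:bus:device.function layout with hexadecimal digits. The function name parse_bus_id is hypothetical.

```python
import re

# domain:bus:device.function, all written in hexadecimal (assumed layout).
BUS_ID = re.compile(
    r"(?P<domain>[0-9A-Fa-f]+):(?P<bus>[0-9A-Fa-f]{2}):"
    r"(?P<device>[0-9A-Fa-f]{2})\.(?P<function>[0-9A-Fa-f])"
)


def parse_bus_id(bus_id):
    """Split a Bus-Id string into integer domain, bus, device and function."""
    match = BUS_ID.fullmatch(bus_id.strip())
    if match is None:
        raise ValueError(f"unrecognized Bus-Id: {bus_id!r}")
    return {name: int(value, 16) for name, value in match.groupdict().items()}


# Bus-Id values taken from the table above.
for value in ("00000000:35:00.0", "00000000:9C:00.0"):
    print(value, parse_bus_id(value))
```

Because the digits are hexadecimal, bus 9C corresponds to decimal 156 and bus 35 to decimal 53.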
