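Microway's GPU Test Drive cluster, which we provide as a benchmarking service to our customers, contains a group of NVIDIA's latest Tesla GPUs. These are NVIDIA's high-performance compute GPUs, and they report a good deal of health and status information. The examples below are taken from this internal cluster.
To list all available NVIDIA devices, run: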
nvidia-smi -L
GPU 0: Tesla K40m (UUID: GPU-d0e093a0-c3b3-f458-5a55-6eb69fxxxxxx)
GPU 1: Tesla K40m (UUID: GPU-d105b085-7239-3871-43ef-975ecaxxxxxx)
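If the device list needs to be consumed by a script, the lines above are regular enough to parse. The following is a minimal Python sketch, not part of nvidia-smi itself: the function name `parse_gpu_list` and the regular expression are assumptions based only on the two sample lines shown here, so other driver versions may need adjustments.

```python
import re

# Sample text copied from the listing above. On a real system this would be
# the captured stdout of `nvidia-smi -L`.
SAMPLE_LIST = """\
GPU 0: Tesla K40m (UUID: GPU-d0e093a0-c3b3-f458-5a55-6eb69fxxxxxx)
GPU 1: Tesla K40m (UUID: GPU-d105b085-7239-3871-43ef-975ecaxxxxxx)
"""

# Assumed line shape: "GPU <index>: <name> (UUID: <uuid>)"
_LINE = re.compile(r"^GPU (\d+): (.+) \(UUID: (\S+)\)$")


def parse_gpu_list(text):
    """Return a list of (index, name, uuid) tuples, one per matching line."""
    gpus = []
    for line in text.splitlines():
        match = _LINE.match(line.strip())
        if match:
            gpus.append((int(match.group(1)), match.group(2), match.group(3)))
    return gpus


if __name__ == "__main__":
    for index, name, uuid in parse_gpu_list(SAMPLE_LIST):
        print(index, name, uuid)
```

Lines that do not match the assumed shape are skipped rather than raising an error.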
To list certain details about each GPU, run:
nvidia-smi --query-gpu=index,name,uuid,serial --format=csv
0, Tesla K40m, GPU-d0e093a0-c3b3-f458-5a55-6eb69fxxxxxx, 0323913xxxxxx
1, Tesla K40m, GPU-d105b085-7239-3871-43ef-975ecaxxxxxx, 0324214xxxxxx
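The CSV form is the easiest one to load into a script. As a rough sketch (the helper name `parse_query_rows` is hypothetical, and the field list simply repeats the names passed to `--query-gpu` above), Python's standard `csv` module can turn each row into a dictionary. The code tolerates an optional header row, since nvidia-smi may print one unless `--format=csv,noheader` is used.

```python
import csv
import io

# Rows copied from the listing above.
SAMPLE_CSV = """\
0, Tesla K40m, GPU-d0e093a0-c3b3-f458-5a55-6eb69fxxxxxx, 0323913xxxxxx
1, Tesla K40m, GPU-d105b085-7239-3871-43ef-975ecaxxxxxx, 0324214xxxxxx
"""

# Same order as the --query-gpu argument used above.
FIELDS = ["index", "name", "uuid", "serial"]


def parse_query_rows(text, fields=FIELDS):
    """Map each CSV row to a dict keyed by the queried field names."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    rows = []
    for record in reader:
        if len(record) != len(fields):
            continue  # blank or unexpected line
        if record[0] == fields[0]:
            continue  # header row, if one was printed
        rows.append(dict(zip(fields, record)))
    return rows


if __name__ == "__main__":
    for row in parse_query_rows(SAMPLE_CSV):
        print(row["index"], row["serial"])
```

All values are kept as strings; converting `index` to an integer is left to the caller.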
To monitor overall GPU usage with 1-second update intervals:
nvidia-smi dmon
# gpu   pwr gtemp mtemp    sm   mem   enc   dec  mclk  pclk
# Idx     W     C     C     %     %     %     %   MHz   MHz
    0    43    35     -     0     0     0     0  2505  1075
    1    42    31     -    97     9     0     0  2505  1075
(in this example, one GPU is idle and one GPU has 97% of the CUDA sm "cores" in use)
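To act on `nvidia-smi dmon` readings automatically, each sample line can be split on whitespace and keyed by the column names in the first header line. The sketch below is an illustration under assumptions: `parse_dmon`, `busy_gpus`, and the 50% threshold are hypothetical names and values, the sample rows repeat the readings from the example above, and every reported value is assumed to be an integer or `-`.

```python
# Sample rows matching the dmon readings shown above.
SAMPLE_DMON = """\
# gpu   pwr gtemp mtemp    sm   mem   enc   dec  mclk  pclk
# Idx     W     C     C     %     %     %     %   MHz   MHz
    0    43    35     -     0     0     0     0  2505  1075
    1    42    31     -    97     9     0     0  2505  1075
"""


def parse_dmon(text):
    """Return one dict per sample line; '-' readings become None."""
    columns = None
    samples = []
    for line in text.splitlines():
        if line.startswith("#"):
            if columns is None:
                # The first comment line carries the column names.
                columns = line.lstrip("#").split()
            continue
        values = line.split()
        if columns is None or len(values) != len(columns):
            continue
        row = {}
        for key, raw in zip(columns, values):
            row[key] = None if raw == "-" else int(raw)
        samples.append(row)
    return samples


def busy_gpus(samples, threshold=50):
    """Indices of GPUs whose SM utilization meets the (assumed) threshold."""
    return [s["gpu"] for s in samples
            if s["sm"] is not None and s["sm"] >= threshold]


if __name__ == "__main__":
    print(busy_gpus(parse_dmon(SAMPLE_DMON)))
```

Because only the first header line sets the column names, headers that dmon repeats during a long run are ignored.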
To monitor per-process GPU usage with 1-second update intervals:
nvidia-smi pmon
# gpu        pid  type    sm   mem   enc   dec   command
# Idx          #   C/G     %     %     %     %   name
    0      14835     C    45    15     0     0   python
    1      14945     C    64    50     0     0   python
(in this case, two different python processes are running; one on each GPU)
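The per-process view can be grouped by GPU in much the same way. This is a small sketch rather than an official parser: `parse_pmon` is a hypothetical helper, the eight-column layout is taken from the header shown above, and rows whose pid is `-` (which nvidia-smi prints for a GPU with no running process) are skipped.

```python
from collections import defaultdict

# Sample rows matching the pmon readings shown above.
SAMPLE_PMON = """\
# gpu        pid  type    sm   mem   enc   dec   command
# Idx          #   C/G     %     %     %     %   name
    0      14835     C    45    15     0     0   python
    1      14945     C    64    50     0     0   python
"""


def _to_int(raw):
    """Convert a reading to int, mapping '-' (not available) to None."""
    return None if raw == "-" else int(raw)


def parse_pmon(text):
    """Group process rows by GPU index."""
    by_gpu = defaultdict(list)
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        # Split into at most 8 fields so the command name stays in one piece.
        fields = line.split(None, 7)
        if len(fields) != 8:
            continue
        gpu, pid, ptype, sm, mem, _enc, _dec, command = fields
        if pid == "-":
            continue  # no process on this GPU
        by_gpu[int(gpu)].append({
            "pid": int(pid),
            "type": ptype,
            "sm": _to_int(sm),
            "mem": _to_int(mem),
            "command": command.strip(),
        })
    return dict(by_gpu)


if __name__ == "__main__":
    for gpu, processes in sorted(parse_pmon(SAMPLE_PMON).items()):
        print(gpu, [p["pid"] for p in processes])
```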
nvidia-smi -q -d PERFORMANCE
GPU 00000000:18:00.0
    Performance State : P0
    Clocks Throttle Reasons
        Idle : Not Active
        Applications Clocks Setting : Not Active
        SW Power Cap : Not Active
        HW Slowdown : Not Active
        HW Thermal Slowdown : Not Active
        HW Power Brake Slowdown : Not Active
        Sync Boost : Not Active
        SW Thermal Slowdown : Not Active
        Display Clock Setting : Not Active
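When many GPUs are checked at once, it helps to reduce the PERFORMANCE listing to the throttle reasons that are currently active. The following sketch assumes the `Name : State` layout shown above; `active_throttle_reasons` is a hypothetical helper, and the sample text is the listing from this section.

```python
# Sample text copied from the PERFORMANCE listing above.
SAMPLE_PERFORMANCE = """\
GPU 00000000:18:00.0
    Performance State : P0
    Clocks Throttle Reasons
        Idle : Not Active
        Applications Clocks Setting : Not Active
        SW Power Cap : Not Active
        HW Slowdown : Not Active
        HW Thermal Slowdown : Not Active
        HW Power Brake Slowdown : Not Active
        Sync Boost : Not Active
        SW Thermal Slowdown : Not Active
        Display Clock Setting : Not Active
"""


def active_throttle_reasons(text):
    """Map each GPU bus id to the throttle reasons reported as 'Active'."""
    result = {}
    gpu = None
    in_section = False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("GPU "):
            # A new GPU block starts; remember its bus id.
            gpu = stripped[len("GPU "):]
            result[gpu] = []
            in_section = False
        elif stripped == "Clocks Throttle Reasons":
            in_section = True
        elif in_section and gpu is not None and " : " in stripped:
            name, _, state = stripped.partition(" : ")
            if state.strip() == "Active":
                result[gpu].append(name.strip())
    return result


if __name__ == "__main__":
    print(active_throttle_reasons(SAMPLE_PERFORMANCE))
```

For the sample above every list is empty, which matches a GPU running at P0 with nothing holding its clocks down.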
nvidia-smi nvlink --status
GPU 0: Tesla V100-SXM2-32GB
    Link 0: 25.781 GB/s
    Link 1: 25.781 GB/s
    Link 2: 25.781 GB/s
    Link 3: 25.781 GB/s
    Link 4: 25.781 GB/s
    Link 5: 25.781 GB/s
[snip]
GPU 7: Tesla V100-SXM2-32GB
    Link 0: 25.781 GB/s
    Link 1: 25.781 GB/s
    Link 2: 25.781 GB/s
    Link 3: 25.781 GB/s
    Link 4: 25.781 GB/s
    Link 5: 25.781 GB/s
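A quick sanity check on a multi-GPU system is to add up the reported link speeds per GPU, so that a missing or inactive link stands out. The sketch below is illustrative only: `nvlink_totals` is a hypothetical helper, and it assumes the `Link N: X GB/s` line format shown above. Lines in any other form are simply not counted.

```python
import re

# Sample text copied from the `nvidia-smi nvlink --status` listing above.
SAMPLE_NVLINK = """\
GPU 0: Tesla V100-SXM2-32GB
    Link 0: 25.781 GB/s
    Link 1: 25.781 GB/s
    Link 2: 25.781 GB/s
    Link 3: 25.781 GB/s
    Link 4: 25.781 GB/s
    Link 5: 25.781 GB/s
[snip]
GPU 7: Tesla V100-SXM2-32GB
    Link 0: 25.781 GB/s
    Link 1: 25.781 GB/s
    Link 2: 25.781 GB/s
    Link 3: 25.781 GB/s
    Link 4: 25.781 GB/s
    Link 5: 25.781 GB/s
"""

_GPU = re.compile(r"^GPU (\d+): (.+)$")
_LINK = re.compile(r"^Link (\d+): ([\d.]+) GB/s$")


def nvlink_totals(text):
    """Sum the reported link speeds (GB/s) for each GPU index."""
    totals = {}
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        gpu = _GPU.match(stripped)
        if gpu:
            current = int(gpu.group(1))
            totals[current] = 0.0
            continue
        link = _LINK.match(stripped)
        if link and current is not None:
            totals[current] += float(link.group(2))
    return totals


if __name__ == "__main__":
    for gpu, total in sorted(nvlink_totals(SAMPLE_NVLINK).items()):
        print(f"GPU {gpu}: {total:.3f} GB/s across all reported links")
```

With six links at 25.781 GB/s each, the sum for each GPU in the sample is 154.686 GB/s.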
nvidia-smi nvlink --capabilities
GPU 0: Tesla V100-SXM2-32GB
    Link 0, P2P is supported: true
    Link 0, Access to system memory supported: true
    Link 0, P2P atomics supported: true
    Link 0, System memory atomics supported: true
    Link 0, SLI is supported: false
    Link 0, Link is supported: false
    [snip]
    Link 5, P2P is supported: true
    Link 5, Access to system memory supported: true
    Link 5, P2P atomics supported: true
    Link 5, System memory atomics supported: true
    Link 5, SLI is supported: false
    Link 5, Link is supported: false
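If you have any questions about these topics, please contact one of our HPC GPU experts.
Printing all GPU details
To list all available data on a particular GPU, specify the ID of the card with -i. Here is the output from an older Tesla GPU card: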
nvidia-smi -i 0 -q
==============NVSMI LOG==============
Timestamp : Mon Dec 5 22:05:49 2011
Driver Version : 270.41.19
Attached GPUs : 2
GPU 0:2:0
    Product Name : Tesla M2090
    Display Mode : Disabled
    Persistence Mode : Disabled
    Driver Model
        Current : N/A
        Pending : N/A
    Serial Number : 032251100xxxx
    GPU UUID : GPU-2b1486407f70xxxx-98bdxxxx-660cxxxx-1d6cxxxx-9fbd7e7cd9bf55a7cfb2xxxx
    Inforom Version
        OEM Object : 1.1
        ECC Object : 2.0
        Power Management Object : 4.0
    PCI
        Bus : 2
        Device : 0
        Domain : 0
        Device Id : 109110DE
        Bus Id : 0:2:0
    Fan Speed : N/A
    Memory Usage
        Total : 5375 Mb
        Used : 9 Mb
        Free : 5365 Mb
    Compute Mode : Default
    Utilization
        Gpu : 0 %
        Memory : 0 %
    Ecc Mode
        Current : Enabled
        Pending : Enabled
    ECC Errors
        Volatile
            Single Bit
                Device Memory : 0
                Register File : 0
                L1 Cache : 0
                L2 Cache : 0
                Total : 0
            Double Bit
                Device Memory : 0
                Register File : 0
                L1 Cache : 0
                L2 Cache : 0
                Total : 0
        Aggregate
            Single Bit
                Device Memory : 0
                Register File : 0
                L1 Cache : 0
                L2 Cache : 0
                Total : 0
            Double Bit
                Device Memory : 0
                Register File : 0
                L1 Cache : 0
                L2 Cache : 0
                Total : 0
    Temperature
        Gpu : N/A
    Power Readings
        Power State : P12
        Power Management : Supported
        Power Draw : 31.57 W
        Power Limit : 225 W
    Clocks
        Graphics : 50 MHz
        SM : 100 MHz
        Memory : 135 MHz
The example above shows an idle card. Here is an excerpt from a card running GPU-accelerated AMBER:
nvidia-smi -i 0 -q -d MEMORY,UTILIZATION,POWER,CLOCK,COMPUTE
==============NVSMI LOG==============
Timestamp : Mon Dec 5 22:32:00 2011
Driver Version : 270.41.19
Attached GPUs : 2
GPU 0:2:0
    Memory Usage
        Total : 5375 Mb
        Used : 1904 Mb
        Free : 3470 Mb
    Compute Mode : Default
    Utilization
        Gpu : 67 %
        Memory : 42 %
    Power Readings
        Power State : P0
        Power Management : Supported
        Power Draw : 109.83 W
        Power Limit : 225 W
    Clocks
        Graphics : 650 MHz
        SM : 1301 MHz
        Memory : 1848 MHz
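The raw readings are easier to compare across cards when expressed as a share of their limits. As a worked example using the AMBER excerpt above (the helper names `parse_quantity` and `percent_of` are hypothetical), the card is using roughly 35% of its memory capacity and drawing roughly 49% of its power limit. Note that the `Memory : 42 %` figure under Utilization is a separate measure of memory activity, not of capacity in use.

```python
def parse_quantity(text):
    """Split a reading such as '109.83 W' into (109.83, 'W')."""
    number, _, unit = text.strip().partition(" ")
    return float(number), unit.strip()


def percent_of(reading, limit):
    """Express one reading as a percentage of another with the same unit."""
    value, unit = parse_quantity(reading)
    limit_value, limit_unit = parse_quantity(limit)
    if unit != limit_unit:
        raise ValueError(f"unit mismatch: {unit!r} vs {limit_unit!r}")
    return 100.0 * value / limit_value


# Values copied from the AMBER excerpt above.
memory_pct = percent_of("1904 Mb", "5375 Mb")
power_pct = percent_of("109.83 W", "225 W")

if __name__ == "__main__":
    print(f"memory in use: {memory_pct:.1f}% of capacity")
    print(f"power draw:    {power_pct:.1f}% of the limit")
```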
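You will notice that, unfortunately, the earlier M-series passively cooled Tesla GPUs do not report temperatures to nvidia-smi. More recent Quadro and Tesla GPUs support a greater quantity of metrics data: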
==============NVSMI LOG==============
Timestamp : Mon Nov 5 14:50:59 2018
Driver Version : 410.48
Attached GPUs : 4
GPU 00000000:18:00.0
    Product Name : Tesla V100-PCIE-32GB
    Product Brand : Tesla
    Display Mode : Enabled
    Display Active : Disabled
    Persistence Mode : Disabled
    Accounting Mode : Disabled
    Accounting Mode Buffer Size : 4000
    Driver Model
        Current : N/A
        Pending : N/A
    Serial Number : 032161808xxxx
    GPU UUID : GPU-4965xxxx-79e3-7941-12cb-1dfe9c53xxxx
    Minor Number : 0
    VBios Version : 88.00.48.00.02
    MultiGPU Board : No
    Board ID : 0x1800
    GPU Part Number : 900-2G500-0010-000
    Inforom Version
        Image Version : G500.0202.00.02
        OEM Object : 1.1
        ECC Object : 5.0
        Power Management Object : N/A
    GPU Operation Mode
        Current : N/A
        Pending : N/A
    GPU Virtualization Mode
        Virtualization mode : None
    IBMNPU
        Relaxed Ordering Mode : N/A
    PCI
        Bus : 0x18
        Device : 0x00
        Domain : 0x0000
        Device Id : 0x1DB610DE
        Bus Id : 00000000:18:00.0
        Sub System Id : 0x124A10DE
        GPU Link Info
            PCIe Generation
                Max : 3
                Current : 3
            Link Width
                Max : 16x
                Current : 16x
        Bridge Chip
            Type : N/A
            Firmware : N/A
        Replays since reset : 0
        Tx Throughput : 31000 KB/s
        Rx Throughput : 155000 KB/s
    Fan Speed : N/A
    Performance State : P0
    Clocks Throttle Reasons
        Idle : Not Active
        Applications Clocks Setting : Not Active
        SW Power Cap : Not Active
        HW Slowdown : Not Active
        HW Thermal Slowdown : Not Active
        HW Power Brake Slowdown : Not Active
        Sync Boost : Not Active
        SW Thermal Slowdown : Not Active
        Display Clock Setting : Not Active
    FB Memory Usage
        Total : 32480 MiB
        Used : 31194 MiB
        Free : 1286 MiB
    BAR1 Memory Usage
        Total : 32768 MiB
        Used : 8 MiB
        Free : 32760 MiB
    Compute Mode : Default
    Utilization
        Gpu : 44 %
        Memory : 4 %
        Encoder : 0 %
        Decoder : 0 %
    Encoder Stats
        Active Sessions : 0
        Average FPS : 0
        Average Latency : 0
    FBC Stats
        Active Sessions : 0
        Average FPS : 0
        Average Latency : 0
    Ecc Mode
        Current : Enabled
        Pending : Enabled
    ECC Errors
        Volatile
            Single Bit
                Device Memory : 0
                Register File : 0
                L1 Cache : 0
                L2 Cache : 0
                Texture Memory : N/A
                Texture Shared : N/A
                CBU : N/A
                Total : 0
            Double Bit
                Device Memory : 0
                Register File : 0
                L1 Cache : 0
                L2 Cache : 0
                Texture Memory : N/A
                Texture Shared : N/A
                CBU : 0
                Total : 0
        Aggregate
            Single Bit
                Device Memory : 0
                Register File : 0
                L1 Cache : 0
                L2 Cache : 0
                Texture Memory : N/A
                Texture Shared : N/A
                CBU : N/A
                Total : 0
            Double Bit
                Device Memory : 0
                Register File : 0
                L1 Cache : 0
                L2 Cache : 0
                Texture Memory : N/A
                Texture Shared : N/A
                CBU : 0
                Total : 0
    Retired Pages
        Single Bit ECC : 0
        Double Bit ECC : 0
        Pending : No
    Temperature
        GPU Current Temp : 40 C
        GPU Shutdown Temp : 90 C
        GPU Slowdown Temp : 87 C
        GPU Max Operating Temp : 83 C
        Memory Current Temp : 39 C
        Memory Max Operating Temp : 85 C
    Power Readings
        Power Management : Supported
        Power Draw : 58.81 W
        Power Limit : 250.00 W
        Default Power Limit : 250.00 W
        Enforced Power Limit : 250.00 W
        Min Power Limit : 100.00 W
        Max Power Limit : 250.00 W
    Clocks
        Graphics : 1380 MHz
        SM : 1380 MHz
        Memory : 877 MHz
        Video : 1237 MHz
    Applications Clocks
        Graphics : 1230 MHz
        Memory : 877 MHz
    Default Applications Clocks
        Graphics : 1230 MHz
        Memory : 877 MHz
    Max Clocks
        Graphics : 1380 MHz
        SM : 1380 MHz
        Memory : 877 MHz
        Video : 1237 MHz
    Max Customer Boost Clocks
        Graphics : 1380 MHz
    Clock Policy
        Auto Boost : N/A
        Auto Boost Default : N/A
    Processes
        Process ID : 315406
            Type : C
            Name : /usr/bin/python
            Used GPU Memory : 31181 MiB
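For routine health checks it is usually simpler to request only the needed fields with `--query-gpu` than to parse the full `-q` report. The sketch below is one possible approach, not a definitive tool: the helper names are hypothetical, the query field names should be confirmed with `nvidia-smi --help-query-gpu` for the installed driver, and the sample line was assembled by hand from the V100 readings above rather than captured from a run. A field reported as not available would make `float()` raise `ValueError`, which this sketch does not handle.

```python
import subprocess

QUERY_FIELDS = [
    "temperature.gpu",  # degrees C
    "power.draw",       # W
    "power.limit",      # W
    "memory.used",      # MiB
    "memory.total",     # MiB
]


def parse_health_line(line):
    """Turn one 'csv,noheader,nounits' row into a dict of floats."""
    values = [float(part) for part in line.split(",")]
    if len(values) != len(QUERY_FIELDS):
        raise ValueError(f"expected {len(QUERY_FIELDS)} fields, got {len(values)}")
    return dict(zip(QUERY_FIELDS, values))


def summarize(health, slowdown_temp_c):
    """Derive headroom and usage percentages from the raw readings."""
    return {
        "thermal_headroom_c": slowdown_temp_c - health["temperature.gpu"],
        "power_pct": 100.0 * health["power.draw"] / health["power.limit"],
        "memory_pct": 100.0 * health["memory.used"] / health["memory.total"],
    }


def query_health(gpu_index=0):
    """Run nvidia-smi for one GPU (needs the NVIDIA driver installed)."""
    command = [
        "nvidia-smi", "-i", str(gpu_index),
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
    ]
    output = subprocess.run(command, check=True, capture_output=True, text=True).stdout
    return parse_health_line(output.strip().splitlines()[0])


if __name__ == "__main__":
    # Hand-assembled from the V100 listing above, not captured output.
    sample = parse_health_line("40, 58.81, 250.00, 31194, 32480")
    print(summarize(sample, slowdown_temp_c=87))
```

With the readings above, the card sits 47 C below its slowdown temperature while about 96% of its frame buffer is in use.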
-pm, --persistence-mode= Set persistence mode: 0/DISABLED, 1/ENABLED
-e, --ecc-config= Toggle ECC support: 0/DISABLED, 1/ENABLED
-p, --reset-ecc-errors= Reset ECC error counts: 0/VOLATILE, 1/AGGREGATE
-c, --compute-mode= Set MODE for compute applications:
        0/DEFAULT, 1/EXCLUSIVE_PROCESS,
        2/PROHIBITED
--gom= Set GPU Operation Mode:
        0/ALL_ON, 1/COMPUTE, 2/LOW_DP
-r --gpu-reset Trigger reset of the GPU.
        Can be used to reset the GPU HW state in situations
        that would otherwise require a machine reboot.
        Typically useful if a double bit ECC error has
        occurred.
        Reset operations are not guarenteed to work in
        all cases and should be used with caution.
-vm --virt-mode= Switch GPU Virtualization Mode:
        Sets GPU virtualization mode to 3/VGPU or 4/VSGA
        Virtualization mode of a GPU can only be set when
        it is running on a hypervisor.
-lgc --lock-gpu-clocks= Specifies clocks as a
        pair (e.g. 1500,1500) that defines the range
        of desired locked GPU clock speed in MHz.
        Setting this will supercede application clocks
        and take effect regardless if an app is running.
        Input can also be a singular desired clock value
        (e.g. ).
-rgc --reset-gpu-clocks
        Resets the Gpu clocks to the default values.
-ac --applications-clocks= Specifies clocks as a
        pair (e.g. 2000,800) that defines GPU's
        speed in MHz while running applications on a GPU.
-rac --reset-applications-clocks
        Resets the applications clocks to the default values.
-acp --applications-clocks-permission=
        Toggles permission requirements for -ac and -rac commands:
        0/UNRESTRICTED, 1/RESTRICTED
-pl --power-limit= Specifies maximum power management limit in watts.
-am --accounting-mode= Enable or disable Accounting Mode: 0/DISABLED, 1/ENABLED
-caa --clear-accounted-apps
        Clears all the accounted PIDs in the buffer.
--auto-boost-default= Set the default auto boost policy to 0/DISABLED
        or 1/ENABLED, enforcing the change only after the
        last boost client has exited.
--auto-boost-permission=
        Allow non-admin/root control over auto boost mode:
        0/UNRESTRICTED, 1/RESTRICTED
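When these settings are applied from a provisioning script, it can be useful to build and validate the command lines before running anything. The sketch below only constructs argument lists for the `-pm` and `-pl` options listed above. `build_settings_commands` is a hypothetical helper, and the default 100 W and 250 W bounds are taken from the Min and Max Power Limit readings in the V100 listing, so they are assumptions that will differ on other cards. Changing these settings typically requires root privileges.

```python
def build_settings_commands(gpu_index, persistence=None, power_limit_w=None,
                            min_limit_w=100.0, max_limit_w=250.0):
    """Return one nvidia-smi argument list per requested setting.

    Nothing is executed here; each list can be passed to subprocess.run().
    """
    base = ["nvidia-smi", "-i", str(gpu_index)]
    commands = []
    if persistence is not None:
        # -pm accepts 0 (DISABLED) or 1 (ENABLED), per the option list above.
        commands.append(base + ["-pm", "1" if persistence else "0"])
    if power_limit_w is not None:
        if not (min_limit_w <= power_limit_w <= max_limit_w):
            raise ValueError(
                f"power limit {power_limit_w} W is outside "
                f"{min_limit_w}-{max_limit_w} W"
            )
        # -pl takes the limit in watts.
        commands.append(base + ["-pl", f"{power_limit_w:g}"])
    return commands


if __name__ == "__main__":
    for command in build_settings_commands(0, persistence=True, power_limit_w=200):
        print(" ".join(command))
```

Keeping one setting per command line is a deliberately cautious choice, so that a failure can be traced to a single option.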