What does the nvidia-smi command do? [closed]

【Posted】: 2021-12-27 08:42:03

【Question】:

I would like to know what this command does. Does it just free up GPU memory and do nothing else?


【Answer 1】:

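Resets the GPU state. Can be used to clear double-bit ECC errors or to recover a hung GPU. Requires the -i switch to target a specific device. Available on Linux only. https://developer.download.nvidia.com/compute/DCGM/docs/nvidia-smi-367.38.pdf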
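To make the shape of that command concrete, here is a minimal sketch that only builds and prints the reset invocation for one device. The helper name `build_reset_command` is made up for illustration; the `--gpu-reset` and `-i` flags are the ones from the documentation quoted in this thread. Nothing is executed, since a real reset needs root and an idle GPU.

```python
import shlex


def build_reset_command(gpu_index):
    """Return the argv list for resetting a single GPU by its 0-based index.

    Hypothetical helper: it only assembles the command, it does not run it.
    """
    index = int(gpu_index)
    if index < 0:
        raise ValueError("gpu_index must be a non-negative integer")
    # --gpu-reset triggers the reset; -i selects the target device.
    return ["nvidia-smi", "--gpu-reset", "-i", str(index)]


if __name__ == "__main__":
    # Print the command instead of running it.
    print(shlex.join(build_reset_command(0)))
```

If you do decide to run it, passing the list to `subprocess.run(..., check=True)` from a root shell would be the usual way, but treat that as something to try on a test machine first.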

【Comments】:

Is this a dangerous command? Or does it just reset the memory?

【Answer 2】:

From the nvidia-smi help (man nvidia-smi):

-r, --gpu-reset
       Trigger a reset of one or more GPUs. Can be used to clear GPU HW and SW state in situations that would otherwise require a machine reboot. Typically useful if a double bit ECC error has occurred. Optional -i switch can be used to target one or more specific devices. Without this option, all GPUs are reset. Requires root. There can't be any applications using these devices (e.g. CUDA application, graphics application like X server, monitoring application like other instance of nvidia-smi). There also can't be any compute applications running on any other GPU in the system.

       Starting with the NVIDIA Ampere architecture, GPUs with NVLink connections can be individually reset.  On NVSwitch systems, Fabric Manager is required to facilitate reset.

       If Fabric Manager is not running, or if any of the GPUs being reset are based on an architecture preceding the NVIDIA Ampere architecture, any GPUs with NVLink connections to a GPU being reset must also be reset in the same command. This can be done either by omitting the -i switch, or using the -i switch to specify the GPUs to be reset. If the -i option does not specify a complete set of NVLink GPUs to reset, this command will issue an error identifying the additional GPUs that must be included in the reset command.

       GPU reset is not guaranteed to work in all cases. It is not recommended for production environments at this time. In some situations there may be HW components on the board that fail to revert back to an initial state following the reset request. This is more likely to be seen on Fermi-generation products vs. Kepler, and more likely to be seen if the reset is being performed on a hung GPU.

       Following a reset, it is recommended that the health of each reset GPU be verified before further use. If any GPU is not healthy a complete reset should be instigated by power cycling the node.

       GPU reset operation will not be supported on MIG enabled vGPU guests.

       Visit http://developer.nvidia.com/gpu-deployment-kit to download the GDK.
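Since the man page says no compute applications may be running when the reset is issued, a pre-check could look roughly like the sketch below. It assumes the CSV output of `nvidia-smi --query-compute-apps=pid,process_name --format=csv,noheader` has one `pid, name` pair per line; the function names are invented for this example. Note that this query lists compute processes only, so it would not catch an X server or another nvidia-smi instance.

```python
import subprocess


def parse_compute_apps(csv_text):
    """Parse 'pid, process_name' lines into a list of dicts.

    Assumes the --format=csv,noheader layout; blank lines are skipped.
    """
    apps = []
    for line in csv_text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first comma only, in case the process name contains one.
        pid, name = [field.strip() for field in line.split(",", 1)]
        apps.append({"pid": int(pid), "name": name})
    return apps


def list_compute_apps():
    """Ask nvidia-smi for running compute processes (requires the NVIDIA driver)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-compute-apps=pid,process_name", "--format=csv,noheader"],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_compute_apps(result.stdout)


def safe_to_reset(apps):
    """Return True only when no compute process was reported."""
    return len(apps) == 0
```

On a machine with the driver installed, `safe_to_reset(list_compute_apps())` gives a quick yes/no before attempting the reset. It is only a first filter; the health check after the reset that the man page recommends is still needed.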
