How to activate the use of a GPU on an AWS EC2 instance?

Posted 2023-03-16


【Title】: How to activate the use of a GPU on AWS EC2 instance? 【Posted】: 2020-12-20 15:59:25 【Question】:

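I am using AWS to train a CNN on a custom dataset. I launched a p2.xlarge instance, uploaded my (Python) scripts to the virtual machine, and am running my code via the CLI.

I activated a virtual environment for TensorFlow(+Keras2) with Python3 (CUDA 10.0 and Intel MKL-DNN), which is one of the default options provided by AWS.

I am now running my code to train the network, but it feels like the GPU is not "activated". Training is just as fast (slow) as when I run it locally on a CPU.

This is the script I am running: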

https://github.com/AntonMu/TrainYourOwnYOLO/blob/master/2_Training/Train_YOLO.py

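I also tried changing it by putting with tf.device('/device:GPU:0'): after the parser (line 142) and indenting everything below it. However, this did not seem to change anything.

Any tips on how to activate the GPU (or check whether the GPU is activated)?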

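【Question comments】:

【Answer 1】:

In the end it had to do with my tensorflow package! I had to uninstall tensorflow and install tensorflow-gpu. After that, the GPU was used automatically.

For documentation, see: https://www.tensorflow.org/install/gpu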
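
To see which of the two packages is actually installed in the active environment, something like the following sketch may help. It assumes Python 3.8+ (for importlib.metadata), and the helper name installed_tensorflow_packages is made up for illustration. The split into a CPU-only tensorflow package and a separate tensorflow-gpu package applies to older releases such as the 1.x line; newer releases may be packaged differently.

```python
from importlib import metadata


def installed_tensorflow_packages(names=("tensorflow", "tensorflow-gpu")):
    """Map each distribution name to its installed version, or None if absent.

    Hypothetical helper: it only inspects pip metadata and never imports TensorFlow.
    """
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, version in installed_tensorflow_packages().items():
        print(f"{name}: {version or 'not installed'}")
```

If only tensorflow shows a version on an old release, that would be consistent with the CPU-only behaviour described in the question.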

【Comments】:

【Answer 2】:

See this answer for how to list the available GPUs.

from tensorflow.python.client import device_lib

def get_available_gpus():
    local_device_protos = device_lib.list_local_devices()
    return [x.name for x in local_device_protos if x.device_type == 'GPU']
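
As a possible alternative to the device_lib helper above, the public tf.config API can report the same information, together with whether the installed build was compiled with CUDA at all. This is only a sketch: it assumes TensorFlow 1.14 or later (where list_physical_devices exists, possibly only under tf.config.experimental), and tensorflow_gpu_report is a made-up name.

```python
def tensorflow_gpu_report():
    """Summarise TensorFlow's view of the GPU, or return None if TensorFlow cannot be imported."""
    try:
        import tensorflow as tf
    except ImportError:
        return None

    # Newer releases expose list_physical_devices directly on tf.config;
    # some older ones only provide it under tf.config.experimental (assumption: TF >= 1.14).
    lister = getattr(tf.config, "list_physical_devices", None)
    if lister is None:
        lister = tf.config.experimental.list_physical_devices

    return {
        # False means this build has no CUDA support, whatever hardware is present.
        "built_with_cuda": bool(tf.test.is_built_with_cuda()),
        # Empty list means TensorFlow sees no usable GPU.
        "gpus": [device.name for device in lister("GPU")],
    }


if __name__ == "__main__":
    print(tensorflow_gpu_report())
```

A report with built_with_cuda set to False would point at the installed package rather than at the instance or the driver.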

You can also use CUDA to list the current device and, if necessary, set the device.

import torch

print(torch.cuda.is_available())
print(torch.cuda.current_device())

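【Comments】:

Thanks for your answer! I can also see the available GPU by running nvidia-smi. So I know there is a GPU available; it just isn't being used when I run my code. And that is exactly the problem I am trying to solve.

Do you get any errors when setting the device?

Hi Miles. Thanks again for your comment! I tried that and did not get any errors. I ran the following commands, with the following outputs: torch.cuda.is_available() --> True, torch.cuda.is_initialized() --> False, torch.cuda.set_device(0), torch.cuda.is_initialized() --> True. However, processing speed did not improve and, unfortunately, nvidia-smi still tells me there are "no running processes". I also ran the first command you proposed, but strangely it did not return a GPU. The output was: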

[name: "/device:CPU:0" device_type: "CPU" memory_limit: 268435456 locality   incarnation: 15573112437867445376 , name: "/device:XLA_CPU:0" device_type: "XLA_CPU" memory_limit: 17179869184 locality   incarnation: 9660188961145538128 physical_device_desc: "device: XLA_CPU device" ]

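If the first command does not return any GPU, then the GPU is not available to tensorflow. There could be several causes, but I would 1) check that the GPU is visible to the operating system: lspci | grep VGA should return the NVIDIA GPU. 2) Check that your versions of tensorflow and CUDA support your GPU. What AMI are you using?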
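
The operating-system check in step 1 can also be scripted. The sketch below is only one way to do it and rests on a few assumptions: the helper names are hypothetical, Python 3.7+ is required for capture_output, and it filters lspci output for "nvidia" rather than "VGA", because some compute cards may be listed under a PCI class other than VGA.

```python
import shutil
import subprocess


def run_if_available(command):
    """Run command (a list of strings) and return its stdout.

    Returns None if the tool is not on PATH, times out, or exits with an error.
    """
    if shutil.which(command[0]) is None:
        return None
    try:
        completed = subprocess.run(command, capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return None
    return completed.stdout if completed.returncode == 0 else None


def nvidia_pci_devices():
    """Return lspci lines that mention NVIDIA (None if lspci is unavailable)."""
    output = run_if_available(["lspci"])
    if output is None:
        return None
    return [line for line in output.splitlines() if "nvidia" in line.lower()]


if __name__ == "__main__":
    print("NVIDIA PCI devices:", nvidia_pci_devices())
    print(run_if_available(["nvidia-smi"]) or "nvidia-smi not available or failed")
```

If the PCI list is empty, the problem lies below TensorFlow (instance type or driver); if the card shows up here but TensorFlow still reports no GPU, the package or CUDA version is the more likely suspect.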
