Training an object detection model with a GPU on Colab using TuriCreate

Posted 2023-03-16

[Asked] 2021-10-27 14:31:48

[Question]:

I am trying to train an object detection model on Google Colab using TuriCreate with a GPU.

According to the TuriCreate repository, using the GPU during training requires following these instructions:

https://github.com/apple/turicreate/blob/main/LinuxGPU.md

However, every time I start training, the following message is printed before training begins:

"Using CPU to create model."

My Colab notebook is structured as follows:

Setting up the CUDA environment

!wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-ubuntu1804.pin
!sudo mv cuda-ubuntu1804.pin /etc/apt/preferences.d/cuda-repository-pin-600
!sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub
!sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/ /"
!sudo apt-get update

!wget http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb

!sudo apt install ./nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb
!sudo apt-get update

!wget https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/libnvinfer7_7.1.3-1+cuda11.0_amd64.deb
!sudo apt install ./libnvinfer7_7.1.3-1+cuda11.0_amd64.deb
!sudo apt-get update

# Install development and runtime libraries (~4GB)
!sudo apt-get install --no-install-recommends \
    cuda-11-0 \
    libcudnn8=8.0.4.30-1+cuda11.0  \
    libcudnn8-dev=8.0.4.30-1+cuda11.0

# Install TensorRT. Requires that libcudnn8 is installed above.
!sudo apt-get install -y --no-install-recommends libnvinfer7=7.1.3-1+cuda11.0 \
    libnvinfer-dev=7.1.3-1+cuda11.0 \
    libnvinfer-plugin7=7.1.3-1+cuda11.0

Checking the installation with nvidia-smi

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   33C    P8    27W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
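
The same information can also be read programmatically, which is easier to compare across runtime restarts than the full table. The following is a minimal sketch, not part of the original notebook: the helper names `parse_gpu_csv` and `query_gpus` are hypothetical, and it assumes `nvidia-smi` is on the PATH and accepts its `--query-gpu` and `--format=csv,noheader` options.

```python
import shutil
import subprocess

# Fields requested from nvidia-smi; the same names are used as dict keys.
FIELDS = ("name", "driver_version", "memory.total")


def parse_gpu_csv(text):
    """Turn `--format=csv,noheader` output into one dict per GPU."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        values = [value.strip() for value in line.split(",")]
        rows.append(dict(zip(FIELDS, values)))
    return rows


def query_gpus():
    """Return a list of GPU dicts, or None when nvidia-smi is unavailable or fails."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=" + ",".join(FIELDS), "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return None
    return parse_gpu_csv(result.stdout)


if __name__ == "__main__":
    print(query_gpus())
```

A `None` result would suggest that the runtime has no GPU attached or that the driver tools are missing, which is a different problem from TuriCreate falling back to the CPU.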

Installing dependencies

!pip install turicreate
!pip uninstall -y tensorflow
!pip install tensorflow-gpu

Setting the bash environment variable

!echo export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH >> ~/.bashrc
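
Because the line above only appends to ~/.bashrc, it is not obvious that the already running notebook kernel ever sees the variable. A minimal sketch for checking what the Python process actually sees is shown here; the helper names `find_cuda_runtime` and `path_entries` are hypothetical, and the default directory is simply the one used in the export line.

```python
import glob
import os


def find_cuda_runtime(lib_dir="/usr/local/cuda/lib64"):
    """Return the sorted libcudart files found in `lib_dir` (empty list if none)."""
    return sorted(glob.glob(os.path.join(lib_dir, "libcudart.so*")))


def path_entries(name="LD_LIBRARY_PATH"):
    """Return the non-empty entries of a colon-separated environment variable."""
    return [entry for entry in os.environ.get(name, "").split(":") if entry]


if __name__ == "__main__":
    print("CUDA runtime files:", find_cuda_runtime())
    print("LD_LIBRARY_PATH entries seen by this process:", path_entries())
```

If the first list is empty, the CUDA install likely did not land where the export line expects; if the second list lacks the CUDA directory, the .bashrc change probably has not reached the kernel.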

Training

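import turicreate as tc

# -1 asks TuriCreate to use all available GPUs
tc.config.set_num_gpus(-1)
# train_sf and valid_sf are SFrames prepared earlier in the notebook
model = tc.object_detector.create(train_sf)
scores = model.evaluate(valid_sf)
print(scores['mean_average_precision'])
model.export_coreml('model.mlmodel')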

This is the output:

TuriCreate currently only supports using one GPU. Setting 'num_gpus' to 1.
Using 'image' as feature column
Using 'annotations' as annotations column

Using CPU to create model.

Setting 'batch_size' to 32

I can't figure out what I'm missing.

[Comments]:

Why not use TensorFlow or Keras?

[Answer 1]:

I managed to solve this. The problem was caused by the TensorFlow version preinstalled on the Colab machine; I replaced it as follows:

!pip uninstall -y tensorflow
!pip uninstall -y tensorflow-gpu
!pip install turicreate
!pip install tensorflow==2.4.0
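
To confirm that the reinstall took effect, the installed versions and the GPUs visible to TensorFlow can be printed before training. This is a minimal sketch rather than part of the original answer: the helper names `installed_version` and `visible_gpus` are hypothetical, and the runtime may need a restart first if TensorFlow was already imported in the session.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def visible_gpus():
    """Return the names of GPUs TensorFlow can see, or None if TensorFlow is not importable."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    return [device.name for device in tf.config.list_physical_devices("GPU")]


if __name__ == "__main__":
    for package in ("turicreate", "tensorflow", "tensorflow-gpu"):
        print(package, "->", installed_version(package))
    print("GPUs visible to TensorFlow:", visible_gpus())
```

With the versions from this answer, `tensorflow` would be expected to report 2.4.0 and `tensorflow-gpu` to report `None`; an empty GPU list would point back at the CUDA setup rather than at TuriCreate.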
