Import error while launching PyTorch Lightning project on Colab TPU

Posted: 2022-01-05 05:17:03

Question:

I followed the guide to launch my PyTorch Lightning project on a Google Colab TPU. So I installed

!pip install cloud-tpu-client==0.10 https://storage.googleapis.com/tpu-pytorch/wheels/torch_xla-1.9-cp37-cp37m-linux_x86_64.whl

then

!pip install pytorch-lightning

and then I ran

!pip install torch torchvision torchaudio 
!pip install -r requirements.txt

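After installing the project requirements, I restarted the runtime as prompted and reran everything from the top: the cloud-tpu-client install, the pytorch-lightning install, and these two commands. It all ran smoothly.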

But right after the TPU started up with PyTorch version 1.9, I got the following import error:

WARNING:root:TPU has started up successfully with version pytorch-1.9
        Traceback (most recent call last):
          File "synthesizer_train.py", line 2, in <module>
            from synthesizer.train import train
          File "/content/Real-Time-Voice-Cloning/synthesizer/train.py", line 6, in <module>
            from synthesizer.models.tacotron import Tacotron
          File "/content/Real-Time-Voice-Cloning/synthesizer/models/tacotron.py", line 7, in <module>
            import pytorch_lightning as pl
          File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/__init__.py", line 20, in <module>
            from pytorch_lightning.callbacks import Callback  # noqa: E402
          File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/callbacks/__init__.py", line 14, in <module>
            from pytorch_lightning.callbacks.base import Callback
          File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/callbacks/base.py", line 26, in <module>
            from pytorch_lightning.utilities.types import STEP_OUTPUT
          File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/utilities/__init__.py", line 18, in <module>
            from pytorch_lightning.utilities.apply_func import move_data_to_device  # noqa: F401
          File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/utilities/apply_func.py", line 26, in <module>
            from pytorch_lightning.utilities.imports import _compare_version, _TORCHTEXT_AVAILABLE
          File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/utilities/imports.py", line 101, in <module>
            from pytorch_lightning.utilities.xla_device import XLADeviceUtils  # noqa: E402
          File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/utilities/xla_device.py", line 24, in <module>
            import torch_xla.core.xla_model as xm
          File "/usr/local/lib/python3.7/dist-packages/torch_xla/__init__.py", line 142, in <module>
            import _XLAC
        ImportError: /usr/local/lib/python3.7/dist-packages/_XLAC.cpython-37m-x86_64-linux-gnu.so: undefined symbol: _ZN2at13_foreach_erf_EN3c108ArrayRefINS_6TensorEEE

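The Trainer is launched with the flag tpu_cores=8.

The model had previously run on CPU and GPU (in a separate session).

Since Colab ships with Torch 1.10.0+cu111, I tried downgrading PyTorch to 1.9 (the version reported when the TPU starts up) and got a different error: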

WARNING:root:TPU has started up successfully with version pytorch-1.9
Traceback (most recent call last):
  File "synthesizer_train.py", line 2, in <module>
    from synthesizer.train import train
  File "/content/Real-Time-Voice-Cloning/synthesizer/train.py", line 6, in <module>
    from synthesizer.models.tacotron import Tacotron
  File "/content/Real-Time-Voice-Cloning/synthesizer/models/tacotron.py", line 7, in <module>
    import pytorch_lightning as pl
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/__init__.py", line 20, in <module>
    from pytorch_lightning.callbacks import Callback  # noqa: E402
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/callbacks/__init__.py", line 14, in <module>
    from pytorch_lightning.callbacks.base import Callback
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/callbacks/base.py", line 26, in <module>
    from pytorch_lightning.utilities.types import STEP_OUTPUT
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/utilities/__init__.py", line 18, in <module>
    from pytorch_lightning.utilities.apply_func import move_data_to_device  # noqa: F401
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/utilities/apply_func.py", line 29, in <module>
    if _compare_version("torchtext", operator.ge, "0.9.0"):
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/utilities/imports.py", line 54, in _compare_version
    pkg = importlib.import_module(package)
  File "/usr/lib/python3.7/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "/usr/local/lib/python3.7/dist-packages/torchtext/__init__.py", line 5, in <module>
    from . import vocab
  File "/usr/local/lib/python3.7/dist-packages/torchtext/vocab/__init__.py", line 11, in <module>
    from .vocab_factory import (
  File "/usr/local/lib/python3.7/dist-packages/torchtext/vocab/vocab_factory.py", line 4, in <module>
    from torchtext._torchtext import (
ImportError: /usr/local/lib/python3.7/dist-packages/torchtext/_torchtext.so: undefined symbol: _ZTVN5torch3jit6MethodE

What can I do to train the model on the TPU?

Many thanks.

Answer 1:

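As it turns out, the same problem had already been reported, and the suggested solution did work for me.

The suggestion is to downgrade PyTorch to 1.9.0+cu111 (note the +cu111) after installing torch_xla.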
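Both tracebacks end in an "undefined symbol" error from a compiled extension, which is what typically happens when a binary wheel (torch_xla, torchtext) was built against a different PyTorch release than the one installed. The following is a minimal sketch, not part of the original answer, of a check that compares the major.minor release of two version strings. The helper names (`major_minor`, `same_release`, `installed_version`) are hypothetical, and `installed_version` relies on `importlib.metadata`, which needs Python 3.8 or later; on the Python 3.7 runtime shown in the tracebacks it simply returns None.

```python
import re
from typing import Optional, Tuple


def major_minor(version: str) -> Tuple[int, int]:
    """Return (major, minor) from strings such as '1.9', '1.9.0+cu111' or '1.10.0'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def same_release(torch_version: str, xla_version: str) -> bool:
    """True when both versions share the same major.minor release."""
    return major_minor(torch_version) == major_minor(xla_version)


def installed_version(package: str) -> Optional[str]:
    """Read a version from pip metadata without importing the package.

    Assumption: importlib.metadata is available (Python 3.8+); otherwise None.
    """
    try:
        from importlib.metadata import PackageNotFoundError, version
    except ImportError:
        return None
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    torch_v = installed_version("torch")
    xla_v = installed_version("torch_xla")
    if torch_v is None or xla_v is None:
        print("could not determine both versions:", torch_v, xla_v)
    elif same_release(torch_v, xla_v):
        print("torch and torch_xla agree:", torch_v, xla_v)
    else:
        print("version mismatch:", torch_v, "vs", xla_v)
```

With the versions from the question, `same_release("1.10.0+cu111", "1.9")` is False, while the downgraded `same_release("1.9.0+cu111", "1.9")` is True. Comparing integer tuples rather than raw strings avoids treating "1.10" as older than "1.9".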

So here are the steps I followed to launch my Lightning project on Google Colab with a TPU:

!pip install cloud-tpu-client==0.10 https://storage.googleapis.com/tpu-pytorch/wheels/torch_xla-1.9-cp37-cp37m-linux_x86_64.whl
!pip install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchtext==0.10.0 -f https://download.pytorch.org/whl/cu111/torch_stable.html

Then the pip installs for the project:

!pip install torch torchvision torchaudio pytorch-lightning
!pip install -r requirements.txt

It worked, even though I had to restart the runtime after the last step.
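After restarting the runtime, a quick smoke test can confirm that torch_xla imports cleanly before launching the full training script. This is a hedged sketch rather than something from the original answer: the function name `tpu_smoke_test` is hypothetical, and it assumes the `torch_xla.core.xla_model` module seen in the traceback, using its `xla_device()` call to obtain the device.

```python
def tpu_smoke_test() -> str:
    """Try to import torch_xla and run one tiny computation on the XLA device."""
    try:
        import torch
        import torch_xla.core.xla_model as xm
    except ImportError as exc:
        # An "undefined symbol" failure like the ones above surfaces here.
        return f"import failed: {exc}"

    try:
        device = xm.xla_device()
        x = torch.ones(2, 2, device=device)
        total = (x + x).sum().item()
    except RuntimeError as exc:
        return f"device check failed: {exc}"

    return f"ok: computed {total} on {device}"


if __name__ == "__main__":
    print(tpu_smoke_test())
```

If this reports an import failure, the installed torch and torch_xla versions most likely disagree again, for example because a later pip install replaced the pinned torch.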
