Import error while launching PyTorch Lightning project on Colab TPU
【Posted】: 2022-01-05 05:17:03
【Question】: I followed the guide to launch my PyTorch Lightning project on a Google Colab TPU. First I installed
!pip install cloud-tpu-client==0.10 https://storage.googleapis.com/tpu-pytorch/wheels/torch_xla-1.9-cp37-cp37m-linux_x86_64.whl
then
!pip install pytorch-lightning
and then I ran
!pip install torch torchvision torchaudio
!pip install -r requirements.txt
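After installing the project requirements, I restarted the runtime as prompted and re-ran, from the top, the cloud-tpu-client installation, the pytorch-lightning installation, and these two commands. Everything ran without errors.
But right after the TPU started up with PyTorch version 1.9, I got the following import error: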
WARNING:root:TPU has started up successfully with version pytorch-1.9
Traceback (most recent call last):
  File "synthesizer_train.py", line 2, in <module>
    from synthesizer.train import train
  File "/content/Real-Time-Voice-Cloning/synthesizer/train.py", line 6, in <module>
    from synthesizer.models.tacotron import Tacotron
  File "/content/Real-Time-Voice-Cloning/synthesizer/models/tacotron.py", line 7, in <module>
    import pytorch_lightning as pl
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/__init__.py", line 20, in <module>
    from pytorch_lightning.callbacks import Callback # noqa: E402
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/callbacks/__init__.py", line 14, in <module>
    from pytorch_lightning.callbacks.base import Callback
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/callbacks/base.py", line 26, in <module>
    from pytorch_lightning.utilities.types import STEP_OUTPUT
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/utilities/__init__.py", line 18, in <module>
    from pytorch_lightning.utilities.apply_func import move_data_to_device # noqa: F401
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/utilities/apply_func.py", line 26, in <module>
    from pytorch_lightning.utilities.imports import _compare_version, _TORCHTEXT_AVAILABLE
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/utilities/imports.py", line 101, in <module>
    from pytorch_lightning.utilities.xla_device import XLADeviceUtils # noqa: E402
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/utilities/xla_device.py", line 24, in <module>
    import torch_xla.core.xla_model as xm
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/__init__.py", line 142, in <module>
    import _XLAC
ImportError: /usr/local/lib/python3.7/dist-packages/_XLAC.cpython-37m-x86_64-linux-gnu.so: undefined symbol: _ZN2at13_foreach_erf_EN3c108ArrayRefINS_6TensorEEE
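The Trainer is launched with the flag tpu_cores=8.
The model had previously been run on CPU and GPU (i.e. in another session).
Since Colab was using Torch 1.10.0+cu111, I tried downgrading PyTorch to 1.9 (the same version reported at TPU startup) and got a different error: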
WARNING:root:TPU has started up successfully with version pytorch-1.9
Traceback (most recent call last):
  File "synthesizer_train.py", line 2, in <module>
    from synthesizer.train import train
  File "/content/Real-Time-Voice-Cloning/synthesizer/train.py", line 6, in <module>
    from synthesizer.models.tacotron import Tacotron
  File "/content/Real-Time-Voice-Cloning/synthesizer/models/tacotron.py", line 7, in <module>
    import pytorch_lightning as pl
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/__init__.py", line 20, in <module>
    from pytorch_lightning.callbacks import Callback # noqa: E402
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/callbacks/__init__.py", line 14, in <module>
    from pytorch_lightning.callbacks.base import Callback
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/callbacks/base.py", line 26, in <module>
    from pytorch_lightning.utilities.types import STEP_OUTPUT
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/utilities/__init__.py", line 18, in <module>
    from pytorch_lightning.utilities.apply_func import move_data_to_device # noqa: F401
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/utilities/apply_func.py", line 29, in <module>
    if _compare_version("torchtext", operator.ge, "0.9.0"):
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/utilities/imports.py", line 54, in _compare_version
    pkg = importlib.import_module(package)
  File "/usr/lib/python3.7/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "/usr/local/lib/python3.7/dist-packages/torchtext/__init__.py", line 5, in <module>
    from . import vocab
  File "/usr/local/lib/python3.7/dist-packages/torchtext/vocab/__init__.py", line 11, in <module>
    from .vocab_factory import (
  File "/usr/local/lib/python3.7/dist-packages/torchtext/vocab/vocab_factory.py", line 4, in <module>
    from torchtext._torchtext import (
ImportError: /usr/local/lib/python3.7/dist-packages/torchtext/_torchtext.so: undefined symbol: _ZTVN5torch3jit6MethodE
What can I do to train the model on a TPU?
Thanks a lot.
【Answer 1】: Actually, the same problem has been described before, and the suggested solution did work for me.
They suggest downgrading PyTorch to 1.9.0+cu111 (note the +cu111 suffix) after installing torch_xla.
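Both tracebacks in the question end in an `undefined symbol` error raised by a compiled extension, which usually indicates that the extension was built against a different PyTorch release than the one currently installed. A minimal sketch for checking this without importing either package (the import itself is what fails) could look like the following. The helper names are hypothetical, and the sketch assumes that the `importlib_metadata` backport is available on Python 3.7 and that the wheel registers its metadata under the name `torch_xla`.

```python
import re

try:  # Python 3.8+
    from importlib.metadata import PackageNotFoundError, version
except ImportError:  # Python 3.7 (the runtime in the tracebacks) needs the backport
    from importlib_metadata import PackageNotFoundError, version


def release_of(version_string):
    """Return (major, minor) from strings such as '1.9.0+cu111' or '1.9'."""
    match = re.match(r"(\d+)\.(\d+)", version_string)
    if match is None:
        raise ValueError(f"unrecognised version: {version_string!r}")
    return int(match.group(1)), int(match.group(2))


def same_release(torch_version, xla_version):
    """True when both version strings share the same major.minor release."""
    return release_of(torch_version) == release_of(xla_version)


def installed_version(package):
    """Read a version from package metadata without importing the package."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    torch_v = installed_version("torch")
    xla_v = installed_version("torch_xla")  # assumed distribution name
    if torch_v is None or xla_v is None:
        print(f"torch={torch_v}, torch_xla={xla_v}: at least one package is not installed")
    elif same_release(torch_v, xla_v):
        print(f"torch {torch_v} and torch_xla {xla_v} are from the same release")
    else:
        print(f"mismatch: torch {torch_v} vs torch_xla {xla_v}")
```

In the situation described in the question, this would be expected to report a mismatch between Torch 1.10.0+cu111 and the torch_xla 1.9 wheel.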
So these are the steps I followed to launch my Lightning project on Google Colab with a TPU:
!pip install cloud-tpu-client==0.10 https://storage.googleapis.com/tpu-pytorch/wheels/torch_xla-1.9-cp37-cp37m-linux_x86_64.whl
!pip install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchtext==0.10.0 -f https://download.pytorch.org/whl/cu111/torch_stable.html
followed by the project's pip installs:
!pip install torch torchvision torchaudio pytorch-lightning
!pip install -r requirements.txt
It kept working even though I had to restart the runtime after the last step.
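Because the later `pip install` steps and `requirements.txt` may pull in other versions, it can be worth confirming that the pins are still in place before starting training. The following is only a sketch: the expected versions are copied from the install command above, the function names are hypothetical, and on Python 3.7 it assumes the `importlib_metadata` backport is installed.

```python
try:  # Python 3.8+
    from importlib.metadata import PackageNotFoundError, version
except ImportError:  # Python 3.7 needs the backport
    from importlib_metadata import PackageNotFoundError, version

# Pins copied from the install command in this answer.
EXPECTED = {
    "torch": "1.9.0+cu111",
    "torchvision": "0.10.0+cu111",
    "torchtext": "0.10.0",
}


def find_mismatches(installed, expected=EXPECTED):
    """Return {package: (found, wanted)} for every package that deviates from the pins."""
    return {
        name: (installed.get(name), wanted)
        for name, wanted in expected.items()
        if installed.get(name) != wanted
    }


def read_installed(names):
    """Map each package name to its installed version, or None if it is absent."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    mismatches = find_mismatches(read_installed(EXPECTED))
    if not mismatches:
        print("all pinned packages match")
    for name, (found, wanted) in mismatches.items():
        print(f"{name}: found {found}, expected {wanted}")
```

If any package is reported, re-running the pinned install command from the second step would be the first thing to try.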