AttributeError: module 'torch.distributed' has no attribute '_all_gather_base'
Posted by AI浩
Problem description
After installing apex, importing it fails with the following error:
File "/home/shuyuan/anaconda3/envs/shuyuan/lib/python3.8/site-packages/apex/transformer/pipeline_parallel/schedules/__init__.py", line 3, in <module>
from apex.transformer.pipeline_parallel.schedules.fwd_bwd_no_pipelining import (
File "/home/shuyuan/anaconda3/envs/shuyuan/lib/python3.8/site-packages/apex/transformer/pipeline_parallel/schedules/fwd_bwd_no_pipelining.py", line 10, in <module>
from apex.transformer.pipeline_parallel.schedules.common import Batch
File "/home/shuyuan/anaconda3/envs/shuyuan/lib/python3.8/site-packages/apex/transformer/pipeline_parallel/schedules/common.py", line 9, in <module>
from apex.transformer.pipeline_parallel.p2p_communication import FutureTensor
File "/home/shuyuan/anaconda3/envs/shuyuan/lib/python3.8/site-packages/apex/transformer/pipeline_parallel/p2p_communication.py", line 25, in <module>
from apex.transformer.utils import split_tensor_into_1d_equal_chunks
File "/home/shuyuan/anaconda3/envs/shuyuan/lib/python3.8/site-packages/apex/transformer/utils.py", line 11, in <module>
torch.distributed.all_gather_into_tensor = torch.distributed._all_gather_base
AttributeError: module 'torch.distributed' has no attribute '_all_gather_base'
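The traceback shows that neither the public name nor the private fallback exists on torch.distributed in this environment. A minimal sketch for confirming that before editing any files is given here; the helper name missing_attrs is hypothetical and not part of apex or PyTorch.

```python
import importlib


def missing_attrs(module_name, names):
    """Return the names that the module lacks, or None if it cannot be imported."""
    try:
        module = importlib.import_module(module_name)
    except Exception:
        # A missing or broken install: nothing to inspect.
        return None
    return [name for name in names if not hasattr(module, name)]


# The four names involved in the aliasing code that apex runs at import time.
NAMES = [
    "all_gather_into_tensor",
    "_all_gather_base",
    "reduce_scatter_tensor",
    "_reduce_scatter_base",
]

print(missing_attrs("torch.distributed", NAMES))
```

If both all_gather_into_tensor and _all_gather_base show up in the printed list, the environment matches the error above.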
Solution
Comment out the following code:
if "reduce_scatter_tensor" not in dir(torch.distributed):
    torch.distributed.reduce_scatter_tensor = torch.distributed._reduce_scatter_base
if "all_gather_into_tensor" not in dir(torch.distributed):
    torch.distributed.all_gather_into_tensor = torch.distributed._all_gather_base
Files to edit (the traceback above also points at apex/transformer/utils.py, so check that file too):
apex/contrib/optimizers/distributed_fused_lamb.py
apex/transformer/tensor_parallel/layers.py
apex/transformer/tensor_parallel/utils.py
apex/transformer/tensor_parallel/mappings.py
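Commenting the lines out only removes the import-time failure. As an alternative, the aliasing could be guarded so that it is skipped when the private function is missing. The sketch below illustrates that idea on a stand-in object; alias_if_available is a hypothetical helper, not something apex provides, and it is not the fix described in this post.

```python
import types


def alias_if_available(dist, public_name, private_name):
    """Make dist.<public_name> usable if possible.

    Returns True when the public name exists afterwards, False when
    neither the public nor the private name is present.
    """
    if hasattr(dist, public_name):
        return True  # already there, nothing to alias
    private = getattr(dist, private_name, None)
    if private is None:
        return False  # neither name exists: skip instead of raising
    setattr(dist, public_name, private)
    return True


# Stand-in for a torch.distributed that lacks both names (the failing case).
fake_dist = types.SimpleNamespace()
print(alias_if_available(fake_dist, "all_gather_into_tensor", "_all_gather_base"))
```

With the real module, the call would take torch.distributed as its first argument. Either way, code that later calls all_gather_into_tensor still needs a PyTorch build that provides it.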
Next, add an environment variable.
Run vi ~/.bashrc to open the file, then press i to enter insert mode.
Append the following line at the end:
export TORCH_CUDA_ARCH_LIST="8.0" # compute capability of the GPU (8.0 is e.g. A100); building for 8.0 needs CUDA 11.x or later
Then press ESC to leave insert mode, press Shift+; to type :, then type wq and press Enter to save and quit.
Then run:
source ~/.bashrc
to reload the configuration.
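The value 8.0 is only right for GPUs with that compute capability. A small sketch for looking up the value on the current machine is shown here, assuming a working PyTorch install with CUDA support; format_arch and detect_arch are hypothetical helper names.

```python
def format_arch(capability):
    """Turn a (major, minor) compute capability into a TORCH_CUDA_ARCH_LIST entry."""
    major, minor = capability
    return f"{major}.{minor}"


def detect_arch():
    """Return e.g. "8.0" for the first visible GPU, or None if it cannot be queried."""
    try:
        import torch
    except Exception:
        return None  # PyTorch is missing or broken
    if not torch.cuda.is_available():
        return None  # no usable CUDA device
    return format_arch(torch.cuda.get_device_capability(0))


print(format_arch((8, 0)))  # prints 8.0
print(detect_arch())
```

If detect_arch() returns a different value than 8.0, that value is the one to put in TORCH_CUDA_ARCH_LIST.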
Next, install apex.
Go to the apex root directory and run:
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
Wait for the build and installation to finish.
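Once the installation finishes, a quick check is to import the module that failed in the traceback. This is a minimal sketch; try_import is a hypothetical helper, and the module path is taken from the traceback above.

```python
import importlib


def try_import(module_name):
    """Return (True, "") if the module imports, else (False, "<ErrorType>: <message>")."""
    try:
        importlib.import_module(module_name)
    except Exception as exc:
        return False, f"{type(exc).__name__}: {exc}"
    return True, ""


ok, error = try_import("apex.transformer.pipeline_parallel.schedules")
print("import OK" if ok else f"import failed: {error}")
```

If the AttributeError is still reported, one of the files containing the aliasing code has probably not been edited yet.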