Distributed Training: Single-Machine Multi-GPU and Multi-Machine Multi-GPU Setup (PaddlePaddle 2.0+). The spawn Method Is Recommended!
Posted by 汀、
1. Launching multi-GPU parallel training on a single machine
PaddlePaddle 2.0 adds the paddle.distributed.spawn function for launching single-machine multi-GPU training, while the original paddle.distributed.launch approach is still supported.
- paddle.distributed.launch starts multiple processes at the granularity of a program file to run synchronized multi-GPU training. This used to be the method recommended in the AI Studio script-task documentation for launching multi-GPU jobs, but it demands more careful process management.
- paddle.distributed.spawn starts multiple processes at the granularity of a function, which gives finer control over the processes and behaves better for log printing and training shutdown. This is the currently recommended approach.
Both methods are introduced in turn below; as a quick preview of the recommended spawn style, here is a minimal sketch first.
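This sketch follows the data-parallel pattern documented for PaddlePaddle 2.0; the LinearNet model, the random training data, and nprocs=2 are illustrative placeholders rather than anything from the original post.

import paddle
import paddle.nn as nn
import paddle.optimizer as opt
import paddle.distributed as dist

class LinearNet(nn.Layer):
    def __init__(self):
        super(LinearNet, self).__init__()
        self._linear1 = nn.Linear(10, 10)
        self._linear2 = nn.Linear(10, 1)

    def forward(self, x):
        return self._linear2(self._linear1(x))

def train():
    # 1. Initialize the parallel environment inside each spawned process
    dist.init_parallel_env()

    # 2. Wrap the model with DataParallel and create the optimizer
    layer = LinearNet()
    dp_layer = paddle.DataParallel(layer)
    loss_fn = nn.MSELoss()
    adam = opt.Adam(learning_rate=0.001, parameters=dp_layer.parameters())

    # 3. One training step on random data (placeholder for a real loop)
    inputs = paddle.randn([10, 10], 'float32')
    labels = paddle.randn([10, 1], 'float32')
    loss = loss_fn(dp_layer(inputs), labels)
    loss.backward()
    adam.step()
    adam.clear_grad()

if __name__ == '__main__':
    # Start 2 worker processes, each running train() on its own GPU
    dist.spawn(train, nprocs=2)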
1.1 Single-machine multi-GPU launch, method 1: launch
1.1.1 Scenario: using the high-level API
- When training with the paddle.Model high-level API, enabling single-machine multi-GPU training is very simple: the code needs no modification at all; just add -m paddle.distributed.launch to the launch command, as shown below.
# Single-machine single-GPU launch; uses GPU 0 by default
$ python train.py
# Single-machine multi-GPU launch; uses all currently visible GPUs by default
$ python -m paddle.distributed.launch train.py
# Single-machine multi-GPU launch; select GPUs 0 and 1
$ python -m paddle.distributed.launch --selected_gpus='0,1' train.py
# Single-machine multi-GPU launch; restrict visibility to GPUs 0 and 1
$ export CUDA_VISIBLE_DEVICES='0,1'
$ python -m paddle.distributed.launch train.py
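Under the hood, launch tells each worker its placement through environment variables, which you can see in the logs below (e.g. PADDLE_TRAINER_ID, PADDLE_TRAINERS_NUM, FLAGS_selected_gpus). Here is a small sketch, not from the original post, for inspecting them inside a training script:

import os

# These variable names are taken from the launch logs shown later in this post
trainer_id = os.environ.get("PADDLE_TRAINER_ID", "0")      # rank of this process
trainers_num = os.environ.get("PADDLE_TRAINERS_NUM", "1")  # total worker count
gpus = os.environ.get("FLAGS_selected_gpus", "")           # GPU(s) assigned to this process
print(f"worker {trainer_id}/{trainers_num} on GPU(s) {gpus}")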
- Below is an example using the high-level API. Executing the cell below writes a hapitrain.py file to the root directory, which can then be launched with python.
%%writefile hapitrain.py
import paddle
from paddle.vision.transforms import ToTensor
train_dataset = paddle.vision.datasets.MNIST(mode='train', transform=ToTensor())
test_dataset = paddle.vision.datasets.MNIST(mode='test', transform=ToTensor())
lenet = paddle.vision.models.LeNet()
# LeNet inherits from paddle.nn.Layer (it is the network); paddle.Model wraps it with training functionality
model = paddle.Model(lenet)
# Configure the optimizer, loss, and metric needed for training
model.prepare(
paddle.optimizer.Adam(learning_rate=0.001, parameters=model.parameters()),
paddle.nn.CrossEntropyLoss(),
paddle.metric.Accuracy(topk=(1, 2))
)
# Start training
model.fit(train_dataset, epochs=1, batch_size=64, log_freq=400)
# Run evaluation
model.evaluate(test_dataset, log_freq=100, batch_size=64)
Single-machine single-GPU launch; uses GPU 0 by default
# Single-machine single-GPU launch; uses GPU 0 by default
!python hapitrain.py
Begin to download
Download finished
Cache file /home/aistudio/.cache/paddle/dataset/mnist/train-labels-idx1-ubyte.gz not found, downloading https://dataset.bj.bcebos.com/mnist/train-labels-idx1-ubyte.gz
Begin to download
........
Download finished
Cache file /home/aistudio/.cache/paddle/dataset/mnist/t10k-images-idx3-ubyte.gz not found, downloading https://dataset.bj.bcebos.com/mnist/t10k-images-idx3-ubyte.gz
Begin to download
Download finished
Cache file /home/aistudio/.cache/paddle/dataset/mnist/t10k-labels-idx1-ubyte.gz not found, downloading https://dataset.bj.bcebos.com/mnist/t10k-labels-idx1-ubyte.gz
Begin to download
..
Download finished
W0628 15:25:11.488023 114 device_context.cc:362] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.0, Runtime API Version: 10.1
W0628 15:25:11.614305 114 device_context.cc:372] device: 0, cuDNN Version: 7.6.
The loss value printed in the log is the current step, and the metric is the average value of previous step.
Epoch 1/1
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py:89: DeprecationWarning: `np.bool` is a deprecated alias for the builtin `bool`. To silence this warning, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
if isinstance(slot[0], (np.ndarray, np.bool, numbers.Number)):
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/layers/utils.py:77: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated, and in 3.8 it will stop working
return (isinstance(seq, collections.Sequence) and
step 400/938 - loss: 0.0555 - acc_top1: 0.9217 - acc_top2: 0.9649 - 50ms/step
step 800/938 - loss: 0.0300 - acc_top1: 0.9454 - acc_top2: 0.9782 - 39ms/step
step 938/938 - loss: 0.0213 - acc_top1: 0.9498 - acc_top2: 0.9803 - 38ms/step
Eval begin...
The loss value printed in the log is the current batch, and the metric is the average value of previous step.
step 100/157 - loss: 0.0057 - acc_top1: 0.9731 - acc_top2: 0.9927 - 28ms/step
step 157/157 - loss: 0.0013 - acc_top1: 0.9785 - acc_top2: 0.9945 - 28ms/step
Eval samples: 10000
Single-machine multi-GPU launch; uses all currently visible GPUs by default
# Single-machine multi-GPU launch; uses all currently visible GPUs by default
!python -m paddle.distributed.launch hapitrain.py
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/layers/utils.py:26: DeprecationWarning: `np.int` is a deprecated alias for the builtin `int`. To silence this warning, use `int` by itself. Doing this will not modify any behavior and is safe. When replacing `np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If you wish to review your current use, check the release note link for additional information.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
def convert_to_list(value, n, name, dtype=np.int):
----------- Configuration Arguments -----------
gpus: None
heter_worker_num: None
heter_workers:
http_port: None
ips: 127.0.0.1
log_dir: log
nproc_per_node: None
server_num: None
servers:
training_script: hapitrain.py
training_script_args: []
worker_num: None
workers:
------------------------------------------------
WARNING 2021-06-28 15:26:17,473 launch.py:316] Not found distinct arguments and compiled with cuda. Default use collective mode
launch train in GPU mode
INFO 2021-06-28 15:26:17,475 launch_utils.py:471] Local start 1 processes. First process distributed environment info (Only For Debug):
+=======================================================================================+
| Distributed Envs Value |
+---------------------------------------------------------------------------------------+
| PADDLE_TRAINER_ID 0 |
| PADDLE_CURRENT_ENDPOINT 127.0.0.1:35079 |
| PADDLE_TRAINERS_NUM 1 |
| PADDLE_TRAINER_ENDPOINTS 127.0.0.1:35079 |
| FLAGS_selected_gpus 0 |
+=======================================================================================+
INFO 2021-06-28 15:26:17,475 launch_utils.py:475] details abouts PADDLE_TRAINER_ENDPOINTS can be found in log/endpoints.log, and detail running logs maybe found in log/workerlog.0
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/layers/utils.py:26: DeprecationWarning: `np.int` is a deprecated alias for the builtin `int`. To silence this warning, use `int` by itself. Doing this will not modify any behavior and is safe. When replacing `np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If you wish to review your current use, check the release note link for additional information.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
def convert_to_list(value, n, name, dtype=np.int):
W0628 15:26:24.305920 285 device_context.cc:362] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.0, Runtime API Version: 10.1
W0628 15:26:24.311555 285 device_context.cc:372] device: 0, cuDNN Version: 7.6.
The loss value printed in the log is the current step, and the metric is the average value of previous step.
Epoch 1/1
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py:89: DeprecationWarning: `np.bool` is a deprecated alias for the builtin `bool`. To silence this warning, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
if isinstance(slot[0], (np.ndarray, np.bool, numbers.Number)):
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/layers/utils.py:77: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated, and in 3.8 it will stop working
return (isinstance(seq, collections.Sequence) and
step 400/938 - loss: 0.0586 - acc_top1: 0.9130 - acc_top2: 0.9611 - 38ms/step
step 800/938 - loss: 0.0288 - acc_top1: 0.9397 - acc_top2: 0.9759 - 39ms/step
step 938/938 - loss: 0.0545 - acc_top1: 0.9448 - acc_top2: 0.9785 - 40ms/step
Eval begin...
The loss value printed in the log is the current batch, and the metric is the average value of previous step.
step 100/157 - loss: 0.0035 - acc_top1: 0.9677 - acc_top2: 0.9911 - 36ms/step
step 157/157 - loss: 0.0057 - acc_top1: 0.9723 - acc_top2: 0.9929 - 36ms/step
Eval samples: 10000
INFO 2021-06-28 15:27:26,569 launch.py:240] Local processes completed.
Single-machine multi-GPU launch, selecting GPUs 0 and 1. This also runs on a single-GPU AI Studio instance, which shows that launch is fairly fault-tolerant.
# Single-machine multi-GPU launch, selecting GPUs 0 and 1; also runs on a single-GPU AI Studio instance
!CUDA_VISIBLE_DEVICES='0,1' python -m paddle.distributed.launch hapitrain.py
----------- Configuration Arguments -----------
gpus: None
heter_worker_num: None
heter_workers:
http_port: None
ips: 127.0.0.1
log_dir: log
nproc_per_node: None
server_num: None
servers:
training_script: hapitrain.py
training_script_args: []
worker_num: None
workers:
------------------------------------------------
WARNING 2021-06-28 15:28:10,632 launch.py:316] Not found distinct arguments and compiled with cuda. Default use collective mode
launch train in GPU mode
INFO 2021-06-28 15:28:10,637 launch_utils.py:471] Local start 1 processes. First process distributed environment info (Only For Debug):
+=======================================================================================+
| Distributed Envs Value |
+---------------------------------以上是关于分布式训练---单机多卡与多机多卡组网(飞桨paddle2.0+)更加推荐spawn方式!的主要内容,如果未能解决你的问题,请参考以下文章