Distributed training: single-machine multi-GPU and multi-machine multi-GPU setups (PaddlePaddle 2.0+), with spawn as the recommended approach

Posted by 汀、


1. Launching single-machine multi-GPU parallel training

PaddlePaddle 2.0 adds the paddle.distributed.spawn function for launching single-machine multi-GPU training. The existing paddle.distributed.launch method is still supported.

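  • paddle.distributed.launch takes a program file and starts multiple processes at file granularity to run synchronous multi-GPU training. The AI Studio script-task instructions used to recommend this method for multi-GPU jobs. This approach places higher demands on process management.
  • paddle.distributed.spawn starts multiple processes at function granularity to run synchronous multi-GPU training. It gives finer control over the processes and behaves better when printing logs and when training exits. This is the currently recommended usage.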
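To make the function-level style concrete, here is a minimal sketch of a spawn-based entry point. It assumes Paddle 2.0+ is installed with at least `nprocs` visible GPUs. The toy linear model, the random data, and the names `train` and `main` are placeholders chosen for illustration, not the MNIST setup used later in this post.

```python
def train(print_result=False):
    # Imported inside the function so every spawned worker initialises Paddle itself.
    import paddle
    import paddle.distributed as dist

    dist.init_parallel_env()  # set up communication between the workers

    # Toy model and random data (placeholders for illustration only).
    net = paddle.DataParallel(paddle.nn.Linear(10, 1))
    loss_fn = paddle.nn.MSELoss()
    adam = paddle.optimizer.Adam(learning_rate=0.001, parameters=net.parameters())

    inputs = paddle.randn([10, 10], 'float32')
    labels = paddle.randn([10, 1], 'float32')
    loss = loss_fn(net(inputs), labels)
    loss.backward()   # DataParallel synchronises gradients across workers
    adam.step()
    adam.clear_grad()

    # Print from rank 0 only so the log is not repeated once per GPU.
    if print_result and dist.get_rank() == 0:
        print("loss:", loss.numpy())


def main(nprocs=2):
    import paddle.distributed as dist
    # One worker process per GPU, each running train(True).
    dist.spawn(train, args=(True,), nprocs=nprocs)
```

In a script, call `main()` under an `if __name__ == "__main__":` guard. The default `nprocs=2` assumes two visible GPUs and should be adjusted to the machine.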

The two methods are described in turn below.

1.1 Single-machine multi-GPU launch method 1: launch

1.1.1 Using the high-level API

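  • When training is implemented with the paddle.Model high-level API, launching single-machine multi-GPU training is simple: the code needs no changes, and you only add -m paddle.distributed.launch to the launch command.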

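      # Single-machine single-GPU launch; uses GPU 0 by default
      $ python train.py

      # Single-machine multi-GPU launch; uses all currently visible GPUs by default
      $ python -m paddle.distributed.launch train.py

      # Single-machine multi-GPU launch restricted to GPUs 0 and 1
      # (--gpus is the Paddle 2.0+ flag name; 1.x releases used --selected_gpus)
      $ python -m paddle.distributed.launch --gpus='0,1' train.py

      # Single-machine multi-GPU launch restricted to GPUs 0 and 1 via the environment
      $ export CUDA_VISIBLE_DEVICES='0,1'
      $ python -m paddle.distributed.launch train.py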
    
  • Below is example code using the high-level API. Running the cell writes hapitrain.py to the notebook's root directory, after which the training can be launched with python.

%%writefile hapitrain.py 
import paddle 
from paddle.vision.transforms import ToTensor

train_dataset = paddle.vision.datasets.MNIST(mode='train', transform=ToTensor())
test_dataset = paddle.vision.datasets.MNIST(mode='test', transform=ToTensor())
lenet = paddle.vision.models.LeNet()

# LeNet subclasses paddle.nn.Layer (a network); paddle.Model wraps it to add training functionality
model = paddle.Model(lenet)

# Set the optimizer, loss, and metric needed to train the model
model.prepare(
    paddle.optimizer.Adam(learning_rate=0.001, parameters=model.parameters()),
    paddle.nn.CrossEntropyLoss(),
    paddle.metric.Accuracy(topk=(1, 2))
    )

# Start training
model.fit(train_dataset, epochs=1, batch_size=64, log_freq=400)

# Start evaluation
model.evaluate(test_dataset, log_freq=100, batch_size=64)
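
Once hapitrain.py exists, the launch commands from 1.1.1 can also be issued from Python. The sketch below is one way to do that with the standard library. `build_launch` and `run_launch` are hypothetical helper names, not part of Paddle. The GPU selection is passed through the CUDA_VISIBLE_DEVICES environment variable of the child process, which only takes effect if the variable really is in that process's environment.

```python
import os
import subprocess
import sys


def build_launch(script, gpus=None, script_args=()):
    """Return (argv, env) for `python -m paddle.distributed.launch <script>`."""
    env = dict(os.environ)
    if gpus is not None:
        # Restrict the GPUs the launcher (and its workers) can see.
        env["CUDA_VISIBLE_DEVICES"] = ",".join(str(g) for g in gpus)
    argv = [sys.executable, "-m", "paddle.distributed.launch", script, *script_args]
    return argv, env


def run_launch(script, gpus=None, script_args=()):
    """Run the launcher and raise CalledProcessError if training fails."""
    argv, env = build_launch(script, gpus, script_args)
    return subprocess.run(argv, env=env, check=True)
```

For example, `run_launch("hapitrain.py", gpus=[0, 1])` corresponds to the last command form in 1.1.1. Leaving `gpus` as `None` keeps the parent environment unchanged.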

Single-machine single-GPU launch, using GPU 0 by default

# Single-machine single-GPU launch; uses GPU 0 by default
!python hapitrain.py
Begin to download

Download finished
Cache file /home/aistudio/.cache/paddle/dataset/mnist/train-labels-idx1-ubyte.gz not found, downloading https://dataset.bj.bcebos.com/mnist/train-labels-idx1-ubyte.gz 
Begin to download
........
Download finished
Cache file /home/aistudio/.cache/paddle/dataset/mnist/t10k-images-idx3-ubyte.gz not found, downloading https://dataset.bj.bcebos.com/mnist/t10k-images-idx3-ubyte.gz 
Begin to download

Download finished
Cache file /home/aistudio/.cache/paddle/dataset/mnist/t10k-labels-idx1-ubyte.gz not found, downloading https://dataset.bj.bcebos.com/mnist/t10k-labels-idx1-ubyte.gz 
Begin to download
..
Download finished
W0628 15:25:11.488023   114 device_context.cc:362] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.0, Runtime API Version: 10.1
W0628 15:25:11.614305   114 device_context.cc:372] device: 0, cuDNN Version: 7.6.
The loss value printed in the log is the current step, and the metric is the average value of previous step.
Epoch 1/1
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py:89: DeprecationWarning: `np.bool` is a deprecated alias for the builtin `bool`. To silence this warning, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  if isinstance(slot[0], (np.ndarray, np.bool, numbers.Number)):
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/layers/utils.py:77: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated, and in 3.8 it will stop working
  return (isinstance(seq, collections.Sequence) and
step 400/938 - loss: 0.0555 - acc_top1: 0.9217 - acc_top2: 0.9649 - 50ms/step
step 800/938 - loss: 0.0300 - acc_top1: 0.9454 - acc_top2: 0.9782 - 39ms/step
step 938/938 - loss: 0.0213 - acc_top1: 0.9498 - acc_top2: 0.9803 - 38ms/step
Eval begin...
The loss value printed in the log is the current batch, and the metric is the average value of previous step.
step 100/157 - loss: 0.0057 - acc_top1: 0.9731 - acc_top2: 0.9927 - 28ms/step
step 157/157 - loss: 0.0013 - acc_top1: 0.9785 - acc_top2: 0.9945 - 28ms/step
Eval samples: 10000
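
The step counts in this log (938 training steps, 157 evaluation steps) follow from the dataset size and batch_size=64. The short calculation below reproduces them, given that MNIST has 60000 training and 10000 test samples. The multi-GPU line rests on an assumption: that each process receives an equal, rounded-up share of the samples, which is how a distributed batch sampler typically splits data. `steps_per_epoch` is a hypothetical helper name.

```python
import math


def steps_per_epoch(num_samples, batch_size, num_gpus=1):
    # Assumption: every process gets ceil(num_samples / num_gpus) samples,
    # then batches its own share independently.
    per_process = math.ceil(num_samples / num_gpus)
    return math.ceil(per_process / batch_size)


print(steps_per_epoch(60000, 64))      # 938, matches "step 938/938"
print(steps_per_epoch(10000, 64))      # 157, matches "step 157/157"
print(steps_per_epoch(60000, 64, 2))   # 469 per process with two GPUs
```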

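Single-machine multi-GPU launch, using all currently visible GPUs by default

# Single-machine multi-GPU launch; uses all currently visible GPUs by default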
!python -m paddle.distributed.launch hapitrain.py
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/layers/utils.py:26: DeprecationWarning: `np.int` is a deprecated alias for the builtin `int`. To silence this warning, use `int` by itself. Doing this will not modify any behavior and is safe. When replacing `np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If you wish to review your current use, check the release note link for additional information.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  def convert_to_list(value, n, name, dtype=np.int):
-----------  Configuration Arguments -----------
gpus: None
heter_worker_num: None
heter_workers: 
http_port: None
ips: 127.0.0.1
log_dir: log
nproc_per_node: None
server_num: None
servers: 
training_script: hapitrain.py
training_script_args: []
worker_num: None
workers: 
------------------------------------------------
WARNING 2021-06-28 15:26:17,473 launch.py:316] Not found distinct arguments and compiled with cuda. Default use collective mode
launch train in GPU mode
INFO 2021-06-28 15:26:17,475 launch_utils.py:471] Local start 1 processes. First process distributed environment info (Only For Debug): 
    +=======================================================================================+
    |                        Distributed Envs                      Value                    |
    +---------------------------------------------------------------------------------------+
    |                       PADDLE_TRAINER_ID                        0                      |
    |                 PADDLE_CURRENT_ENDPOINT                 127.0.0.1:35079               |
    |                     PADDLE_TRAINERS_NUM                        1                      |
    |                PADDLE_TRAINER_ENDPOINTS                 127.0.0.1:35079               |
    |                     FLAGS_selected_gpus                        0                      |
    +=======================================================================================+

INFO 2021-06-28 15:26:17,475 launch_utils.py:475] details abouts PADDLE_TRAINER_ENDPOINTS can be found in log/endpoints.log, and detail running logs maybe found in log/workerlog.0
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/layers/utils.py:26: DeprecationWarning: `np.int` is a deprecated alias for the builtin `int`. To silence this warning, use `int` by itself. Doing this will not modify any behavior and is safe. When replacing `np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If you wish to review your current use, check the release note link for additional information.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  def convert_to_list(value, n, name, dtype=np.int):
W0628 15:26:24.305920   285 device_context.cc:362] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.0, Runtime API Version: 10.1
W0628 15:26:24.311555   285 device_context.cc:372] device: 0, cuDNN Version: 7.6.
The loss value printed in the log is the current step, and the metric is the average value of previous step.
Epoch 1/1
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py:89: DeprecationWarning: `np.bool` is a deprecated alias for the builtin `bool`. To silence this warning, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  if isinstance(slot[0], (np.ndarray, np.bool, numbers.Number)):
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/layers/utils.py:77: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated, and in 3.8 it will stop working
  return (isinstance(seq, collections.Sequence) and
step 400/938 - loss: 0.0586 - acc_top1: 0.9130 - acc_top2: 0.9611 - 38ms/step
step 800/938 - loss: 0.0288 - acc_top1: 0.9397 - acc_top2: 0.9759 - 39ms/step
step 938/938 - loss: 0.0545 - acc_top1: 0.9448 - acc_top2: 0.9785 - 40ms/step
Eval begin...
The loss value printed in the log is the current batch, and the metric is the average value of previous step.
step 100/157 - loss: 0.0035 - acc_top1: 0.9677 - acc_top2: 0.9911 - 36ms/step
step 157/157 - loss: 0.0057 - acc_top1: 0.9723 - acc_top2: 0.9929 - 36ms/step
Eval samples: 10000
INFO 2021-06-28 15:27:26,569 launch.py:240] Local processes completed.
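
The "Distributed Envs" table in the log lists the values launch hands to each worker process. A training script can read them to find its rank and the number of workers, as sketched below. This assumes the names in the table are exposed to the worker as environment variables. `read_dist_env` is a hypothetical helper. In Paddle code itself, `paddle.distributed.get_rank()` and `paddle.distributed.get_world_size()` are the usual accessors.

```python
import os


def read_dist_env(environ=None):
    """Collect the launch-provided settings, with single-process defaults."""
    env = os.environ if environ is None else environ
    endpoints = [e for e in env.get("PADDLE_TRAINER_ENDPOINTS", "").split(",") if e]
    return {
        "rank": int(env.get("PADDLE_TRAINER_ID", "0")),
        "world_size": int(env.get("PADDLE_TRAINERS_NUM", "1")),
        "current_endpoint": env.get("PADDLE_CURRENT_ENDPOINT", ""),
        "endpoints": endpoints,
        "selected_gpus": env.get("FLAGS_selected_gpus", ""),
    }


# Values copied from the log above: one process, so rank 0 of 1.
info = read_dist_env({
    "PADDLE_TRAINER_ID": "0",
    "PADDLE_CURRENT_ENDPOINT": "127.0.0.1:35079",
    "PADDLE_TRAINERS_NUM": "1",
    "PADDLE_TRAINER_ENDPOINTS": "127.0.0.1:35079",
    "FLAGS_selected_gpus": "0",
})
print(info["rank"], info["world_size"])
```

With "Local start 1 processes" in the log, the run above was effectively single-process, which is why it still shows 938 steps per epoch.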

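Single-machine multi-GPU launch restricted to GPUs 0 and 1. This also runs on a single-GPU AI Studio instance, which shows that launch is fairly tolerant.

# Single-machine multi-GPU launch restricted to GPUs 0 and 1; also runs on a single-GPU AI Studio instance
!CUDA_VISIBLE_DEVICES='0,1' python -m paddle.distributed.launch hapitrain.py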
-----------  Configuration Arguments -----------
gpus: None
heter_worker_num: None
heter_workers: 
http_port: None
ips: 127.0.0.1
log_dir: log
nproc_per_node: None
server_num: None
servers: 
training_script: hapitrain.py
training_script_args: []
worker_num: None
workers: 
------------------------------------------------
WARNING 2021-06-28 15:28:10,632 launch.py:316] Not found distinct arguments and compiled with cuda. Default use collective mode
launch train in GPU mode
INFO 2021-06-28 15:28:10,637 launch_utils.py:471] Local start 1 processes. First process distributed environment info (Only For Debug): 
    +=======================================================================================+
    |                        Distributed Envs                      Value                    |
    +---------------------------------
