How does torch.distributed.barrier() work?

Question

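I have read all the documentation I could find about torch.distributed.barrier(), but I still have trouble understanding how it is used in this script, and I would really appreciate some help.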

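The official doc of torch.distributed.barrier says it "Synchronizes all processes. This collective blocks processes until the whole group enters this function, if async_op is False, or if async work handle is called on wait()."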
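To make the quoted sentence concrete, here is a minimal sketch of the blocking behavior as I understand it. It is not PyTorch code: it uses threading.Barrier from the Python standard library as a stand-in for torch.distributed.barrier(), with threads playing the role of processes. The names run_workers and worker and the sleep delays are hypothetical and do not come from the script.

```python
import threading
import time


def run_workers(world_size=3):
    """Start `world_size` threads that arrive at a barrier at different times."""
    # Stand-in for the process group: the barrier opens only after
    # `world_size` callers have reached wait().
    barrier = threading.Barrier(world_size)
    arrived = []   # ranks, in the order they reached the barrier
    released = []  # (rank, how many ranks had arrived when it was released)
    lock = threading.Lock()

    def worker(rank):
        time.sleep(0.05 * rank)  # higher ranks reach the barrier later
        with lock:
            arrived.append(rank)
        barrier.wait()  # blocks until every thread has called wait()
        with lock:
            released.append((rank, len(arrived)))

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(world_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return arrived, released


if __name__ == "__main__":
    arrived, released = run_workers()
    print("arrival order:", arrived)
    print("released after seeing:", released)
```

In this sketch no thread gets past wait() until all of them have reached it, so every entry in released records the full group size, no matter how early that thread arrived.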

It is used in two places in the script:

First place

    if args.local_rank not in [-1, 0] and not evaluate:
        torch.distributed.barrier()  # Make sure only the first process in distributed training process the dataset, and the others will use the cache

        ... (preprocesses the data and save the preprocessed data)

    if args.local_rank == 0 and not evaluate:
        torch.distributed.barrier()

Second place

    if args.local_rank not in [-1, 0]:
        torch.distributed.barrier()  # Make sure only the first process in distributed training will download model & vocab

        ... (loads the model and the vocabulary)

    if args.local_rank == 0:
        torch.distributed.barrier()  # Make sure only the first process in distributed training will download model & vocab

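I'm having trouble relating the comments in the code to what the official doc says this function does. How does it make sure that only the first process executes the code between the two calls to torch.distributed.barrier(), and why does it check whether the local rank is 0 only before the second call?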

Thanks in advance!