Why do I have to call MPI.Finalize() inside the destructor?


【Posted】: 2022-01-17 01:25:25 【Question】:

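I am currently trying to understand mpi4py. I set mpi4py.rc.initialize = False and mpi4py.rc.finalize = False because I can't see why we would want initialization and finalization to happen automatically. The default behavior is that MPI.Init() gets called when MPI is imported. I think the reason is that one instance of the Python interpreter runs per rank and each instance runs the whole script, but that is just a guess. In the end, I like to have it explicit.

Now this introduced some problems. I have this code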

import numpy as np
import mpi4py
mpi4py.rc.initialize = False  # do not initialize MPI automatically
mpi4py.rc.finalize = False # do not finalize MPI automatically

from mpi4py import MPI # import the 'MPI' module
import h5py

class DataGenerator:
    def __init__(self, filename, N, comm):
        self.comm = comm
        self.file = h5py.File(filename, 'w', driver="mpio", comm=comm)

        # Create datasets
        self.data_ds= self.file.create_dataset("indices", (N,1), dtype='i')

    def __del__(self):
        self.file.close()
        

if __name__=='__main__':
    MPI.Init()
    world = MPI.COMM_WORLD
    world_rank = MPI.COMM_WORLD.rank

    filename = "test.hdf5"
    N = 10
    data_gen = DataGenerator(filename, N, comm=world)

    MPI.Finalize()

which results in

$ mpiexec -n 4 python test.py 
*** The MPI_Barrier() function was called after MPI_FINALIZE was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.
[eu-login-04:01559] Local abort after MPI_FINALIZE started completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** The MPI_Barrier() function was called after MPI_FINALIZE was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.
[eu-login-04:01560] Local abort after MPI_FINALIZE started completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
*** The MPI_Barrier() function was called after MPI_FINALIZE was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.
[eu-login-04:01557] Local abort after MPI_FINALIZE started completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
--------------------------------------------------------------------------
mpiexec detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:

  Process name: [[15050,1],3]   Exit code:    1
--------------------------------------------------------------------------

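I'm a little confused about what is going on here. If I move MPI.Finalize() to the end of the destructor, it works fine.

Note that I also use h5py, which uses MPI for parallelization, so I have parallel file I/O here. Note that h5py needs to be compiled with MPI support. You can easily do that by setting up a virtual environment and running pip install --no-binary=h5py h5py.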

【Comments】:

【Answer 1】:

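The way you wrote it, data_gen stays alive until the main block is done; since it is a module-level variable, that means until the interpreter shuts down. But you call MPI.Finalize inside that block, so the destructor runs after finalize. The h5py.File.close method seems to call MPI.Comm.Barrier internally, and calling that after finalize is forbidden.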
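The ordering can be reproduced without MPI. The following is only a minimal sketch: FakeRuntime and Resource are hypothetical stand-ins (not part of mpi4py or h5py) for the MPI runtime and the file object, and the timing of __del__ shown here relies on CPython's reference counting.

```python
events = []  # records the order in which things happen


class FakeRuntime:
    """Hypothetical stand-in for the MPI runtime (not an mpi4py class)."""

    def __init__(self):
        self.finalized = False

    def finalize(self):
        self.finalized = True
        events.append("finalize")

    def barrier(self):
        # Real MPI aborts the job here; the sketch only records the mistake.
        if self.finalized:
            events.append("barrier after finalize (error)")
        else:
            events.append("barrier")


class Resource:
    """Hypothetical stand-in for an object whose cleanup needs the runtime."""

    def __init__(self, runtime):
        self.runtime = runtime

    def __del__(self):
        self.runtime.barrier()


def main():
    runtime = FakeRuntime()
    res = Resource(runtime)
    runtime.finalize()
    # 'res' is still referenced at this point, so its destructor
    # only runs once main() returns, which is after finalize().


main()
print(events)
```

In the question's script the object is a module-level variable rather than a local, so the destructor runs even later, but the effect is the same: cleanup happens after finalize.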

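If you want explicit control, make sure all objects are destroyed before you call MPI.Finalize. Of course, even that may not be enough if some objects are only destroyed by the garbage collector rather than by reference counting.

To avoid this, use a context manager instead of a destructor.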
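The difference between the two destruction paths can be seen in a small experiment. This is a sketch under the assumption of CPython semantics; the Tracked class is hypothetical and only logs when its destructor runs. Automatic collection is switched off so the result does not depend on when the collector happens to run.

```python
import gc

log = []  # names of objects whose __del__ has run


class Tracked:
    """Hypothetical helper that records its own destruction."""

    def __init__(self, name):
        self.name = name
        self.partner = None

    def __del__(self):
        log.append(self.name)


gc.disable()  # keep the cyclic collector from running on its own during the demo

# Case 1: no reference cycle, so 'del' drops the last reference
# and the destructor runs immediately.
plain = Tracked("plain")
del plain
after_del_plain = list(log)

# Case 2: the object references itself, so 'del' alone does not destroy it.
cyclic = Tracked("cyclic")
cyclic.partner = cyclic
del cyclic
after_del_cyclic = list(log)

# Only an explicit collection breaks the cycle and runs the destructor.
gc.collect()
after_collect = list(log)

gc.enable()
print(after_del_plain, after_del_cyclic, after_collect)
```

Applied to the question, this suggests that a del data_gen before MPI.Finalize() would be enough only as long as nothing else holds a reference to the object and it is not part of a reference cycle.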

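import mpi4py
mpi4py.rc.initialize = False  # do not initialize MPI automatically
mpi4py.rc.finalize = False  # do not finalize MPI automatically

from mpi4py import MPI
import h5py

class DataGenerator:
    def __init__(self, filename, N, comm):
        self.comm = comm
        self.file = h5py.File(filename, 'w', driver="mpio", comm=comm)

        # Create datasets
        self.data_ds = self.file.create_dataset("indices", (N,1), dtype='i')

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Close the file when the with block ends, while MPI is still initialized
        self.file.close()


if __name__=='__main__':
    MPI.Init()
    world = MPI.COMM_WORLD
    world_rank = MPI.COMM_WORLD.rank

    filename = "test.hdf5"
    N = 10
    with DataGenerator(filename, N, comm=world) as data_gen:
        pass
    # The file is already closed here, so finalizing is safe
    MPI.Finalize()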
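The same idea can be taken one step further by putting initialization and finalization into a context manager of their own, so that the nesting of the with blocks fixes the order even when an exception is raised. The sketch below is an illustration rather than part of the answer: runtime_session and open_resource are hypothetical names, and they only log what happens. In real code, the places marked in the comments would presumably call MPI.Init(), MPI.Finalize(), and the file's close().

```python
from contextlib import contextmanager

order = []  # records the order of setup and teardown steps


@contextmanager
def runtime_session():
    """Hypothetical wrapper; real code would call MPI.Init() / MPI.Finalize()."""
    order.append("init")  # MPI.Init() would go here
    try:
        yield
    finally:
        order.append("finalize")  # MPI.Finalize() would go here


@contextmanager
def open_resource(name):
    """Hypothetical stand-in for something like DataGenerator."""
    order.append(f"open {name}")
    try:
        yield name
    finally:
        order.append(f"close {name}")  # file.close() would go here


with runtime_session():
    with open_resource("test.hdf5"):
        order.append("work")

print(order)
```

Because the inner block always exits before the outer one, the resource is closed before the runtime is finalized, without relying on when destructors run.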

【Comments】:

Ah, now it makes sense. Thanks for pointing out the context manager.
