Can this be the most atomic "cancel if not received" in multithreaded MPI_Irecv?


Posted: 2021-12-29 10:55:58

Question:

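The question arises in a multithreaded setting in which several (e.g. 5) threads do their work after each of them has started listening with MPI_Irecv, using MPI_ANY_SOURCE as the source. Before leaving the function, each thread should check whether a message was received and, if not, cancel the request to free the memory.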

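Assume that a message arrives at only one of the N (e.g. 5) threads. The problem is what happens if a message does arrive between (1) checking whether a message has arrived and (2) cancelling the request because that check returned false.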

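As a side note, a single receiver that writes to an atomically accessed queue should solve this problem. But that would mean a major code refactoring and possibly a drop in performance.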
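The single-receiver idea mentioned in the side note can be sketched without any MPI calls. This is only a minimal illustration: `Request` and `RequestQueue` are hypothetical names that do not appear in the original program, and the assumption is that one dedicated thread performs every receive and calls `push`, while the worker threads only call `try_pop`.

```cpp
#include <mutex>
#include <queue>

// Hypothetical record for one received request (not from the original code).
struct Request {
    int source;    // rank that sent the request
    double value;  // payload that was received
};

// Queue shared between the single receiver thread and the worker threads.
// Every access takes the same mutex, so a request is handed to exactly one worker.
class RequestQueue {
public:
    // Called by the receiver thread after each completed receive.
    void push(const Request& request) {
        std::lock_guard<std::mutex> lock(mutex_);
        queue_.push(request);
    }

    // Called by workers. Returns false when nothing is pending,
    // otherwise removes the oldest request and copies it into *out.
    bool try_pop(Request* out) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (queue_.empty()) {
            return false;
        }
        *out = queue_.front();
        queue_.pop();
        return true;
    }

private:
    std::mutex mutex_;
    std::queue<Request> queue_;
};
```

With this layout no worker ever posts or cancels a receive, so the window between testing and cancelling does not exist; the cost is the refactoring and the extra thread that the side note mentions.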

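The question is whether the MPI standard provides an answer to this problem, and what that answer is, or whether the following (pseudo) code is in fact sufficient protection.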

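The proposed solution looks dubious, because the logs (see below) only ever show the combination "the irecv did not catch a message + the associated request could not be cancelled". It seems as if there is no memory.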

main.cpp

//...
MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
if (provided < MPI_THREAD_MULTIPLE) 
    error_report("[error] The MPI did not provide the requested threading behaviour.");

//...

In the function in question:

// Start receiving
MPI_Irecv(&buffer, 1, MPI_DOUBLE,
                      MPI_ANY_SOURCE,
                      VERTEXVAL_REQUEST_FLAG,
                      MPI_COMM_WORLD,
                      &R);

// some work goes on here ... 

// Before exiting, we check if a message arrived. 

int flag1=-437, flag2=-437; // any initialization

MPI_Status status1, status2;
status2.MPI_ERROR = -999; // again, any initialization
status1.MPI_ERROR = -999;
MPI_Test(&R, &flag1, &status1);

if (flag1 != 1) {
    MPI_Cancel(&R);
    // Note: status2 has not been filled in by a completion call
    // (MPI_Wait or MPI_Test) at this point.
    MPI_Test_cancelled(&status2, &flag2);
}

if ((flag1 == 1) || ((flag1 != 1) && (flag2 != 1))) {

    if (flag1 == 1) {
        build_answer(answer, REF, buffer, status1.MPI_SOURCE, MYPROC);
        printf("A request failed to be cancelled, we are assuming we recieved it! we computed val = %f, recieved buffer = %f ; flags12 = %d %d ;  source = %d ; tag = %d; error = %d\n",
           answer, buffer, flag1, flag2, status1.MPI_SOURCE, status1.MPI_TAG, status1.MPI_ERROR);
        std::cout << std::flush;

        // Reply to the sender; the received value is used as the reply tag.
        MPI_Ssend(&answer, 1, MPI_DOUBLE, status1.MPI_SOURCE, (int) buffer, MPI_COMM_WORLD);

        printf("Completed!\n");
        std::cout << std::flush;

    } else {
        printf("A request failed to be cancelled: will ignore it. Recieved buffer = %f ; flags12 = %d %d ;  source = %d ; tag = %d ; status error = %d\n",
           buffer, flag1, flag2, status2.MPI_SOURCE, status2.MPI_TAG, status2.MPI_ERROR);
        std::cout << std::flush;
    }
}

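This "protection" seems to resolve the roughly one-in-a-thousand deadlock that the program used to hit, because the previous version simply assumed that a failed cancellation meant the message had arrived. In particular, the log entries show the following values printed by printf.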

A request failed to be cancelled: will ignore it. Recieved buffer = 0.000000 ; flags12 = 0 22020 ;  source = 2 ; tag = 0 ; status error = -183549351
A request failed to be cancelled: will ignore it. Recieved buffer = 0.000000 ; flags12 = 0 0 ;  source = 1 ; tag = 25001 ; status error = 0
A request failed to be cancelled: will ignore it. Recieved buffer = 0.000000 ; flags12 = 0 0 ;  source = 1 ; tag = 25001 ; status error = 0
A request failed to be cancelled: will ignore it. Recieved buffer = -0.000000 ; flags12 = 0 21998 ;  source = 2 ; tag = 0 ; status error = -1563532711
A request failed to be cancelled: will ignore it. Recieved buffer = 16.000000 ; flags12 = 0 0 ;  source = 0 ; tag = 25001 ; status error = 0
A request failed to be cancelled: will ignore it. Recieved buffer = 16.000000 ; flags12 = 0 0 ;  source = 0 ; tag = 25001 ; status error = 0
A request failed to be cancelled: will ignore it. Recieved buffer = 0.000000 ; flags12 = 0 0 ;  source = 0 ; tag = 25001 ; status error = 0
A request failed to be cancelled: will ignore it. Recieved buffer = 0.000000 ; flags12 = 0 22033 ;  source = 2 ; tag = 0 ; status error = -691551655
A request failed to be cancelled: will ignore it. Recieved buffer = 0.000000 ; flags12 = 0 0 ;  source = 0 ; tag = 25001 ; status error = 0
A request failed to be cancelled: will ignore it. Recieved buffer = 16.000000 ; flags12 = 0 0 ;  source = 1 ; tag = 25001 ; status error = 0
A request failed to be cancelled: will ignore it. Recieved buffer = 8.000000 ; flags12 = 0 0 ;  source = 0 ; tag = 25001 ; status error = 0
A request failed to be cancelled: will ignore it. Recieved buffer = 16.000000 ; flags12 = 0 0 ;  source = 1 ; tag = 25001 ; status error = 0
A request failed to be cancelled: will ignore it. Recieved buffer = 0.000000 ; flags12 = 0 0 ;  source = 1 ; tag = 25001 ; status error = 0
A request failed to be cancelled: will ignore it. Recieved buffer = 8.000000 ; flags12 = 0 0 ;  source = 0 ; tag = 25001 ; status error = 0
A request failed to be cancelled: will ignore it. Recieved buffer = 0.000000 ; flags12 = 0 0 ;  source = 1 ; tag = 25001 ; status error = 0
A request failed to be cancelled: will ignore it. Recieved buffer = 0.000000 ; flags12 = 0 0 ;  source = 1 ; tag = 25001 ; status error = 0
A request failed to be cancelled: will ignore it. Recieved buffer = -0.000000 ; flags12 = 0 21998 ;  source = 2 ; tag = 0 ; status error = -1563532711
A request failed to be cancelled: will ignore it. Recieved buffer = 0.000000 ; flags12 = 0 22033 ;  source = 2 ; tag = 0 ; status error = -691551655

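Comments:

Simply use MPI_Iprobe() to check whether a message is available, without posting a receive first.

@GillesGouaillardet That inherits a similar problem: (1) several threads see an available message with MPI_Iprobe(), (2) some of them try to receive it with MPI_Irecv(), (3) those that get flag=0 after calling MPI_Test() should cancel it. But the question is: can they know that they are not in the middle of receiving it? Should they try testing it N times?

You are overcomplicating things. Just start one message receive and protect it, together with its resulting state, with a mutex. The congestion and complexity you are creating far exceed any congestion the mutex would cause. Anything else needs to be justified by good profiling results.

Look at MPI_Mprobe and MPI_Mrecv; they are exactly right for your multithreaded scenario. There is no need to cancel a receive.

@VictorEijkhout What you mention is exactly the answer for this case. You are welcome to post it if you like. A concise source from OpenMPI is: slideshare.net/jsquyres/mpimprobe-is-good-for-you

Answer 1: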

Look at MPI_Mprobe and MPI_Mrecv; they are exactly right for your multithreaded scenario. There is no need to cancel a receive. For details see https://www.slideshare.net/jsquyres/mpimprobe-is-good-for-you
