Can this be the most atomic "cancel if not received" in multithreaded MPI_Irecv?
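【Title】: Can this be the most Atomic "cancel if not received" in multithreaded MPI_Irecv
【Posted】: 2021-12-29 10:55:58
【Question】:
The problem at hand is embedded in a multithreaded setting where several (e.g. 5) threads do their work after each of them has started listening with an MPI_Irecv that uses MPI_ANY_SOURCE as its source. Before exiting the function, each thread should check whether a message was received, and otherwise cancel the request to free the memory.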
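It is assumed here that the message arrives at only one of the N (e.g. 5) threads. The problem is this: what if a message does arrive in between (1) checking whether the message has arrived and (2) cancelling the request because that check returned false?
As a side note, using a single receiver that writes to an atomically accessed queue should solve this issue. But that implies major code refactoring and possibly lower performance.
The question is whether the MPI standard provides an answer to this problem, and what that answer is, or whether the (pseudo)code below is indeed protective enough.
The proposed solution seems suspicious, because the logs (see below) only ever show the combination "message not caught by the irecv + failure to cancel the associated request". As if there were no memory.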
In main.cpp:

    //...
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE)
        error_report("[error] The MPI did not provide the requested threading behaviour.");
    //...

In the relevant function:

    // Start receiving
    MPI_Irecv(&buffer, 1, MPI_DOUBLE,
              MPI_ANY_SOURCE,
              VERTEXVAL_REQUEST_FLAG,
              MPI_COMM_WORLD,
              &R);
    // some work goes on here ...
    // Before exiting, we check if a message arrived.
    int flag1 = -437, flag2 = -437; // any initialization
    MPI_Status status1, status2;
    status2.MPI_ERROR = -999; // again, any initialization
    status1.MPI_ERROR = -999;
    MPI_Test(&R, &flag1, &status1);
    if (flag1 != 1)
    {
        MPI_Cancel(&R);
        MPI_Test_cancelled(&status2, &flag2);
    }
    if ((flag1 == 1) || ((flag1 != 1) && (flag2 != 1)))
    {
        if (flag1 == 1)
        {
            build_answer(answer, REF, buffer, status1.MPI_SOURCE, MYPROC);
            printf("A request failed to be cancelled, we are assuming we recieved it! we computed val = %f, recieved buffer = %f ; flags12 = %d %d ; source = %d ; tag = %d; error = %d\n",
                   answer, buffer, flag1, flag2, status1.MPI_SOURCE, status1.MPI_TAG, status1.MPI_ERROR);
            std::cout << std::flush;
            MPI_Ssend(&answer, 1, MPI_DOUBLE, status1.MPI_SOURCE, (int) buffer, MPI_COMM_WORLD);
            printf("Completed!\n");
            std::cout << std::flush;
        }
        else
        {
            printf("A request failed to be cancelled: will ignore it. Recieved buffer = %f ; flags12 = %d %d ; source = %d ; tag = %d ; status error = %d\n",
                   buffer, flag1, flag2, status2.MPI_SOURCE, status2.MPI_TAG, status2.MPI_ERROR);
            std::cout << std::flush;
        }
    }

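This "protection" seems to have resolved a deadlock that used to occur in roughly one run out of a thousand, because the previous version simply assumed that a failed cancellation meant the message had arrived. In particular, the log entries show the following values printed via printf.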

    A request failed to be cancelled: will ignore it. Recieved buffer = 0.000000 ; flags12 = 0 22020 ; source = 2 ; tag = 0 ; status error = -183549351
    A request failed to be cancelled: will ignore it. Recieved buffer = 0.000000 ; flags12 = 0 0 ; source = 1 ; tag = 25001 ; status error = 0
    A request failed to be cancelled: will ignore it. Recieved buffer = 0.000000 ; flags12 = 0 0 ; source = 1 ; tag = 25001 ; status error = 0
    A request failed to be cancelled: will ignore it. Recieved buffer = -0.000000 ; flags12 = 0 21998 ; source = 2 ; tag = 0 ; status error = -1563532711
    A request failed to be cancelled: will ignore it. Recieved buffer = 16.000000 ; flags12 = 0 0 ; source = 0 ; tag = 25001 ; status error = 0
    A request failed to be cancelled: will ignore it. Recieved buffer = 16.000000 ; flags12 = 0 0 ; source = 0 ; tag = 25001 ; status error = 0
    A request failed to be cancelled: will ignore it. Recieved buffer = 0.000000 ; flags12 = 0 0 ; source = 0 ; tag = 25001 ; status error = 0
    A request failed to be cancelled: will ignore it. Recieved buffer = 0.000000 ; flags12 = 0 22033 ; source = 2 ; tag = 0 ; status error = -691551655
    A request failed to be cancelled: will ignore it. Recieved buffer = 0.000000 ; flags12 = 0 0 ; source = 0 ; tag = 25001 ; status error = 0
    A request failed to be cancelled: will ignore it. Recieved buffer = 16.000000 ; flags12 = 0 0 ; source = 1 ; tag = 25001 ; status error = 0
    A request failed to be cancelled: will ignore it. Recieved buffer = 8.000000 ; flags12 = 0 0 ; source = 0 ; tag = 25001 ; status error = 0
    A request failed to be cancelled: will ignore it. Recieved buffer = 16.000000 ; flags12 = 0 0 ; source = 1 ; tag = 25001 ; status error = 0
    A request failed to be cancelled: will ignore it. Recieved buffer = 0.000000 ; flags12 = 0 0 ; source = 1 ; tag = 25001 ; status error = 0
    A request failed to be cancelled: will ignore it. Recieved buffer = 8.000000 ; flags12 = 0 0 ; source = 0 ; tag = 25001 ; status error = 0
    A request failed to be cancelled: will ignore it. Recieved buffer = 0.000000 ; flags12 = 0 0 ; source = 1 ; tag = 25001 ; status error = 0
    A request failed to be cancelled: will ignore it. Recieved buffer = 0.000000 ; flags12 = 0 0 ; source = 1 ; tag = 25001 ; status error = 0
    A request failed to be cancelled: will ignore it. Recieved buffer = -0.000000 ; flags12 = 0 21998 ; source = 2 ; tag = 0 ; status error = -1563532711
    A request failed to be cancelled: will ignore it. Recieved buffer = 0.000000 ; flags12 = 0 22033 ; source = 2 ; tag = 0 ; status error = -691551655

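【Comments】:
Just use MPI_Iprobe() to check whether a message is available, without posting a receive beforehand.
@GillesGouaillardet That would inherit a similar problem: (1) several threads see the available message with MPI_Iprobe(), (2) some of them try to receive it with MPI_Irecv(), (3) those that get flag=0 after calling MPI_Test() should cancel it, but the question is: can they know that they are not in the middle of receiving it? Should they try testing it N times?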
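You are overcomplicating things. Just start a single message receive and protect it, together with its resulting status, with a mutex. The congestion and complexity you are creating far exceed any congestion the mutex would cause. Anything else would need good profiling results to justify it.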
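The single-receiver idea from the question's side note, combined with the mutex suggested in this comment, could look roughly like the sketch below. It is only an illustration under stated assumptions: the MPI receive itself is left out, and Message, MessageQueue and drain_with_workers are hypothetical names. One thread would own the receive and push each arrived message into the queue, while the worker threads only pop from it.

```cpp
#include <cassert>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical message record; in the MPI setting the single receiver would
// fill it from its receive buffer and MPI_Status.
struct Message {
    double value;
    int source;
};

// Queue guarded by one mutex: the receiver pushes, workers pop.
class MessageQueue {
public:
    void push(const Message& m) {
        std::lock_guard<std::mutex> lock(mutex_);
        queue_.push(m);
    }
    // Returns false when no message is pending; never blocks on an empty queue.
    bool try_pop(Message& out) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (queue_.empty()) return false;
        out = queue_.front();
        queue_.pop();
        return true;
    }
private:
    std::mutex mutex_;
    std::queue<Message> queue_;
};

// Drains the queue from `workers` threads and returns how many messages were
// handled in total; each message is handed to exactly one worker.
int drain_with_workers(MessageQueue& q, int workers) {
    assert(workers > 0);
    std::vector<int> handled(workers, 0);
    std::vector<std::thread> pool;
    for (int i = 0; i < workers; ++i) {
        pool.emplace_back([&q, &handled, i] {
            Message m;
            while (q.try_pop(m)) ++handled[i];  // stand-in for real work
        });
    }
    for (auto& t : pool) t.join();
    int total = 0;
    for (int h : handled) total += h;
    return total;
}
```

Since only one thread ever posts a receive in this design, no worker has a pending request to cancel when it leaves the function.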
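Take a look at MPI_Mprobe and MPI_Mrecv, which are intended for exactly your multithreaded scenario. There is no need to cancel receives.
@VictorEijkhout What you mention is exactly the answer for this case. You are welcome to post it if you wish. A neat source from OpenMPI is the following: slideshare.net/jsquyres/mpimprobe-is-good-for-you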
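【Answer 1】:
Take a look at MPI_Mprobe and MPI_Mrecv, which are intended for exactly your multithreaded scenario. There is no need to cancel receives. For details, see https://www.slideshare.net/jsquyres/mpimprobe-is-good-for-you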