MPI segmentation fault only on multiple nodes
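I'm currently building the base for a control program that will run across multiple Raspberry Pis and use all the available cores on each Pi. When I test the code on a single node using all of its cores it works fine, but using multiple nodes gives me a segmentation fault.
I've looked at all the similar questions asked in the past, but they all involved problems that would also break my code on a single node.
Full code: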
#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <stdbool.h>
#include <time.h>

int main(int argc, char *argv[])
{
    FILE *input;
    char batLine[86]; //may need to be made larger if bat commands get longer
    char sentbatch[86];
    int currentTask;
    int numTasks, rank, rc, i;
    MPI_Status stat;
    bool exitFlag = false;

    //mpi stuff
    MPI_Init(&argc,&argv); //initialize MPI environment
    MPI_Comm_size(MPI_COMM_WORLD, &numTasks);
    MPI_Comm_rank(MPI_COMM_WORLD,&rank);
    //printf("Number of tasks: %d \n", numTasks);
    //printf ("MPI task %d has started...\n", rank);

    if(argc != 2)
    {
        printf("Usage: batallocation *.bat");
        exit(1); //exit with 1 indicates a failure
    }

    //contains file name: argv[1]
    input = fopen(argv[1],"r");
    currentTask = 0;

    if (rank ==0)
    {
        while(1)
        {
            if(exitFlag)
                break; //allows to break out of while and for when no more lines exist
            char command[89] = "./";
            for(i=0; i < 4; i++) //will need to be 16 for full testing
            {
                //fgets needs to be character count of longest line + 2 or it fails
                if(fgets(batLine,86,input) != NULL)
                {
                    printf("preview:%s\n",batLine);
                    if(i==0)
                    {
                        strcat(command,batLine);
                        printf("rank0 gets: %s\n", command);
                        //system(command);
                    }
                    else
                    {
                        //MPI_Send(buffer,count,type,dest,tag,comm)
                        MPI_Send(batLine,85,MPI_CHAR,i,i,MPI_COMM_WORLD);
                        printf("sent rank%d: %s\n",i,batLine);
                    }
                }
                else
                {
                    exitFlag = true; //flag to break out of while loop
                    break;
                }
            }
            //need to receive data from other nodes here
            //put the data together in proper order
            //and only after that can the next sets be sent out
        }
    }
    else
    {
        char command[89] = "./";
        //MPI_Recv(buffer,count,type,source,tag,comm,status)
        MPI_Recv(sentbatch,86,MPI_CHAR,0,rank,MPI_COMM_WORLD,&stat);
        //using rank as the tag makes it so only the wanted rank gets sent the data
        strcat(command,sentbatch); //adds needed ./ before batch data
        printf("rank=%d recieved data:%s",rank,sentbatch);
        //system(command); //should run batch line
    }

    fclose(input);
    MPI_Finalize();
    return(0);
}
Contents of the file being passed in:
LAMOSTv108 spec-56321-GAC099N59V1_sp01-001.flx spec-56321-GAC099N59V1_sp01-001.nor f
LAMOSTv108 spec-56321-GAC099N59V1_sp01-003.flx spec-56321-GAC099N59V1_sp01-003.nor f
LAMOSTv108 spec-56321-GAC099N59V1_sp01-004.flx spec-56321-GAC099N59V1_sp01-004.nor f
LAMOSTv108 spec-56321-GAC099N59V1_sp01-005.flx spec-56321-GAC099N59V1_sp01-005.nor f
LAMOSTv108 spec-56321-GAC099N59V1_sp01-006.flx spec-56321-GAC099N59V1_sp01-006.nor f
LAMOSTv108 spec-56321-GAC099N59V1_sp01-008.flx spec-56321-GAC099N59V1_sp01-008.nor f
LAMOSTv108 spec-56321-GAC099N59V1_sp01-010.flx spec-56321-GAC099N59V1_sp01-010.nor f
LAMOSTv108 spec-56321-GAC099N59V1_sp01-013.flx spec-56321-GAC099N59V1_sp01-013.nor f
LAMOSTv108 spec-56321-GAC099N59V1_sp01-015.flx spec-56321-GAC099N59V1_sp01-015.nor f
LAMOSTv108 spec-56321-GAC099N59V1_sp01-018.flx spec-56321-GAC099N59V1_sp01-018.nor f
LAMOSTv108 spec-56321-GAC099N59V1_sp01-022.flx spec-56321-GAC099N59V1_sp01-022.nor f
LAMOSTv108 spec-56321-GAC099N59V1_sp01-023.flx spec-56321-GAC099N59V1_sp01-023.nor f
LAMOSTv108 spec-56321-GAC099N59V1_sp01-024.flx spec-56321-GAC099N59V1_sp01-024.nor f
LAMOSTv108 spec-56321-GAC099N59V1_sp01-025.flx spec-56321-GAC099N59V1_sp01-025.nor f
LAMOSTv108 spec-56321-GAC099N59V1_sp01-028.flx spec-56321-GAC099N59V1_sp01-028.nor f
LAMOSTv108 spec-56321-GAC099N59V1_sp01-029.flx spec-56321-GAC099N59V1_sp01-029.nor f
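You'll notice that I'm not yet doing some of the things that will be in the final version; they are commented out to make troubleshooting easier, mainly because the LAMOST code is not fast to run and I don't want to wait for it to finish.
Command that works, and its output: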
$mpiexec -N 4 --host 10.0.0.3 -oversubscribe batTest2 shortpass2.bat
preview:LAMOSTv108 spec-56321-GAC099N59V1_sp01-001.flx spec-56321-GAC099N59V1_sp01-001.nor f
rank0 gets: ./LAMOSTv108 spec-56321-GAC099N59V1_sp01-001.flx spec-56321-GAC099N59V1_sp01-001.nor f
preview:LAMOSTv108 spec-56321-GAC099N59V1_sp01-003.flx spec-56321-GAC099N59V1_sp01-003.nor f
sent rank1: LAMOSTv108 spec-56321-GAC099N59V1_sp01-003.flx spec-56321-GAC099N59V1_sp01-003.nor f
preview:LAMOSTv108 spec-56321-GAC099N59V1_sp01-004.flx spec-56321-GAC099N59V1_sp01-004.nor f
sent rank2: LAMOSTv108 spec-56321-GAC099N59V1_sp01-004.flx spec-56321-GAC099N59V1_sp01-004.nor f
preview:LAMOSTv108 spec-56321-GAC099N59V1_sp01-005.flx spec-56321-GAC099N59V1_sp01-005.nor f
sent rank3: LAMOSTv108 spec-56321-GAC099N59V1_sp01-005.flx spec-56321-GAC099N59V1_sp01-005.nor f
rank=1 recieved data:LAMOSTv108 spec-56321-GAC099N59V1_sp01-003.flx spec-56321-GAC099N59V1_sp01-003.nor f
rank=3 recieved data:LAMOSTv108 spec-56321-GAC099N59V1_sp01-005.flx spec-56321-GAC099N59V1_sp01-005.nor f
rank=2 recieved data:LAMOSTv108 spec-56321-GAC099N59V1_sp01-004.flx spec-56321-GAC099N59V1_sp01-004.nor f
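shortpass2 is the same file, but with only the first 4 lines. In theory my code should work with all 16 lines, but I'll test it with the full file once the current problem is solved.
Command and output when running on multiple nodes: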
$mpiexec -N 4 --host 10.0.0.3,10.0.0.4,10.0.0.5,10.0.0.6 -oversubscribe batTest2 shortpass.bat
preview:LAMOSTv108 spec-56321-GAC099N59V1_sp01-001.flx spec-56321-GAC099N59V1_sp01-001.nor f
rank0 gets: ./LAMOSTv108 spec-56321-GAC099N59V1_sp01-001.flx spec-56321-GAC099N59V1_sp01-001.nor f
preview:LAMOSTv108 spec-56321-GAC099N59V1_sp01-003.flx spec-56321-GAC099N59V1_sp01-003.nor f
sent rank1: LAMOSTv108 spec-56321-GAC099N59V1_sp01-003.flx spec-56321-GAC099N59V1_sp01-003.nor f
preview:LAMOSTv108 spec-56321-GAC099N59V1_sp01-004.flx spec-56321-GAC099N59V1_sp01-004.nor f
rank=1 recieved data:LAMOSTv108 spec-56321-GAC099N59V1_sp01-003.flx spec-56321-GAC099N59V1_sp01-003.nor f
sent rank2: LAMOSTv108 spec-56321-GAC099N59V1_sp01-004.flx spec-56321-GAC099N59V1_sp01-004.nor f
preview:LAMOSTv108 spec-56321-GAC099N59V1_sp01-005.flx spec-56321-GAC099N59V1_sp01-005.nor f
rank=2 recieved data:LAMOSTv108 spec-56321-GAC099N59V1_sp01-004.flx spec-56321-GAC099N59V1_sp01-004.nor f
sent rank3: LAMOSTv108 spec-56321-GAC099N59V1_sp01-005.flx spec-56321-GAC099N59V1_sp01-005.nor f
preview:LAMOSTv108 spec-56321-GAC099N59V1_sp01-006.flx spec-56321-GAC099N59V1_sp01-006.nor f
rank=3 recieved data:LAMOSTv108 spec-56321-GAC099N59V1_sp01-005.flx spec-56321-GAC099N59V1_sp01-005.nor f
sent rank4: LAMOSTv108 spec-56321-GAC099N59V1_sp01-006.flx spec-56321-GAC099N59V1_sp01-006.nor f
preview:LAMOSTv108 spec-56321-GAC099N59V1_sp01-008.flx spec-56321-GAC099N59V1_sp01-008.nor f
rank=4 recieved data:LAMOSTv108 spec-56321-GAC099N59V1_sp01-006.flx spec-56321-GAC099N59V1_sp01-006.nor f
[node2:27622] *** Process received signal ***
[node2:27622] Signal: Segmentation fault (11)
[node2:27622] Signal code: Address not mapped (1)
[node2:27622] Failing at address: (nil)
[node2:27622] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
corrupted double-linked list
Aborted
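Sometimes it successfully gets to rank 5 before it fully aborts, and there are multiple instances of the same error message. Also, Open MPI has been installed with multi-threading support, so that isn't the issue. This is my first time using MPI, but it isn't the first part of the overall project, and I've done a lot of research into MPI just to get to this point.
I know this isn't caused by my arrays, because that would break on node1 as well. All the Pis are identical, so it wouldn't make sense for the arrays to cause a segmentation fault only on some of them. (Although I'll admit I've run into that problem many times in other parts of this project, since I'm more used to the way Java and C# handle arrays.)
Edit: I checked whether I could run it on the 4 cores of one of the other nodes, and it works fine and produces the same output as node1. That confirms it isn't an array issue that only happens on the other nodes. I also added the line that was missing from the code for the preview printout.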
I'm not sure whether this is the problem, but it is definitely a problem:
You are reading into batLine and then sending 85 chars from it:
char batLine[86];
//fgets needs to be character count of longest line + 2 or it fails
if(fgets(batLine,86,input) != NULL)
// ...
MPI_Send(batLine,85,MPI_CHAR,i,i,MPI_COMM_WORLD);
// ...
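Given that batLine[] has 86 elements and that
LAMOSTv108 spec-56321-GAC099N59V1_sp01-001.flx spec-56321-GAC099N59V1_sp01-001.nor f\n
is 85 characters long, the string you send does not include the \0 terminator, which sits in the 86th array element. On the receiving side, you have:
char sentbatch[86];
char command[89] = "./";
// ...
MPI_Recv(sentbatch,86,MPI_CHAR,0,rank,MPI_COMM_WORLD,&stat);
strcat(command,sentbatch);
// ...
sentbatch is never initialised, so it starts out containing garbage. Because every incoming message is 85 characters long, the 86th element is never overwritten and keeps whatever garbage it held initially. If that is not \0, strcat() will keep reading garbage past the 85th character of sentbatch and appending it to command. Since both command and sentbatch live on the stack, the reads continue until a 0x00 byte is found somewhere on the stack, by which point the writes past the end of command will have overwritten other local variables or even the stack frame, which may cause a segfault later on; or the reads continue until the end of the stack is reached, which will definitely cause a segfault. That it sometimes works is purely by chance.
Either change MPI_Send to send 86 characters, or explicitly zero the 86th element of sentbatch. Or, even better, use strncat(command, sentbatch, 85) to append no more than 85 characters, or simply receive directly into command:
MPI_Recv(&command[2],86,MPI_CHAR,0,rank,MPI_COMM_WORLD,&stat);
The initialiser char command[89] = "./"; fills the remaining 87 elements of command[] with \0, so in this case there is no problem with the terminator.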
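The failure mode and the "append no more than the received count" fix can be tried without a cluster. The following is a minimal plain-C sketch, not code from the question or the answer: simulate_receive is a hypothetical stand-in for MPI_Recv that overwrites only the first count bytes of a garbage-filled buffer, and build_command is a hypothetical helper that builds "./<payload>" from an explicit length instead of relying on a terminator. In real MPI code the length could be taken from MPI_Get_count on the returned status; that wiring is assumed here, not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define LINE_CAP 86   /* same size as batLine/sentbatch in the question */
#define CMD_CAP  89   /* same size as command in the question */

/* Hypothetical stand-in for MPI_Recv: the buffer starts out full of
 * non-zero garbage and only the first `count` bytes are overwritten,
 * so no '\0' is stored unless the sender included one. */
void simulate_receive(char buf[LINE_CAP], const char *msg, size_t count)
{
    assert(count <= LINE_CAP);
    memset(buf, 'X', LINE_CAP);   /* models uninitialised stack contents */
    memcpy(buf, msg, count);
}

/* Hypothetical helper: build "./<payload>" from a buffer that is NOT
 * guaranteed to be NUL-terminated. `received` is the number of chars
 * that actually arrived. Returns the length of the resulting string,
 * or 0 (with command set to "") if it would not fit in CMD_CAP. */
size_t build_command(char command[CMD_CAP], const char *payload, size_t received)
{
    const char prefix[] = "./";
    const size_t prefix_len = sizeof prefix - 1;

    if (prefix_len + received + 1 > CMD_CAP) {
        command[0] = '\0';
        return 0;
    }
    memcpy(command, prefix, prefix_len);
    memcpy(command + prefix_len, payload, received);
    command[prefix_len + received] = '\0';   /* terminate explicitly */
    return prefix_len + received;
}
```

Calling simulate_receive(sentbatch, line, 85) with one of the 85-character lines leaves sentbatch[85] holding garbage, which is exactly the condition under which strcat(command, sentbatch) runs off the end of the buffer; build_command(command, sentbatch, 85) never reads that element.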