MPI segmentation fault only on multiple nodes


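I am currently building the basis of a control program that will run across several Raspberry Pis and use all available cores on each Pi. When I test the code on a single node using all of its cores it works fine, but using multiple nodes gives me a segmentation fault.

I have looked at all the similar questions asked previously, but they all involved problems that would also have broken my code on a single node.

Full code: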

#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <stdbool.h>
#include <time.h> 
int main(int argc, char *argv[])
{
        FILE *input;
        char batLine[86];   //may need to be made larger if bat commands get longer
        char sentbatch[86];
        int currentTask;
        int numTasks, rank, rc, i;
        MPI_Status stat;
        bool exitFlag = false;

        //mpi stuff
        MPI_Init(&argc,&argv);  //initialize MPI environment
        MPI_Comm_size(MPI_COMM_WORLD, &numTasks);
        MPI_Comm_rank(MPI_COMM_WORLD,&rank);
        //printf("Number of tasks: %d \n", numTasks);
        //printf ("MPI task %d has started...\n", rank);
        if(argc != 2)
        {
            printf("Usage: batallocation *.bat");
            exit(1); //exit with 1 indicates a failure
        }
        //contains file name: argv[1]
        input = fopen(argv[1],"r");

        currentTask = 0;
        if (rank ==0)
        {
            while(1)
            {
                if(exitFlag)
                    break; //allows to break out of while and for when no more lines exist
                char command[89] = "./";
                for(i=0; i < 4; i++) //will need to be 16 for full testing
                {
                    //fgets needs to be character count of longest line + 2 or it fails
                    if(fgets(batLine,86,input) != NULL)
                    {
                        printf("preview:%s\n",batLine);
                        if(i==0)
                        {
                            strcat(command,batLine);
                            printf("rank0 gets: %s\n", command);
                            //system(command);
                        }
                        else
                        {
                            //MPI_Send(buffer,count,type,dest,tag,comm)
                            MPI_Send(batLine,85,MPI_CHAR,i,i,MPI_COMM_WORLD);
                            printf("sent rank%d: %s\n",i,batLine);
                        }
                    }
                    else
                    {
                        exitFlag = true; //flag to break out of while loop
                        break;
                    }
                }
                //need to receive data from other nodes here
                //put the data together in proper order
                //and only after that can the next sets be sent out
            }
        }
        else
        {
            char command[89] = "./";
            //MPI_Recv(buffer,count,type,source,tag,comm,status)
            MPI_Recv(sentbatch,86,MPI_CHAR,0,rank,MPI_COMM_WORLD,&stat);
            //using rank as the tag makes it so only the wanted rank gets sent the data
            strcat(command,sentbatch); //adds needed ./ before batch data
            printf("rank=%d recieved data:%s",rank,sentbatch);
            //system(command); //should run batch line
        }
        fclose(input);
        MPI_Finalize();
        return(0);
}

Contents of the file being passed in:


LAMOSTv108 spec-56321-GAC099N59V1_sp01-001.flx spec-56321-GAC099N59V1_sp01-001.nor f
LAMOSTv108 spec-56321-GAC099N59V1_sp01-003.flx spec-56321-GAC099N59V1_sp01-003.nor f
LAMOSTv108 spec-56321-GAC099N59V1_sp01-004.flx spec-56321-GAC099N59V1_sp01-004.nor f
LAMOSTv108 spec-56321-GAC099N59V1_sp01-005.flx spec-56321-GAC099N59V1_sp01-005.nor f
LAMOSTv108 spec-56321-GAC099N59V1_sp01-006.flx spec-56321-GAC099N59V1_sp01-006.nor f
LAMOSTv108 spec-56321-GAC099N59V1_sp01-008.flx spec-56321-GAC099N59V1_sp01-008.nor f
LAMOSTv108 spec-56321-GAC099N59V1_sp01-010.flx spec-56321-GAC099N59V1_sp01-010.nor f
LAMOSTv108 spec-56321-GAC099N59V1_sp01-013.flx spec-56321-GAC099N59V1_sp01-013.nor f
LAMOSTv108 spec-56321-GAC099N59V1_sp01-015.flx spec-56321-GAC099N59V1_sp01-015.nor f
LAMOSTv108 spec-56321-GAC099N59V1_sp01-018.flx spec-56321-GAC099N59V1_sp01-018.nor f
LAMOSTv108 spec-56321-GAC099N59V1_sp01-022.flx spec-56321-GAC099N59V1_sp01-022.nor f
LAMOSTv108 spec-56321-GAC099N59V1_sp01-023.flx spec-56321-GAC099N59V1_sp01-023.nor f
LAMOSTv108 spec-56321-GAC099N59V1_sp01-024.flx spec-56321-GAC099N59V1_sp01-024.nor f
LAMOSTv108 spec-56321-GAC099N59V1_sp01-025.flx spec-56321-GAC099N59V1_sp01-025.nor f
LAMOSTv108 spec-56321-GAC099N59V1_sp01-028.flx spec-56321-GAC099N59V1_sp01-028.nor f
LAMOSTv108 spec-56321-GAC099N59V1_sp01-029.flx spec-56321-GAC099N59V1_sp01-029.nor f

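You will notice that there are a few things I have not done yet that will be in the final version; they are commented out to make troubleshooting easier, mainly because the LAMOST code is not fast to run and I do not want to wait for it to finish.

Command that works, and its output: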

 $mpiexec -N 4 --host 10.0.0.3 -oversubscribe batTest2 shortpass2.bat
preview:LAMOSTv108 spec-56321-GAC099N59V1_sp01-001.flx spec-56321-GAC099N59V1_sp01-001.nor f

rank0 gets: ./LAMOSTv108 spec-56321-GAC099N59V1_sp01-001.flx spec-56321-GAC099N59V1_sp01-001.nor f

preview:LAMOSTv108 spec-56321-GAC099N59V1_sp01-003.flx spec-56321-GAC099N59V1_sp01-003.nor f

sent rank1: LAMOSTv108 spec-56321-GAC099N59V1_sp01-003.flx spec-56321-GAC099N59V1_sp01-003.nor f

preview:LAMOSTv108 spec-56321-GAC099N59V1_sp01-004.flx spec-56321-GAC099N59V1_sp01-004.nor f

sent rank2: LAMOSTv108 spec-56321-GAC099N59V1_sp01-004.flx spec-56321-GAC099N59V1_sp01-004.nor f

preview:LAMOSTv108 spec-56321-GAC099N59V1_sp01-005.flx spec-56321-GAC099N59V1_sp01-005.nor f

sent rank3: LAMOSTv108 spec-56321-GAC099N59V1_sp01-005.flx spec-56321-GAC099N59V1_sp01-005.nor f

rank=1 recieved data:LAMOSTv108 spec-56321-GAC099N59V1_sp01-003.flx spec-56321-GAC099N59V1_sp01-003.nor f
rank=3 recieved data:LAMOSTv108 spec-56321-GAC099N59V1_sp01-005.flx spec-56321-GAC099N59V1_sp01-005.nor f
rank=2 recieved data:LAMOSTv108 spec-56321-GAC099N59V1_sp01-004.flx spec-56321-GAC099N59V1_sp01-004.nor f

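shortpass2 is the same file with only the first 4 lines. In theory my code should work with all 16 lines, but I will test it with the full file once the current problem is solved.

Command run on multiple nodes, and its output: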

$mpiexec -N 4 --host 10.0.0.3,10.0.0.4,10.0.0.5,10.0.0.6 -oversubscribe batTest2 shortpass.bat

preview:LAMOSTv108 spec-56321-GAC099N59V1_sp01-001.flx spec-56321-GAC099N59V1_sp01-001.nor f

rank0 gets: ./LAMOSTv108 spec-56321-GAC099N59V1_sp01-001.flx spec-56321-GAC099N59V1_sp01-001.nor f

preview:LAMOSTv108 spec-56321-GAC099N59V1_sp01-003.flx spec-56321-GAC099N59V1_sp01-003.nor f

sent rank1: LAMOSTv108 spec-56321-GAC099N59V1_sp01-003.flx spec-56321-GAC099N59V1_sp01-003.nor f

preview:LAMOSTv108 spec-56321-GAC099N59V1_sp01-004.flx spec-56321-GAC099N59V1_sp01-004.nor f

rank=1 recieved data:LAMOSTv108 spec-56321-GAC099N59V1_sp01-003.flx spec-56321-GAC099N59V1_sp01-003.nor f
sent rank2: LAMOSTv108 spec-56321-GAC099N59V1_sp01-004.flx spec-56321-GAC099N59V1_sp01-004.nor f

preview:LAMOSTv108 spec-56321-GAC099N59V1_sp01-005.flx spec-56321-GAC099N59V1_sp01-005.nor f

rank=2 recieved data:LAMOSTv108 spec-56321-GAC099N59V1_sp01-004.flx spec-56321-GAC099N59V1_sp01-004.nor f
sent rank3: LAMOSTv108 spec-56321-GAC099N59V1_sp01-005.flx spec-56321-GAC099N59V1_sp01-005.nor f

preview:LAMOSTv108 spec-56321-GAC099N59V1_sp01-006.flx spec-56321-GAC099N59V1_sp01-006.nor f

rank=3 recieved data:LAMOSTv108 spec-56321-GAC099N59V1_sp01-005.flx spec-56321-GAC099N59V1_sp01-005.nor f
sent rank4: LAMOSTv108 spec-56321-GAC099N59V1_sp01-006.flx spec-56321-GAC099N59V1_sp01-006.nor f

preview:LAMOSTv108 spec-56321-GAC099N59V1_sp01-008.flx spec-56321-GAC099N59V1_sp01-008.nor f

rank=4 recieved data:LAMOSTv108 spec-56321-GAC099N59V1_sp01-006.flx spec-56321-GAC099N59V1_sp01-006.nor f
[node2:27622] *** Process received signal ***
[node2:27622] Signal: Segmentation fault (11)
[node2:27622] Signal code: Address not mapped (1)
[node2:27622] Failing at address: (nil)
[node2:27622] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
corrupted double-linked list
Aborted

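Sometimes it gets as far as rank 5 before aborting completely, and there are multiple instances of the same error message. Also, Open MPI was installed with multithreading support, so that is not the issue. This is my first time using MPI, but it is not the first part of the overall project, and I did a lot of research on MPI just to get this far.

I know this is not caused by my arrays, because that would also break on node1. All the Pis are identical, so it would not make sense for the arrays to cause the segmentation fault. (Although I admit I have run into that problem several times in other parts of this project, because I am more used to the way Java and C# handle arrays.)

Edit: I checked whether it runs on the 4 cores of one of the other nodes; it works fine and produces the same output as on node1. That confirms this is not an array problem that only occurs on the other nodes. I also added the line that was missing from the code for the preview printout.

Answer

I'm not sure whether this is the problem, but it is certainly a problem:

You are reading into batLine and then sending 85 characters from it: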

char batLine[86];
// ...
//fgets needs to be character count of longest line + 2 or it fails
if(fgets(batLine,86,input) != NULL)
// ...
MPI_Send(batLine,85,MPI_CHAR,i,i,MPI_COMM_WORLD);
// ...

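Given that batLine[] has 86 elements and that LAMOSTv108 spec-56321-GAC099N59V1_sp01-001.flx spec-56321-GAC099N59V1_sp01-001.nor f\n is 85 characters long, the string you send does not include the \0 terminator, which sits in the 86th array element.

On the receiving side you have: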
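The arithmetic above can be checked with a small sketch. The helper name terminator_is_sent is hypothetical and not part of the original program; it only assumes that the line is already a valid C string, as it is after a successful fgets().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not in the original code): returns true if sending
   `count` chars starting at `line` also transmits the '\0' terminator.
   The terminator sits at index strlen(line), so the count must be at
   least strlen(line) + 1. */
static bool terminator_is_sent(const char *line, int count)
{
    assert(line != NULL);
    return count > 0 && (size_t)count > strlen(line);
}
```

For an 85-character line, a count of 85 stops one element short of the terminator, while a count of 86 (or, more generally, strlen(line) + 1) includes it.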

char sentbatch[86];
char command[89] = "./";
// ...
MPI_Recv(sentbatch,86,MPI_CHAR,0,rank,MPI_COMM_WORLD,&stat);
strcat(command,sentbatch);
// ...

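sentbatch is never initialized, so initially it contains garbage. Because every incoming message is 85 characters long, the 86th character is never overwritten and keeps whatever garbage was there. If that is not \0, strcat() will keep reading garbage past the end of sentbatch and appending it to command. Since both command and sentbatch are on the stack, the reads continue until a 0x00 is found somewhere on the stack, by which time the writes past the end of command will have overwritten other local variables or even the stack frame, which may cause a segfault later; or they continue until the end of the stack is reached, which definitely causes a segfault. When it works on some ranks, that is only by chance.

Either change MPI_Send to send 86 characters, or explicitly zero the 86th element of sentbatch. Or, even better, use strncat(command, sentbatch, 85) to append no more than 85 characters, or receive directly into command: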
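The effect on the receive buffer can be illustrated without MPI. In the sketch below, memcpy stands in for what MPI_Recv writes and the 'X' fill stands in for uninitialized stack contents; both are assumptions for the demonstration, and terminated_after_receive is a hypothetical name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical simulation of the receive side: the buffer starts out as
   non-zero "garbage" and only the first `received` bytes are overwritten.
   Returns true if the buffer holds a '\0' anywhere afterwards, i.e. if it
   is safe to pass to strcat(). `msg` must have at least `received`
   readable bytes. */
static bool terminated_after_receive(char *buf, size_t capacity,
                                     const char *msg, size_t received)
{
    assert(received <= capacity);
    memset(buf, 'X', capacity);   /* stand-in for uninitialized contents */
    memcpy(buf, msg, received);   /* stand-in for the bytes MPI_Recv writes */
    return memchr(buf, '\0', capacity) != NULL;
}
```

With an 85-character message and received set to 85, no terminator ends up in the 86-element buffer; with 86 the terminator is copied as well. Real uninitialized memory may happen to contain a zero byte, which would explain why the original program sometimes appears to work.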

MPI_Recv(&command[2],86,MPI_CHAR,0,rank,MPI_COMM_WORLD,&stat);

char command[89] = "./"; fills the remaining 87 elements of command[] with \0, so there is no terminator problem in this case.
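Another option in the same spirit, sketched here as an assumption rather than as part of the original answer, is to terminate the buffer explicitly from the number of characters actually received. With MPI that number could be obtained by calling MPI_Get_count on the status returned by MPI_Recv; the helper terminate_received below is hypothetical and takes the count as a plain argument so that it can be tried without MPI.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: writes a '\0' right after the received characters.
   `count` is the number of chars actually received. If the message fills
   the whole buffer, the last character is overwritten so that the result
   is always a valid C string. Returns the index where the terminator was
   written. */
static size_t terminate_received(char *buf, size_t capacity, size_t count)
{
    assert(buf != NULL && capacity > 0);
    if (count >= capacity)
        count = capacity - 1;   /* keep room for the terminator */
    buf[count] = '\0';
    return count;
}
```

After such a call the buffer can safely be handed to strcat() regardless of whether the sender included the terminator in its count.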
