MPI process exiting on signal 11

【Posted】2020-01-22 06:21:19 【Question】:

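I have a simple MPI program that uses MPI_Scatter and MPI_Gather. It takes the number of elements to send to each process as a command-line argument, then uses MPI_Scatter and MPI_Gather to average random numbers across the processes.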

// Creates an array of random numbers, each between 0 and 1
float *create_rand_nums(int num_elements) {
    float *rand_nums = (float *)malloc(sizeof(float) * num_elements);
    assert(rand_nums != NULL);
    int i;
    for (i = 0; i < num_elements; i++) {
        *(rand_nums+i) = (rand() / (float)RAND_MAX);
        printf("Create_rand_nums func val %f \n", *(rand_nums+i));
        //printf("Create_rand_nums func address %f \n", (rand_nums+i));
    }
    return rand_nums;
}

// Computes the average of an array of numbers
float compute_avg(float *array, int num_elements) {
    float sum = 0.f;
    int i;
    for (i = 0; i < num_elements; i++) {
        sum += array[i];
    }
    return sum / num_elements;
}


int main(int argc, char *argv[]) {
  if (argc != 2) {
    fprintf(stderr, "Usage: avg num_elements_per_proc\n");
    exit(1);
  }

  int num_elements_per_proc = atoi(argv[0]);
  // Seed the random number generator to get different results each time
  srand(time(NULL));

  MPI_Init(NULL, NULL);
  int world_rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
  int world_size;
  MPI_Comm_size(MPI_COMM_WORLD, &world_size);

  // Create a random array of elements on the root process. Its total
  // size will be the number of elements per process times the number of processes
  float *rand_nums = NULL;
  if (world_rank == 0) {
    rand_nums = create_rand_nums(num_elements_per_proc * world_size);
  }

  // For each process, create a buffer that will hold a subset of the entire array
  float *sub_rand_nums = (float *)malloc(sizeof(float) * num_elements_per_proc);
  assert(sub_rand_nums != NULL);

  // Scatter the random numbers from the root process to all processes in the MPI world
  MPI_Scatter(rand_nums, num_elements_per_proc, MPI_FLOAT, sub_rand_nums, num_elements_per_proc, MPI_FLOAT, 0, MPI_COMM_WORLD);

  // Compute the average of your subset
  float sub_avg = compute_avg(sub_rand_nums, num_elements_per_proc);
  int i;
  printf("Avg inside %f \n", sub_avg);

  // Gather all partial averages down to the root process
  float *sub_avgs = NULL;
  if (world_rank == 0) {
    sub_avgs = (float *)malloc(sizeof(float) * world_size);
    assert(sub_avgs != NULL);
  }
  MPI_Gather(&sub_avg, 1, MPI_FLOAT, sub_avgs, 1, MPI_FLOAT, 0, MPI_COMM_WORLD);

  for (i = 0; i < world_size; i++) {
    printf("Avg %f \n", *(sub_avgs+i));
  }

  // Now that we have all of the partial averages on the root, compute the
  // total average of all numbers. Since we are assuming each process computed
  // an average across an equal amount of elements, this computation will
  // produce the correct answer.
  if (world_rank == 0) {
    float avg = compute_avg(sub_avgs, world_size);
    printf("Avg of all elements is %f\n", avg);
    // Compute the average across the original data for comparison
    float original_data_avg =
      compute_avg(rand_nums, num_elements_per_proc * world_size);
    printf("Avg computed across original data is %f\n", original_data_avg);
  }

  // Clean up
  if (world_rank == 0) {
    free(rand_nums);
    free(sub_avgs);
  }
  free(sub_rand_nums);

  MPI_Barrier(MPI_COMM_WORLD);
  MPI_Finalize();
}

When I run this code, I get:

mpicc -o MPI_Scatter_MPI_Gather MPI_Scatter_MPI_Gather.cc

mpirun --oversubscribe -host localhost -np 4 ./MPI_Scatter_MPI_Gather 2

Avg inside nan 
Avg inside nan 
Avg inside nan 
Avg inside nan 
[dhcp-10-142-19] *** Process received signal ***
[dhcp-10-142-19] Signal: Segmentation fault: 11 (11)
[dhcp-10-142-19] Signal code: Address not mapped (1)
[dhcp-10-142-19] Failing at address: 0x0
[dhcp-10-142-19] [ 0] 0   libsystem_platform.dylib            0x00007fff50b65f5a _sigtramp + 26
[dhcp-10-142-19] [ 1] 0   ???                                 0x0000000000000000 0x0 + 0
[dhcp-10-142-19] [ 2] 0   libdyld.dylib                       0x00007fff508e5145 start + 1
[dhcp-10-142-19] [ 3] 0   ???                                 0x0000000000000002 0x0 + 2
[dhcp-10-142-19] *** End of error message ***
[dhcp-10-142-19] *** Process received signal ***
[dhcp-10-142-19] Signal: Segmentation fault: 11 (11)
[dhcp-10-142-19] Signal code: Address not mapped (1)
[dhcp-10-142-19] Failing at address: 0x0
[dhcp-10-142-19] [ 0] 0   libsystem_platform.dylib            0x0000--------------------------------------------------------------------------
mpirun noticed that process rank 3 with PID 0 on node dhcp-10-142-194-10 exited on signal 11 (Segmentation fault: 11).
--------------------------------------------------------------------------
7fff50b65f5a _sigtramp + 26
[dhcp-10-142-19] [ 1] 0   ???                                 0x0000000000000000 0x0 + 0
[dhcp-10-142-19] [ 2] 0   libdyld.dylib                       0x00007fff508e5145 start + 1
[dhcp-10-142-19] [ 3] 0   ???                                 0x0000000000000002 0x0 + 2
[dhcp-10-142-19] *** End of error message ***
[dhcp-10-142-19] *** Process received signal ***
[dhcp-10-142-19] Signal: Segmentation fault: 11 (11)
[dhcp-10-142-19] Signal code: Address not mapped (1)
[dhcp-10-142-19] Failing at address: 0x0
[dhcp-10-142-19] [ 0] 0   libsystem_platform.dylib            0x00007fff50b65f5a _sigtramp + 26
[dhcp-10-142-19] [ 1] 0   ???                                 0x0000000000000000 0x0 + 0
[dhcp-10-142-19] [ 2] 0   libdyld.dylib                       0x00007fff508e5145 start + 1
[dhcp-10-142-19] [ 3] 0   ???                                 0x0000000000000002 0x0 + 2
[dhcp-10-142-19] *** End of error message ***
[dhcp-10-142-19] *** Process received signal ***
[dhcp-10-142-19] Signal: Segmentation fault: 11 (11)
[dhcp-10-142-19] Signal code:  (0)
[dhcp-10-142-19] Failing at address: 0x0
[dhcp-10-142-19] [ 0] 0   libsystem_platform.dylib            0x00007fff50b65f5a _sigtramp + 26
[dhcp-10-142-19] [ 1] 0   ???                                 0x0000000000000000 0x0 + 0
[dhcp-10-142-19] [ 2] 0   libdyld.dylib                       0x00007fff508e5145 start + 1
[dhcp-10-142-19] [ 3] 0   ???                                 0x0000000000000002 0x0 + 2
[dhcp-10-142-19] *** End of error message ***

I want to run this program on a single 2-core machine first, and then on multiple machines. But it fails even on a single machine. Please help.

【Comments】:

Use atoi(argv[1]), and print sub_avgs only on rank 0. 【Answer 1】:

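As pointed out in Gilles Gouaillardet's comment, atoi(argv[0]) is wrong and should be atoi(argv[1]). argv[0] is the program name, so atoi returns 0, every process averages zero elements, and 0/0 produces the nan values in the output. In addition, sub_avgs should only be printed on rank 0: it is NULL on every other rank, so the unguarded print loop dereferences a null pointer there, which matches the "Failing at address: 0x0" lines.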
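The two fixes can be sketched without MPI as a pair of small helpers. This is only an illustrative sketch: `parse_num_elements` and `print_partial_averages` are hypothetical names that do not appear in the original post, and using `strtol` with validation instead of `atoi` is an extra precaution, not something the answer asks for.

```c
#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper (not in the original post): read the per-process
 * element count from argv[1]. argv[0] is the program name, so it must not
 * be parsed. Returns the count, or -1 if the argument is missing or is not
 * a positive integer that fits in an int. */
int parse_num_elements(int argc, char *argv[])
{
    if (argc != 2) {
        return -1;
    }
    errno = 0;
    char *end = NULL;
    long value = strtol(argv[1], &end, 10);
    if (errno != 0 || end == argv[1] || *end != '\0' ||
        value <= 0 || value > INT_MAX) {
        return -1;
    }
    return (int)value;
}

/* Hypothetical helper: print the gathered partial averages only on the
 * root rank, where the receive buffer was allocated. On every other rank
 * sub_avgs is NULL and nothing is printed.
 * Returns the number of values printed. */
int print_partial_averages(int world_rank, const float *sub_avgs,
                           int world_size)
{
    if (world_rank != 0 || sub_avgs == NULL) {
        return 0;
    }
    for (int i = 0; i < world_size; i++) {
        printf("Avg %f \n", sub_avgs[i]);
    }
    return world_size;
}
```

In the question's main, the first helper would stand in for `atoi(argv[0])` (exiting with the usage message when it returns -1), and the second would stand in for the unguarded loop after MPI_Gather, called as `print_partial_averages(world_rank, sub_avgs, world_size)`.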
