Need help debugging parallel matrix multiplication using MPI
Posted: 2015-06-14 07:27:27

Question:

I'm currently writing a program in C that performs matrix multiplication in parallel using MPI. I'm very new to C and MPI, so this is very rough code. I can't seem to get my code to work, so could someone read through it and help me understand what I need to do to fix it?
Here is the code:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <mpi.h>

// code adapted from source codes from
// http://www.programiz.com/c-programming/c-multi-dimensional-arrays
// http://www.cs.hofstra.edu/~cscccl/csc145/imul.c

// GENERAL VARIABLES
int **A, **B, **AB;
int i,j,k;
int rows_A, cols_A, rows_B, cols_B;
int dimensions[3];

// MATRIX MULTIPLICATION
void matrixMult(int start, int interval)
{
    for (i = start; i < start+interval; ++i)
        for (j = 0; j < cols_B; ++j)
            for (k = 0; k < cols_A; ++k)
                AB[i][j] += (A[i][k] * B[k][j]);
}

int main(int argc, char *argv[])
{
    // MPI VARIABLES, INITIALIZE MPI
    int rank, size, interval, remainder;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0)
    {
        // READ AND WRITE MATRICES ------------------------------------
        FILE *matrix1, *matrix2;
        matrix1 = fopen("matrix1", "r");
        fscanf(matrix1, "%d", &rows_A);
        fscanf(matrix1, "%d", &cols_A);
        matrix2 = fopen("matrix2", "r");
        fscanf(matrix2, "%d", &rows_B);
        fscanf(matrix2, "%d", &cols_B);

        int dimensions[3] = {rows_A, cols_A, cols_B};

        /*printf("\n\nRows A = %d",rows_A);
        printf("\nCols A = %d",cols_A);
        printf("\n\nRows B = %d",rows_B);
        printf("\nCols B = %d",cols_B);*/

        // Allocate memory for matrices
        int **A = malloc(rows_A * sizeof(int*));
        // The cast to size_t prevents integer overflow with big matrices
        A[0] = malloc((size_t)rows_A * (size_t)cols_A * sizeof(int));
        for(i = 1; i < rows_A; i++)
            A[i] = A[0] + i*cols_A;

        int **B = malloc(rows_B * sizeof(int*));
        // The cast to size_t prevents integer overflow with big matrices
        B[0] = malloc((size_t)rows_B * (size_t)cols_B * sizeof(int));
        for(i = 1; i < rows_A; i++)
            B[i] = B[0] + i*cols_B;

        int **AB = malloc(rows_A * sizeof(int*));
        // The cast to size_t prevents integer overflow with big matrices
        AB[0] = malloc((size_t)rows_A * (size_t)cols_B * sizeof(int));
        for(i = 1; i < rows_A; i++)
            AB[i] = AB[0] + i*cols_B;

        /*int **A = (int **)malloc(rows_A * sizeof(int*));
        for(i = 0; i < rows_A; i++)
            A[i] = (int *)malloc(cols_A * sizeof(int));
        int **B = (int **)malloc(rows_B * sizeof(int*));
        for(i = 0; i < rows_B; i++)
            B[i] = (int *)malloc(cols_B * sizeof(int));
        int **AB = (int **)malloc(rows_A * sizeof(int*));
        for(i = 0; i < rows_B; i++)
            AB[i] = (int *)malloc(cols_B * sizeof(int));*/

        // Write matrices
        while(!feof(matrix1))
            for(i=0;i<rows_A;i++)
                for(j=0;j<cols_A;j++)
                    fscanf(matrix1,"%d",&A[i][j]);
        while(!feof(matrix2))
            for(i=0;i<rows_B;i++)
                for(j=0;j<cols_B;j++)
                    fscanf(matrix2,"%d",&B[i][j]);

        /*
        // Print Matrices
        printf("\n\n");
        //print matrix 1
        printf("Matrix A:\n");
        for(i=0;i<rows_A;i++)
        {
            for(j=0;j<cols_A;j++)
                printf("%d\t",A[i][j]);
            printf("\n");
        }
        printf("\n");
        //print matrix 2
        printf("Matrix B:\n");
        for(i=0;i<rows_B;i++)
        {
            for(j=0;j<cols_B;j++)
                printf("%d\t",B[i][j]);
            printf("\n");
        } */

        // ------------------------------------------------------------------
        // MULTIPLICATION (Parallelize here)
        printf("begin rank 0\n");
        interval = rows_A / size; // work per processor
        remainder = rows_A % size;

        // SEND B BROADCAST to all
        MPI_Bcast(B, rows_B * cols_B, MPI_INT, 0, MPI_COMM_WORLD);
        printf("1\n");

        // SEND A, ROWS, COLS, interval to each rank
        for(i=1;i<size;i++)
            MPI_Send(dimensions,3,MPI_INT,i,123,MPI_COMM_WORLD);
        printf("2\n");
        for(i=1;i<size;i++)
            MPI_Send(A[i*interval],interval*rows_A,MPI_INT,i,123,MPI_COMM_WORLD);
        printf("3\n");

        // ROOT MM
        matrixMult(0, interval);
        printf("3.5\n");
        matrixMult(size * interval, remainder);
        printf("4\n");

        // receive AB from workers, add to current AB
        for(i=1;i<size;i++)
            MPI_Recv(AB[i*interval],interval*rows_A,MPI_INT,i,123,MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("5\n");

        // PRINT MATRIX PRODUCT
        printf("\nSum Of Matrix:\n");
        for(i = 0; i < rows_A; ++i)
        {
            for(j = 0; j < cols_B; ++j)
            {
                printf("%d\t",AB[i][j]);
                if(j == cols_B - 1)   /* To display matrix sum in order. */
                    printf("\n");
            }
        }

        // CLOSE FILES
        fclose(matrix1);
        fclose(matrix2);
    }
    else   // WORKER NODES
    {
        printf("bring workers\n");

        // RECEIVE B BROADCAST
        MPI_Bcast(B, rows_B * cols_B, MPI_INT, 0, MPI_COMM_WORLD);
        printf("a\n");

        // RECEIVE A, INTERVAL
        MPI_Recv(dimensions,3,MPI_INT,0,123, MPI_COMM_WORLD,MPI_STATUS_IGNORE);
        printf("b\n");
        rows_A = dimensions[0];
        cols_A = dimensions[1];
        cols_B = dimensions[2];
        printf("c\n");
        MPI_Recv(A[rank*interval],interval*rows_A,MPI_INT,0,123, MPI_COMM_WORLD,MPI_STATUS_IGNORE);
        printf("d\n");

        // WORKER MM
        matrixMult(rank*interval, interval);
        printf("e\n");

        // send AB to root
        MPI_Send(AB[rank*interval],interval*rows_A,MPI_INT,0,123,MPI_COMM_WORLD);
        printf("f\n");
    }

    // FINALIZE MPI
    MPI_Finalize();   /* EXIT MPI */
}
I put in some prints to try to see where my code is failing, and it looks like it reaches the actual matrix multiplication part both in the workers and in the rank 0 root. Does that mean something is wrong with my receives? The input is a 2x3 matrix of 1 2 3 4 5 6 and a 3x2 matrix of 7 8 9 10 11 12, and the output looks like this:
hjiang1@cook:~/cs287/PMatrixMultiply$ make
mpicc parallelMatrixMult.c -std=c99 -lm -o parallelMatrix.out
hjiang1@cook:~/cs287/PMatrixMultiply$ mpirun --hostfile QuaCS parallelMatrix.out
No protocol specified
No protocol specified
bring workers
a
bring workers
a
bring workers
a
begin rank 0
1
2
b
c
b
c
b
c
3
d
e
d
3.5
[cook:06730] *** Process received signal ***
[cook:06730] Signal: Segmentation fault (11)
[cook:06730] Signal code: Address not mapped (1)
[cook:06730] Failing at address: 0xffffffffbbc4d600
[cook:06728] *** Process received signal ***
[cook:06728] Signal: Segmentation fault (11)
[cook:06728] Signal code: Address not mapped (1)
[cook:06728] Failing at address: 0x5d99f200
[cook:06727] *** Process received signal ***
[cook:06730] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0xfcb0)[0x7fdaa80eccb0]
[cook:06730] [ 1] [cook:06728] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x147b55)[0x7fdaa7e65b55]
[cook:06730] [ 2] /usr/local/lib/openmpi/mca_btl_vader.so(+0x23f9)[0x7fda9e70f3f9]
[cook:06730] [ 3] /usr/local/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_start_rndv+0x1d3)[0x7fda9e0df393]
[cook:06730] [ 4] /usr/local/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x754)[0x7fda9e0d5404]
[cook:06730] [ 5] /lib/x86_64-linux-gnu/libpthread.so.0(+0xfcb0)[0x7f910bef2cb0]
[cook:06728] [ 1] parallelMatrix.out[0x400bad]
[cook:06728] [ 2] parallelMatrix.out[0x401448]
[cook:06728] [ 3] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed)[0x7f910bb4576d]
[cook:06728] [ 4] parallelMatrix.out[0x400a79]
[cook:06728] *** End of error message ***
/usr/local/lib/libmpi.so.1(PMPI_Send+0xf2)[0x7fdaa8368332]
[cook:06730] [ 6] parallelMatrix.out[0x401492]
[cook:06730] [ 7] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed)[0x7fdaa7d3f76d]
[cook:06730] [ 8] parallelMatrix.out[0x400a79]
[cook:06730] *** End of error message ***
[cook:06727] Signal: Segmentation fault (11)
[cook:06727] Signal code: Address not mapped (1)
[cook:06727] Failing at address: (nil)
[cook:06727] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0xfcb0)[0x7f73e0d09cb0]
[cook:06727] [ 1] parallelMatrix.out[0x400bad]
[cook:06727] [ 2] [cook:6729] *** An error occurred in MPI_Recv
[cook:6729] *** reported by process [1864040449,2]
[cook:6729] *** on communicator MPI_COMM_WORLD
[cook:6729] *** MPI_ERR_COUNT: invalid count argument
[cook:6729] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[cook:6729] *** and potentially your MPI job)
If anyone can help, it would be greatly appreciated. Again, I'm new to C and MPI, so please bear with how rough my code is.
Answer 1:

This is the same mistake I see over and over again: when working with MPI, use flat arrays, i.e. allocate the matrix as one contiguous block of memory rather than allocating each row separately. That is, instead of:
int **A = (int **)malloc(rows_A * sizeof(int*));
for(i = 0; i < rows_A; i++)
    A[i] = (int *)malloc(cols_A * sizeof(int));
you should use:
int **A = malloc(rows_A * sizeof(int*));
// The cast to size_t prevents integer overflow with big matrices
A[0] = malloc((size_t)rows_A * (size_t)cols_A * sizeof(int));
for(i = 1; i < rows_A; i++)
    A[i] = A[0] + i*cols_A;
Free such a matrix like this:
free(A[0]);
free(A);
That said, there is another kind of error in your code:
MPI_Recv(A+(i*interval), ...);
MPI_Send(A+(i*interval), ...);

A is an array of pointers to the individual rows. A+i is a pointer to the i-th element of that array. What you are passing to MPI is therefore not the actual address of the row data in memory, but a pointer to a pointer to that data. The correct expression (assuming you have allocated the memory in a single block as described above) is:

MPI_Recv(A[i*interval], ...);

or

MPI_Recv(*(A + i*interval), ...);

In other words, array[index] is equivalent to *(array + index), not to array + index.
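To make both points concrete, here is a minimal, self-contained sketch (new code, not from the original question or answer) of sending blocks of rows from a contiguously allocated matrix. The sizes and tag are made up, it assumes the row count divides evenly among the ranks, and note that the element count for a block of rows is interval * cols (rows sent times elements per row):

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int rows = 4, cols = 3;      // toy sizes, known on all ranks
    const int interval = rows / size;  // assumes size divides rows evenly

    // Contiguous allocation on EVERY rank: A[0] owns the whole block,
    // each A[i] is just a pointer into it.
    int **A = malloc(rows * sizeof(int*));
    A[0] = malloc((size_t)rows * (size_t)cols * sizeof(int));
    for (int i = 1; i < rows; i++)
        A[i] = A[0] + i * cols;

    if (rank == 0) {
        for (int i = 0; i < rows; i++)
            for (int j = 0; j < cols; j++)
                A[i][j] = i * cols + j;
        // A[r*interval] is the address of the first element of row
        // r*interval, so this sends `interval` complete rows.
        for (int r = 1; r < size; r++)
            MPI_Send(A[r * interval], interval * cols, MPI_INT,
                     r, 123, MPI_COMM_WORLD);
    } else {
        MPI_Recv(A[rank * interval], interval * cols, MPI_INT,
                 0, 123, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank %d received row %d, first element %d\n",
               rank, rank * interval, A[rank * interval][0]);
    }

    free(A[0]);
    free(A);
    MPI_Finalize();
    return 0;
}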
Comments:
This may not be the right place for it, but I've been thinking lately about "making MPI easier": if this way of allocating 2D arrays is so common, shouldn't there be a simple way to express it in MPI? An hindexed-block type would describe rows_A arrays of size cols_B. 25 years on, I'm surprised there still isn't some "MPI sugar" to do this for us poor users.

One can easily construct a datatype that describes a single instance of such a matrix (e.g. by using absolute addresses in combination with MPI_BOTTOM), but it can only be used with that one particular instance. The MPI type model, built from (basic type, offset) tuples, has no notion of dereferencing a pointer.

Thanks for the quick reply. I tried the fix, but I still end up with the same error. Is there a logic error somewhere in my code?

Edit your question and replace the code with the fixed version.

Answer 2:
If you are familiar with gdb, keep in mind that you can always use it to debug MPI:
mpirun -np 4 xterm -e gdb my_mpi_application
This will open 4 terminals, in each of which you can run gdb on one of the processes.
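As an aside (not part of the original answer): the xterm trick needs a working X display, and the repeated "No protocol specified" lines in your output suggest X11 authorization trouble on that host. A common fallback is to have each process print its PID and spin until you attach a debugger and release it by hand. The helper below is a sketch of that pattern; the function name is invented:

#include <stdio.h>
#include <unistd.h>

// Call once early in main(), after MPI_Init. Attach with
// `gdb -p <pid>`, then run `set var holding = 0` and `continue`.
static void wait_for_debugger(void)
{
    volatile int holding = 1;
    char host[256];
    gethostname(host, sizeof(host));
    printf("PID %d on %s waiting for debugger\n", (int)getpid(), host);
    fflush(stdout);
    while (holding)
        sleep(1);
}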
Answer 3:

You seem to allocate memory only on the root process:
if (rank == 0)
{
    // READ AND WRITE MATRICES ------------------------------------

    // Allocate memory for matrices
    int **A = malloc(rows_A * sizeof(int*));
    // The cast to size_t prevents integer overflow with big matrices
    A[0] = malloc((size_t)rows_A * (size_t)cols_A * sizeof(int));

    int **B = malloc(rows_B * sizeof(int*));
    // The cast to size_t prevents integer overflow with big matrices
    B[0] = malloc((size_t)rows_B * (size_t)cols_B * sizeof(int));

    int **AB = malloc(rows_A * sizeof(int*));
    // The cast to size_t prevents integer overflow with big matrices
    AB[0] = malloc((size_t)rows_A * (size_t)cols_B * sizeof(int));
The first thing I would suggest is to separate reading the matrix parameters from the allocation. Then, after you have read the matrix sizes, you should broadcast them and only then do the allocation, on all processes.
Also, by declaring int **A inside the rank == 0 branch, you shadow the declaration of A at the beginning of your code.
Something like this:
if(rank == 0)
{
    // Read rows_A, cols_A, rows_B, cols_B
    ....
}

MPI_Bcast(&rows_A, 1, MPI_INT, 0, MPI_COMM_WORLD);
MPI_Bcast(&rows_B, 1, MPI_INT, 0, MPI_COMM_WORLD);
MPI_Bcast(&cols_A, 1, MPI_INT, 0, MPI_COMM_WORLD);
MPI_Bcast(&cols_B, 1, MPI_INT, 0, MPI_COMM_WORLD);

// allocate memory
....

if(rank == 0)
{
    // read matrix
    ....
}

// broadcast matrices
....
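Tying that together with the contiguous-block layout from the first answer, a small helper that every rank calls after the broadcasts might look like the sketch below. This is illustrative code under the assumptions above, not tested against the question's program, and the function name is made up:

#include <stdlib.h>

// Allocate a rows x cols matrix as one contiguous block plus a row-pointer
// array, so M[i][j] works and M[0] can be passed straight to MPI calls.
// Returns NULL on failure; free with free(M[0]); free(M);
int **alloc_matrix(int rows, int cols)
{
    int **M = malloc(rows * sizeof(int*));
    if (M == NULL)
        return NULL;
    M[0] = malloc((size_t)rows * (size_t)cols * sizeof(int));
    if (M[0] == NULL) {
        free(M);
        return NULL;
    }
    for (int i = 1; i < rows; i++)
        M[i] = M[0] + i * cols;
    return M;
}

After the four MPI_Bcast calls, every rank (not just the root) would then do A = alloc_matrix(rows_A, cols_A); B = alloc_matrix(rows_B, cols_B); AB = alloc_matrix(rows_A, cols_B); so that the later broadcast of B and the sends and receives of A and AB have valid buffers on the workers too.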