Scatter Matrix Blocks of Different Sizes using MPI

Posted: 2015-06-02 05:55:24

Question:

(Assume all matrices are stored in row-major order.) One example that illustrates the problem is distributing a 10x10 matrix over a 3x3 grid, so that the sizes of the submatrices on the nodes look like

|-----+-----+-----|
| 3x3 | 3x3 | 3x4 |
|-----+-----+-----|
| 3x3 | 3x3 | 3x4 |
|-----+-----+-----|
| 4x3 | 4x3 | 4x4 |
|-----+-----+-----|

I have seen many related posts (e.g. sending blocks of 2D array in C using MPI and MPI partition matrix into blocks), but they only deal with blocks of equal size (in which case we can simply use MPI_Type_vector or MPI_Type_create_subarray and a single MPI_Scatterv call).
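For reference, the equal-sized-block technique those posts describe boils down to one resized subarray type plus a single MPI_Scatterv; here is a minimal sketch of that baseline for contrast (the sizes and names are illustrative, not taken from those posts):

/* Uniform 3x3 blocks of a 9x9 global array on a 3x3 grid: one subarray
 * type, resized to an extent of 1 char so that per-rank displacements
 * can be given in elements, then a single scatterv with counts of 1. */
int gsizes[2] = {9, 9};
int lsizes[2] = {3, 3};
int starts[2] = {0, 0};
MPI_Datatype block, blocktype;
MPI_Type_create_subarray(2, gsizes, lsizes, starts, MPI_ORDER_C, MPI_CHAR, &block);
MPI_Type_create_resized(block, 0, sizeof(char), &blocktype);
MPI_Type_commit(&blocktype);
/* ...fill sendcounts[proc] = 1 and displs[proc] = row*3*9 + col*3, then MPI_Scatterv */

This only works because every block has the same footprint; the 10x10-on-3x3 case above breaks that assumption.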

So I would like to know the most efficient way, in MPI, to scatter a matrix over a grid of processors where each processor gets a block of a specified size.

P.S. I have also looked at MPI_Type_create_darray, but it does not seem to let you specify a block size for each individual processor.
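For context, MPI_Type_create_darray describes the distribution only globally — one distribution scheme and argument per dimension — with no per-rank size parameter, which is why it does not fit here. Its prototype from the MPI standard:

int MPI_Type_create_darray(int size, int rank, int ndims,
                           const int array_of_gsizes[],
                           const int array_of_distribs[],
                           const int array_of_dargs[],
                           const int array_of_psizes[],
                           int order, MPI_Datatype oldtype, MPI_Datatype *newtype);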

Comments:

@Patrick Thanks for the comments. I don't think MPI_Type_indexed would work, because a single type can still only describe blocks of one particular size.

Answer 1:

Not sure if this applies to you, but it has helped me in the past, so it might be useful to others.

My answer applies in the context of parallel IO. The thing is, if you know your accesses do not overlap, you can successfully write and read even variable-sized chunks by using MPI_COMM_SELF.

A piece of code I use every day contains:

MPI_File fh;
MPI_File_open(MPI_COMM_SELF, path.c_str(), MPI_MODE_CREATE|MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

// Lot of computation to get the size right

MPI_Datatype filetype;
MPI_Type_create_subarray(gsizes.size(), &gsizes[0], &lsizes[0], &offset[0], MPI_ORDER_C, MPI_FLOAT, &filetype);
MPI_Type_commit(&filetype);

MPI_File_set_view(fh, 0, MPI_FLOAT, filetype, "native", MPI_INFO_NULL);
MPI_File_write(fh, &block->field[0], block->field.size(), MPI_FLOAT, MPI_STATUS_IGNORE);
MPI_File_close(&fh);    
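The matching read is symmetric — same view, MPI_MODE_RDONLY, and a read call. A minimal sketch, under the assumption that gsizes, lsizes, and offset are computed the same way on the reading side:

/* Hypothetical read-side counterpart (my sketch, not from the answer):
 * same subarray view, read-only mode, and MPI_File_read instead of write. */
MPI_File fh;
MPI_File_open(MPI_COMM_SELF, path.c_str(), MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);

MPI_Datatype filetype;
MPI_Type_create_subarray(gsizes.size(), &gsizes[0], &lsizes[0], &offset[0], MPI_ORDER_C, MPI_FLOAT, &filetype);
MPI_Type_commit(&filetype);

MPI_File_set_view(fh, 0, MPI_FLOAT, filetype, "native", MPI_INFO_NULL);
MPI_File_read(fh, &block->field[0], block->field.size(), MPI_FLOAT, MPI_STATUS_IGNORE);
MPI_File_close(&fh);
MPI_Type_free(&filetype);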


Answer 2:

You will have to go through at least one extra step in MPI to do this.

The problem is that the most general of the gather/scatter routines, MPI_Scatterv and MPI_Gatherv, allow you to pass a "vector" (v) of counts/displacements, rather than just one count for the scatter or gather, but the types are all assumed to be the same. Here, there's no way around it; the memory layouts of the blocks are different, and so they have to be treated by different types. If there were only one difference between the blocks — some had a different number of columns, or some had a different number of rows — then just using different counts would suffice. But with different columns and rows, counts won't do it; you really need to be able to specify different types.

So what you really want is an often-discussed but never-implemented MPI_Scatterw (where w means vv; that is, both counts and types are vectors) routine. But no such thing exists. The closest you can get is the much more general MPI_Alltoallw call, which allows completely general sending and receiving of data; as the spec states, "The MPI_ALLTOALLW function generalizes several MPI functions by carefully selecting the input arguments. For example, by making all but one process have sendcounts(i) = 0, this achieves an MPI_SCATTERW function."
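To make the difference concrete, compare the two prototypes from the MPI standard — note the single sendtype in one versus the per-rank sendtypes[] array (with byte displacements) in the other:

int MPI_Scatterv(const void *sendbuf, const int sendcounts[], const int displs[],
                 MPI_Datatype sendtype, void *recvbuf, int recvcount,
                 MPI_Datatype recvtype, int root, MPI_Comm comm);

int MPI_Alltoallw(const void *sendbuf, const int sendcounts[], const int sdispls[],
                  const MPI_Datatype sendtypes[], void *recvbuf,
                  const int recvcounts[], const int rdispls[],
                  const MPI_Datatype recvtypes[], MPI_Comm comm);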

So you can do this with MPI_Alltoallw by having every process other than the one that originally has all the data (we'll assume it's rank 0 here) set all of its send counts to zero. All the receive counts will be zero as well, except for the first one on each task — the amount of data it will get from rank 0.

For process 0's send counts, we first have to define four different types (the 4 different subarray sizes); then the send counts are all 1, and the only piece left is figuring out the send displacements (which, unlike in scatterv, are in bytes here, because there is no single type one could use as a unit):

        /* 4 types of blocks -
         * blocksize*blocksize, (blocksize+1)*blocksize, blocksize*(blocksize+1), (blocksize+1)*(blocksize+1)
         */

        MPI_Datatype blocktypes[4];
        int subsizes[2];
        int starts[2] = {0,0};
        for (int i=0; i<2; i++) {
           subsizes[0] = blocksize+i;
           for (int j=0; j<2; j++) {
               subsizes[1] = blocksize+j;
               MPI_Type_create_subarray(2, globalsizes, subsizes, starts, MPI_ORDER_C, MPI_CHAR, &blocktypes[2*i+j]);
               MPI_Type_commit(&blocktypes[2*i+j]);
           }
        }

        /* now figure out the displacement and type of each processor's data */
        for (int proc=0; proc<size; proc++) {
            int row, col;
            rowcol(proc, blocks, &row, &col);

            sendcounts[proc] = 1;
            senddispls[proc] = (row*blocksize*globalsizes[1] + col*blocksize)*sizeof(char);

            int idx = typeIdx(row, col, blocks);
            sendtypes[proc] = blocktypes[idx];
        }
    }

    MPI_Alltoallw(globalptr, sendcounts, senddispls, sendtypes,
                  &(localdata[0][0]), recvcounts, recvdispls, recvtypes,
                  MPI_COMM_WORLD);

And this will work.

But the problem is that the Alltoallw function is so completely general that it's difficult for implementations to do much in the way of optimization; so I'd be surprised if this performed as well as a scatter of equal-sized blocks.

So another approach is to do something like a two-phase communication.

The simplest method follows after noticing that you can almost get all the data where it needs to go with a single MPI_Scatterv() call: in your example, if we operate in units of a single column vector with column=1 and rows=3 (the number of rows in most of the blocks of the domain), you can scatter almost all of the global data to the other processors. The processors each get 3 or 4 of these vectors, which distributes all of the data except the very last row of the global array, which can be handled by a simple second scatterv. That looks like this:

/* We're going to be operating mostly in units of a single column of a "normal" sized block.
 * There will need to be two vectors describing these columns; one in the context of the
 * global array, and one in the local results.
 */
MPI_Datatype vec, localvec;
MPI_Type_vector(blocksize, 1, localsizes[1], MPI_CHAR, &localvec);
MPI_Type_create_resized(localvec, 0, sizeof(char), &localvec);
MPI_Type_commit(&localvec);

MPI_Type_vector(blocksize, 1, globalsizes[1], MPI_CHAR, &vec);
MPI_Type_create_resized(vec, 0, sizeof(char), &vec);
MPI_Type_commit(&vec);

/* The originating process needs to allocate and fill the source array,
 * and then define types defining the array chunks to send, and
 * fill out senddispls and sendcounts.
 */
if (rank == 0) {
    /* create the vector type which will send one column of a "normal" sized-block */
    /* then all processors except those in the last row need to get blocksize*vec or (blocksize+1)*vec */
    /* will still have to do something to tidy up the last row of values */
    /* we need to make the type have extent of 1 char for scattering */
    for (int proc=0; proc<size; proc++) {
        int row, col;
        rowcol(proc, blocks, &row, &col);

        sendcounts[proc] = isLastCol(col, blocks) ? blocksize+1 : blocksize;
        senddispls[proc] = (row*blocksize*globalsizes[1] + col*blocksize);
    }
}

recvcounts = localsizes[1];
MPI_Scatterv(globalptr, sendcounts, senddispls, vec,
              &(localdata[0][0]), recvcounts, localvec, 0, MPI_COMM_WORLD);

MPI_Type_free(&localvec);
if (rank == 0)
    MPI_Type_free(&vec);

/* now we need to do one more scatter, scattering just the last row of data
 * just to the processors on the last row.
 * Here we recompute the send counts
 */
if (rank == 0) {
    for (int proc=0; proc<size; proc++) {
        int row, col;
        rowcol(proc, blocks, &row, &col);
        sendcounts[proc] = 0;
        senddispls[proc] = 0;

        if ( isLastRow(row,blocks) ) {
            sendcounts[proc] = blocksize;
            senddispls[proc] = (globalsizes[0]-1)*globalsizes[1]+col*blocksize;
            if ( isLastCol(col,blocks) )
                sendcounts[proc] += 1;
        }
    }
}

recvcounts = 0;
if ( isLastRow(myrow, blocks) ) {
    recvcounts = blocksize;
    if ( isLastCol(mycol, blocks) )
        recvcounts++;
}
MPI_Scatterv(globalptr, sendcounts, senddispls, MPI_CHAR,
              &(localdata[blocksize][0]), recvcounts, MPI_CHAR, 0, MPI_COMM_WORLD);

So far, so good. But it's a shame to have most of the processors sitting around doing nothing during that final, "cleanup" scatter.

So a nicer approach is to scatter all the rows in a first phase, and scatter that data amongst the columns in a second phase. Here we create new communicators, with each processor belonging to two new communicators — one representing the other processors in the same block row, and one representing those in the same block column. In the first step, the origin processor distributes all the rows of the global array to the other processors in the same column communicator, which can be done in a single scatterv. Then those processors, using a single scatterv and the same column data type as in the previous example, scatter the columns to each processor in the same block row. The result is two fairly simple scattervs distributing all the data:

/* create communicators which have processors with the same row or column in them*/
MPI_Comm colComm, rowComm;
MPI_Comm_split(MPI_COMM_WORLD, myrow, rank, &rowComm);
MPI_Comm_split(MPI_COMM_WORLD, mycol, rank, &colComm);

/* first, scatter the array by rows, with the processor in column 0 corresponding to each row
 * receiving the data */
if (mycol == 0) {
    int sendcounts[ blocks[0] ];
    int senddispls[ blocks[0] ];
    senddispls[0] = 0;

    for (int row=0; row<blocks[0]; row++) {
        /* each processor gets blocksize rows, each of size globalsizes[1]... */
        sendcounts[row] = blocksize*globalsizes[1];
        if (row > 0)
            senddispls[row] = senddispls[row-1] + sendcounts[row-1];
    }
    /* the last processor gets one more */
    sendcounts[blocks[0]-1] += globalsizes[1];

    /* allocate my rowdata */
    rowdata = allocchar2darray( sendcounts[myrow], globalsizes[1] );

    /* perform the scatter of rows */
    MPI_Scatterv(globalptr, sendcounts, senddispls, MPI_CHAR,
                  &(rowdata[0][0]), sendcounts[myrow], MPI_CHAR, 0, colComm);
}

/* Now, within each row of processors, we can scatter the columns.
 * We can do this as we did in the previous example; create a vector
 * (and localvector) type and scatter accordingly */
int locnrows = blocksize;
if ( isLastRow(myrow, blocks) )
    locnrows++;
MPI_Datatype vec, localvec;
MPI_Type_vector(locnrows, 1, globalsizes[1], MPI_CHAR, &vec);
MPI_Type_create_resized(vec, 0, sizeof(char), &vec);
MPI_Type_commit(&vec);

MPI_Type_vector(locnrows, 1, localsizes[1], MPI_CHAR, &localvec);
MPI_Type_create_resized(localvec, 0, sizeof(char), &localvec);
MPI_Type_commit(&localvec);

int sendcounts[ blocks[1] ];
int senddispls[ blocks[1] ];
if (mycol == 0) {
    for (int col=0; col<blocks[1]; col++) {
        sendcounts[col] = isLastCol(col, blocks) ? blocksize+1 : blocksize;
        senddispls[col] = col*blocksize;
    }
}
char *rowptr = (mycol == 0) ? &(rowdata[0][0]) : NULL;

MPI_Scatterv(rowptr, sendcounts, senddispls, vec,
              &(localdata[0][0]), sendcounts[mycol], localvec, 0, rowComm);

This one is simpler, and should be a relatively good balance between performance and robustness.

Running all three methods works:

bash-3.2$ mpirun -np 6 ./allmethods alltoall
Global array:
abcdefg
hijklmn
opqrstu
vwxyzab
cdefghi
jklmnop
qrstuvw
xyzabcd
efghijk
lmnopqr
Method - alltoall

Rank 0:
abc
hij
opq

Rank 1:
defg
klmn
rstu

Rank 2:
vwx
cde
jkl

Rank 3:
yzab
fghi
mnop

Rank 4:
qrs
xyz
efg
lmn

Rank 5:
tuvw
abcd
hijk
opqr

bash-3.2$ mpirun -np 6 ./allmethods twophasevecs
Global array:
abcdefg
hijklmn
opqrstu
vwxyzab
cdefghi
jklmnop
qrstuvw
xyzabcd
efghijk
lmnopqr
Method - two phase, vectors, then cleanup

Rank 0:
abc
hij
opq

Rank 1:
defg
klmn
rstu

Rank 2:
vwx
cde
jkl

Rank 3:
yzab
fghi
mnop

Rank 4:
qrs
xyz
efg
lmn

Rank 5:
tuvw
abcd
hijk
opqr

bash-3.2$ mpirun -np 6 ./allmethods twophaserowcol
Global array:
abcdefg
hijklmn
opqrstu
vwxyzab
cdefghi
jklmnop
qrstuvw
xyzabcd
efghijk
lmnopqr
Method - two phase - row, cols

Rank 0:
abc
hij
opq

Rank 1:
defg
klmn
rstu

Rank 2:
vwx
cde
jkl

Rank 3:
yzab
fghi
mnop

Rank 4:
qrs
xyz
efg
lmn

Rank 5:
tuvw
abcd
hijk
opqr

The code implementing these methods follows; you can set the block sizes to sizes more typical of your problem and run on a realistic number of processors to get some sense of which will be best for your application.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "mpi.h"

/* auxiliary routines, found at end of program */

char **allocchar2darray(int n, int m);
void freechar2darray(char **a);
void printarray(char **data, int n, int m);
void rowcol(int rank, const int blocks[2], int *row, int *col);
int isLastRow(int row, const int blocks[2]);
int isLastCol(int col, const int blocks[2]);
int typeIdx(int row, int col, const int blocks[2]);

/* first method - alltoallw */
void alltoall(const int myrow, const int mycol, const int rank, const int size,
                    const int blocks[2], const int blocksize, const int globalsizes[2], const int localsizes[2],
                    const char *const globalptr,  char **localdata) {
    /*
     * get send and receive counts ready for the alltoallw call;
     * everyone will be receiving just one block from proc 0,
     * and most procs will be sending nothing to anyone.
     */
    int sendcounts[ size ];
    int senddispls[ size ];
    MPI_Datatype sendtypes[size];
    int recvcounts[ size ];
    int recvdispls[ size ];
    MPI_Datatype recvtypes[size];

    for (int proc=0; proc<size; proc++) {
        recvcounts[proc] = 0;
        recvdispls[proc] = 0;
        recvtypes[proc] = MPI_CHAR;

        sendcounts[proc] = 0;
        senddispls[proc] = 0;
        sendtypes[proc] = MPI_CHAR;
    }
    recvcounts[0] = localsizes[0]*localsizes[1];
    recvdispls[0] = 0;


    /* The originating process needs to allocate and fill the source array,
     * and then define types describing the array chunks to send, and
     * fill out senddispls, sendcounts (1) and sendtypes.
     */
    if (rank == 0) {
        /* 4 types of blocks -
         * blocksize*blocksize, (blocksize+1)*blocksize, blocksize*(blocksize+1), (blocksize+1)*(blocksize+1)
         */
        MPI_Datatype blocktypes[4];
        int subsizes[2];
        int starts[2] = {0,0};
        for (int i=0; i<2; i++) {
           subsizes[0] = blocksize+i;
           for (int j=0; j<2; j++) {
               subsizes[1] = blocksize+j;
               MPI_Type_create_subarray(2, globalsizes, subsizes, starts, MPI_ORDER_C, MPI_CHAR, &blocktypes[2*i+j]);
               MPI_Type_commit(&blocktypes[2*i+j]);
           }
        }

        /* now figure out the displacement and type of each processor's data */
        for (int proc=0; proc<size; proc++) {
            int row, col;
            rowcol(proc, blocks, &row, &col);

            sendcounts[proc] = 1;
            senddispls[proc] = (row*blocksize*globalsizes[1] + col*blocksize)*sizeof(char);

            int idx = typeIdx(row, col, blocks);
            sendtypes[proc] = blocktypes[idx];
        }
    }

    MPI_Alltoallw(globalptr, sendcounts, senddispls, sendtypes,
                  &(localdata[0][0]), recvcounts, recvdispls, recvtypes,
                  MPI_COMM_WORLD);
}

/* second method: distribute almost all data using columns of size blocksize,
 * then clean up the last row with another scatterv */

void twophasevecs(const int myrow, const int mycol, const int rank, const int size,
                    const int blocks[2], const int blocksize, const int globalsizes[2], const int localsizes[2],
                    const char *const globalptr,  char **localdata) {
    int sendcounts[ size ];
    int senddispls[ size ];
    int recvcounts;

    for (int proc=0; proc<size; proc++) {
        sendcounts[proc] = 0;
        senddispls[proc] = 0;
    }

    /* We're going to be operating mostly in units of a single column of a "normal" sized block.
     * There will need to be two vectors describing these columns; one in the context of the
     * global array, and one in the local results.
     */
    MPI_Datatype vec, localvec;
    MPI_Type_vector(blocksize, 1, localsizes[1], MPI_CHAR, &localvec);
    MPI_Type_create_resized(localvec, 0, sizeof(char), &localvec);
    MPI_Type_commit(&localvec);

    MPI_Type_vector(blocksize, 1, globalsizes[1], MPI_CHAR, &vec);
    MPI_Type_create_resized(vec, 0, sizeof(char), &vec);
    MPI_Type_commit(&vec);

    /* The originating process needs to fill out senddispls and the
     * sendcounts (the number of column vectors each processor gets).
     */
    if (rank == 0) {
        /* create the vector type which will send one column of a "normal" sized-block */
        /* then all processors except those in the last row need to get blocksize*vec or (blocksize+1)*vec */
        /* will still have to do something to tidy up the last row of values */
        /* we need to make the type have extent of 1 char for scattering */
        for (int proc=0; proc<size; proc++) {
            int row, col;
            rowcol(proc, blocks, &row, &col);

            sendcounts[proc] = isLastCol(col, blocks) ? blocksize+1 : blocksize;
            senddispls[proc] = (row*blocksize*globalsizes[1] + col*blocksize);
        }
    }

    recvcounts = localsizes[1];
    MPI_Scatterv(globalptr, sendcounts, senddispls, vec,
                  &(localdata[0][0]), recvcounts, localvec, 0, MPI_COMM_WORLD);

    MPI_Type_free(&localvec);
    if (rank == 0)
        MPI_Type_free(&vec);

    /* now we need to do one more scatter, scattering just the last row of data
     * just to the processors on the last row.
     * Here we recompute the sendcounts
     */
    if (rank == 0) {
        for (int proc=0; proc<size; proc++) {
            int row, col;
            rowcol(proc, blocks, &row, &col);
            sendcounts[proc] = 0;
            senddispls[proc] = 0;

            if ( isLastRow(row,blocks) ) {
                sendcounts[proc] = blocksize;
                senddispls[proc] = (globalsizes[0]-1)*globalsizes[1]+col*blocksize;
                if ( isLastCol(col,blocks) )
                    sendcounts[proc] += 1;
            }
        }
    }

    recvcounts = 0;
    if ( isLastRow(myrow, blocks) ) {
        recvcounts = blocksize;
        if ( isLastCol(mycol, blocks) )
            recvcounts++;
    }
    MPI_Scatterv(globalptr, sendcounts, senddispls, MPI_CHAR,
                  &(localdata[blocksize][0]), recvcounts, MPI_CHAR, 0, MPI_COMM_WORLD);
}
/* third method: first distribute rows, then columns, each with a single scatterv */

void twophaseRowCol(const int myrow, const int mycol, const int rank, const int size,
                    const int blocks[2], const int blocksize, const int globalsizes[2], const int localsizes[2],
                    const char *const globalptr,  char **localdata) {
    char **rowdata;

    /* create communicators which have processors with the same row or column in them*/
    MPI_Comm colComm, rowComm;
    MPI_Comm_split(MPI_COMM_WORLD, myrow, rank, &rowComm);
    MPI_Comm_split(MPI_COMM_WORLD, mycol, rank, &colComm);

    /* first, scatter the array by rows, with the processor in column 0 corresponding to each row
     * receiving the data */
    if (mycol == 0) {
        int sendcounts[ blocks[0] ];
        int senddispls[ blocks[0] ];
        senddispls[0] = 0;

        for (int row=0; row<blocks[0]; row++) {
            /* each processor gets blocksize rows, each of size globalsizes[1]... */
            sendcounts[row] = blocksize*globalsizes[1];
            if (row > 0)
                senddispls[row] = senddispls[row-1] + sendcounts[row-1];
        }
        /* the last processor gets one more */
        sendcounts[blocks[0]-1] += globalsizes[1];

        /* allocate my rowdata */
        rowdata = allocchar2darray( sendcounts[myrow], globalsizes[1] );

        /* perform the scatter of rows */
        MPI_Scatterv(globalptr, sendcounts, senddispls, MPI_CHAR,
                      &(rowdata[0][0]), sendcounts[myrow], MPI_CHAR, 0, colComm);
    }

    /* Now, within each row of processors, we can scatter the columns.
     * We can do this as we did in the previous example; create a vector
     * (and localvector) type and scatter accordingly */
    int locnrows = blocksize;
    if ( isLastRow(myrow, blocks) )
        locnrows++;

    MPI_Datatype vec, localvec;
    MPI_Type_vector(locnrows, 1, globalsizes[1], MPI_CHAR, &vec);
    MPI_Type_create_resized(vec, 0, sizeof(char), &vec);
    MPI_Type_commit(&vec);

    MPI_Type_vector(locnrows, 1, localsizes[1], MPI_CHAR, &localvec);
    MPI_Type_create_resized(localvec, 0, sizeof(char), &localvec);
    MPI_Type_commit(&localvec);

    int sendcounts[ blocks[1] ];
    int senddispls[ blocks[1] ];
    if (mycol == 0) {
        for (int col=0; col<blocks[1]; col++) {
            sendcounts[col] = isLastCol(col, blocks) ? blocksize+1 : blocksize;
            senddispls[col] = col*blocksize;
        }
    }
    char *rowptr = (mycol == 0) ? &(rowdata[0][0]) : NULL;

    MPI_Scatterv(rowptr, sendcounts, senddispls, vec,
                  &(localdata[0][0]), sendcounts[mycol], localvec, 0, rowComm);

    MPI_Type_free(&localvec);
    MPI_Type_free(&vec);

    if (mycol == 0)
        freechar2darray(rowdata);

    MPI_Comm_free(&rowComm);
    MPI_Comm_free(&colComm);
}


int main(int argc, char **argv) {

    int rank, size;
    int blocks[2] = {0,0};
    const int blocksize=3;
    int globalsizes[2], localsizes[2];
    char **globaldata;
    char *globalptr = NULL;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0 && argc < 2) {
        fprintf(stderr,"Usage: %s method\n   Where method is one of: alltoall, twophasevecs, twophaserowcol\n", argv[0]);
        MPI_Abort(MPI_COMM_WORLD,1);
    }

    /* calculate sizes for a 2d grid of processors */
    MPI_Dims_create(size, 2, blocks);

    int myrow, mycol;
    rowcol(rank, blocks, &myrow, &mycol);

    /* create array sizes so that last block has 1 too many rows/cols */
    globalsizes[0] = blocks[0]*blocksize+1;
    globalsizes[1] = blocks[1]*blocksize+1;
    if (rank == 0) {
        globaldata = allocchar2darray(globalsizes[0], globalsizes[1]);
        globalptr = &(globaldata[0][0]);
        for (int i=0; i<globalsizes[0]; i++)
            for (int j=0; j<globalsizes[1]; j++)
                globaldata[i][j] = 'a'+(i*globalsizes[1] + j)%26;

        printf("Global array: \n");
        printarray(globaldata, globalsizes[0], globalsizes[1]);
    }

    /* the local chunk we'll be receiving */
    localsizes[0] = blocksize; localsizes[1] = blocksize;
    if ( isLastRow(myrow,blocks)) localsizes[0]++;
    if ( isLastCol(mycol,blocks)) localsizes[1]++;
    char **localdata = allocchar2darray(localsizes[0],localsizes[1]);

    if (!strcasecmp(argv[1], "alltoall")) {
        if (rank == 0) printf("Method - alltoall\n");
        alltoall(myrow, mycol, rank, size, blocks, blocksize, globalsizes, localsizes, globalptr, localdata);
    } else if (!strcasecmp(argv[1],"twophasevecs")) {
        if (rank == 0) printf("Method - two phase, vectors, then cleanup\n");
        twophasevecs(myrow, mycol, rank, size, blocks, blocksize, globalsizes, localsizes, globalptr, localdata);
    } else {
        if (rank == 0) printf("Method - two phase - row, cols\n");
        twophaseRowCol(myrow, mycol, rank, size, blocks, blocksize, globalsizes, localsizes, globalptr, localdata);
    }

    for (int proc=0; proc<size; proc++) {
        if (proc == rank) {
            printf("\nRank %d:\n", proc);
            printarray(localdata, localsizes[0], localsizes[1]);
        }
        MPI_Barrier(MPI_COMM_WORLD);
    }

    freechar2darray(localdata);
    if (rank == 0)
        freechar2darray(globaldata);

    MPI_Finalize();

    return 0;
}

char **allocchar2darray(int n, int m) {
    char **ptrs = malloc(n*sizeof(char *));
    ptrs[0] = malloc(n*m*sizeof(char));
    for (int i=0; i<n*m; i++)
        ptrs[0][i]='.';

    for (int i=1; i<n; i++)
        ptrs[i] = ptrs[i-1] + m;

    return ptrs;
}

void freechar2darray(char **a) {
    free(a[0]);
    free(a);
}

void printarray(char **data, int n, int m) {
    for (int i=0; i<n; i++) {
        for (int j=0; j<m; j++)
            putchar(data[i][j]);
        putchar('\n');
    }
}

void rowcol(int rank, const int blocks[2], int *row, int *col) {
    *row = rank/blocks[1];
    *col = rank % blocks[1];
}

int isLastRow(int row, const int blocks[2]) {
    return (row == blocks[0]-1);
}

int isLastCol(int col, const int blocks[2]) {
    return (col == blocks[1]-1);
}

int typeIdx(int row, int col, const int blocks[2]) {
    int lastrow = (row == blocks[0]-1);
    int lastcol = (col == blocks[1]-1);

    return lastrow*2 + lastcol;
}
Comments:

Thanks, this is great. I ran some tests distributing a 4000x4000 matrix over 32 processes, and the third method (which I thought would be the best) takes roughly 5 times longer than a single scatter. Any idea why?

I suspect the third method ends up best in terms of scaling, compared with the first, but it may need to run on a lot of processors before it beats having two fairly sizable operations. (And I'd expect those two operations to always be more expensive than a single uniform-size scatter.) If that's true, the second method might have similar performance, and the first might be faster? It's a bit hard to guess, and I admit I haven't done proper scaling tests.

OK, I haven't implemented the other two yet, so they may well do better. I had expected the third method to be no worse than 2x, since each step is no worse than a global scatter. Maybe there's overhead in creating the types and so on?

That's a good point. Type creation should be negligible; if you're doing this often, you can cache the types rather than creating them each time. I'll try profiling this and see where the time goes.
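Following up on the caching idea from that last comment — a minimal sketch (the static-cache pattern and function name are illustrative, not from the answer) of committing the four block types once and reusing them across repeated scatters:

/* Build the 4 committed block types once; subsequent calls reuse them.
 * Assumes blocksize and globalsizes do not change between calls. */
static MPI_Datatype cached_blocktypes[4];
static int blocktypes_ready = 0;

static void get_blocktypes(int blocksize, const int globalsizes[2]) {
    if (blocktypes_ready) return;      /* already built on a previous call */
    int subsizes[2], starts[2] = {0, 0};
    for (int i=0; i<2; i++) {
        subsizes[0] = blocksize+i;
        for (int j=0; j<2; j++) {
            subsizes[1] = blocksize+j;
            MPI_Type_create_subarray(2, globalsizes, subsizes, starts,
                                     MPI_ORDER_C, MPI_CHAR, &cached_blocktypes[2*i+j]);
            MPI_Type_commit(&cached_blocktypes[2*i+j]);
        }
    }
    blocktypes_ready = 1;
}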
