Why Is the Execution Time of the Parallel and Serial Versions Almost the Same?
Posted: 2018-03-17 21:20:08

[Question]: I followed this PI Calculation example in C to compare the execution time of a serial and a parallel version. I used gettimeofday() to measure the execution time, but the two times come out roughly the same. Is there something wrong with my code, or with the way I am measuring the time?
My code is as follows:
Serial version
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <sys/time.h>   /* gettimeofday(), struct timeval */
#include <math.h>

int main()
{
    struct timeval tvalBefore, tvalAfter;
    gettimeofday(&tvalBefore, NULL);

#define sqr(x) ((x)*(x))
    long random(void);
    double x_coord, y_coord, pi, r;
    int score, n;
    unsigned int cconst;    /* must be 4 bytes in size */
    int darts = 5000000;

    if (sizeof(cconst) != 4) {
        printf("Wrong data size for cconst variable!\nQuitting.\n");
        exit(1);
    }

    /* used to scale random numbers into [0, 1) */
    cconst = 2 << (31 - 1);
    score = 0;

    for (n = 1; n <= darts; n++) {
        r = (double)random() / cconst;
        x_coord = (2.0 * r) - 1.0;
        r = (double)random() / cconst;
        y_coord = (2.0 * r) - 1.0;
        if ((sqr(x_coord) + sqr(y_coord)) <= 1.0)
            score++;
    }

    pi = 4.0 * (double)score / (double)darts;

    gettimeofday(&tvalAfter, NULL);
    long tm = (tvalAfter.tv_sec - tvalBefore.tv_sec) * 1000000L
              + tvalAfter.tv_usec - tvalBefore.tv_usec;
    printf("PI = %lf\nSerial execution time: %ld microseconds\n", pi, tm);
    return 0;
}
Parallel version
/**********************************************************************
* FILE: mpi_pi_reduce.c
* OTHER FILES: dboard.c
* DESCRIPTION:
* MPI pi Calculation Example - C Version
* Collective Communication example:
* This program calculates pi using a "dartboard" algorithm. See
* Fox et al.(1988) Solving Problems on Concurrent Processors, vol.1
* page 207. All processes contribute to the calculation, with the
* master averaging the values for pi. This version uses mpc_reduce to
* collect results
* AUTHOR: Blaise Barney. Adapted from Ros Leibensperger, Cornell Theory
* Center. Converted to MPI: George L. Gusciora, MHPCC (1/95)
* LAST REVISED: 06/13/13 Blaise Barney
**********************************************************************/
#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
void srandom (unsigned seed);
double dboard (int darts);
#define DARTS 50000 /* number of throws at dartboard */
#define ROUNDS 100 /* number of times "darts" is iterated */
#define MASTER 0 /* task ID of master task */
int main (int argc, char *argv[])
{
    struct timeval tvalBefore, tvalAfter;
    gettimeofday(&tvalBefore, NULL);

    double homepi,   /* value of pi calculated by current task */
           pisum,    /* sum of tasks' pi values */
           pi,       /* average of pi after "darts" is thrown */
           avepi;    /* average pi value for all iterations */
    int taskid,      /* task ID - also used as seed number */
        numtasks,    /* number of tasks */
        rc,          /* return code */
        i;
    MPI_Status status;

    /* Obtain number of tasks and task ID */
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
    MPI_Comm_rank(MPI_COMM_WORLD, &taskid);
    printf("MPI task %d has started...\n", taskid);

    /* Set seed for random number generator equal to task ID */
    srandom(taskid);

    avepi = 0;
    for (i = 0; i < ROUNDS; i++) {
        /* All tasks calculate pi using dartboard algorithm */
        homepi = dboard(DARTS);

        /* Use MPI_Reduce to sum values of homepi across all tasks
         * Master will store the accumulated value in pisum
         * - homepi is the send buffer
         * - pisum is the receive buffer (used by the receiving task only)
         * - the size of the message is sizeof(double)
         * - MASTER is the task that will receive the result of the reduction
         *   operation
         * - MPI_SUM is a pre-defined reduction function (double-precision
         *   floating-point vector addition). Must be declared extern.
         * - MPI_COMM_WORLD is the group of tasks that will participate.
         */
        rc = MPI_Reduce(&homepi, &pisum, 1, MPI_DOUBLE, MPI_SUM,
                        MASTER, MPI_COMM_WORLD);

        /* Master computes average for this iteration and all iterations */
        if (taskid == MASTER) {
            pi = pisum / numtasks;
            avepi = ((avepi * i) + pi) / (i + 1);
            //printf("   After %8d throws, average value of pi = %10.8f\n", (DARTS * (i + 1)), avepi);
        }
    }

    if (taskid == MASTER) {
        gettimeofday(&tvalAfter, NULL);
        long tm = (tvalAfter.tv_sec - tvalBefore.tv_sec) * 1000000L
                  + tvalAfter.tv_usec - tvalBefore.tv_usec;
        printf("\nReal value of PI: 3.1415926535897 \n");
        printf("Parallel execution time: %ld microseconds\n", tm);
    }

    MPI_Finalize();
    return 0;
}
/**************************************************************************
* subroutine dboard
* DESCRIPTION:
* Used in pi calculation example codes.
* See mpi_pi_send.c and mpi_pi_reduce.c
* Throw darts at board. Done by generating random numbers
* between 0 and 1 and converting them to values for x and y
* coordinates and then testing to see if they "land" in
* the circle." If so, score is incremented. After throwing the
* specified number of darts, pi is calculated. The computed value
* of pi is returned as the value of this function, dboard.
*
* Explanation of constants and variables used in this function:
* darts = number of throws at dartboard
* score = number of darts that hit circle
* n = index variable
* r = random number scaled between 0 and 1
* x_coord = x coordinate, between -1 and 1
* x_sqr = square of x coordinate
* y_coord = y coordinate, between -1 and 1
* y_sqr = square of y coordinate
* pi = computed value of pi
****************************************************************************/
double dboard(int darts)
{
#define sqr(x) ((x)*(x))
    long random(void);
    double x_coord, y_coord, pi, r;
    int score, n;
    unsigned int cconst;  /* must be 4-bytes in size */

    /*************************************************************************
     * The cconst variable must be 4 bytes. We check this and bail if it is
     * not the right size
     ************************************************************************/
    if (sizeof(cconst) != 4) {
        printf("Wrong data size for cconst variable in dboard routine!\n");
        printf("See comments in source file. Quitting.\n");
        exit(1);
    }

    /* 2 bit shifted to MAX_RAND later used to scale random number between 0 and 1 */
    cconst = 2 << (31 - 1);
    score = 0;

    /* "throw darts at board" */
    for (n = 1; n <= darts; n++) {
        /* generate random numbers for x and y coordinates */
        r = (double)random() / cconst;
        x_coord = (2.0 * r) - 1.0;
        r = (double)random() / cconst;
        y_coord = (2.0 * r) - 1.0;
        /* if dart lands in circle, increment score */
        if ((sqr(x_coord) + sqr(y_coord)) <= 1.0)
            score++;
    }

    /* calculate pi */
    pi = 4.0 * (double)score / (double)darts;

    return(pi);
}
I wrote and ran the code on a cluster. I compiled and ran it with:
mpicc serial.c -o serial.o
mpicc parallel.c -o parallel.o
mpirun -n 1 serial.o
mpirun -np 4 -pernode parallel.o
The results are:
# serial
PI = 3.142431
Serial execution time: 262699 microseconds
# parallel
MPI task 1 has started...
MPI task 0 has started...
MPI task 3 has started...
MPI task 2 has started...
Real value of PI: 3.1415926535897
Parallel execution time: 294984 microseconds
[Comments]:
The prototype for the random() function is in stdlib.h, so providing your own prototype in the code is a bad idea.

The prototype for the srandom() function is in stdlib.h, so providing your own prototype in the code is a bad idea.
I would first ask why I should expect the parallel version to be faster at all. Does the parallel computation get to use CPU resources that would otherwise sit idle? Then I would check whether the work is actually distributed across the CPU cores.
@user3629249 I am not familiar with C at the moment. Thanks for the suggestions; I will revise the code.
You could try computing pi as 355.0 / 113.0: quick, fast, and accurate to several decimal places.
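(For reference: 355/113 = 3.1415929..., while pi = 3.1415926..., so that approximation agrees with pi through the sixth decimal place.)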
[Answer 1]:
Where is the parallelization?

The serial version computes pi over 5,000,000 iterations.

In the parallel version, every task also performs 50,000 * 100 iterations, and the results are then averaged.

So the parallel version may be "statistically more accurate", but it will not be any faster.

On top of that, you call MPI_Reduce() once per round, when I believe a single call is all that is needed.

Frankly, I am even surprised the "parallel" version is not much slower.

If you want parallelization to make the run faster, each task should compute only 5,000,000 / numtasks iterations, starting at 5,000,000 * taskid / numtasks, and then you should issue a single MPI_Reduce().
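A minimal sketch of that scheme might look like the following (illustrative only; the names TOTAL_DARTS, my_darts and my_score are placeholders, not code from the original example). Each task throws only its own share of the 5,000,000 darts, seeded by its rank, and a single MPI_Reduce() combines the per-task hit counts at the end:

#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>

#define sqr(x) ((x)*(x))
#define TOTAL_DARTS 5000000L

int main(int argc, char *argv[])
{
    int taskid, numtasks;
    long n, my_darts, my_score = 0, total_score = 0;
    double cconst = 2147483648.0;   /* 2^31, scales random() into [0, 1) */

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
    MPI_Comm_rank(MPI_COMM_WORLD, &taskid);

    srandom((unsigned)taskid + 1);          /* a different stream per task */
    my_darts = TOTAL_DARTS / numtasks;      /* each task does only its share */

    for (n = 0; n < my_darts; n++) {
        double x = 2.0 * (random() / cconst) - 1.0;
        double y = 2.0 * (random() / cconst) - 1.0;
        if (sqr(x) + sqr(y) <= 1.0)
            my_score++;
    }

    /* a single reduction for the whole run instead of one per round */
    MPI_Reduce(&my_score, &total_score, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

    if (taskid == 0) {
        double pi = 4.0 * (double)total_score / (double)(my_darts * numtasks);
        printf("PI = %lf (using %d tasks)\n", pi, numtasks);
    }

    MPI_Finalize();
    return 0;
}

With this split, each of the 4 tasks throws only a quarter of the darts and only one reduction goes over the network, so the wall-clock time should actually shrink as tasks are added.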
[Comments]:
Right!!! I simply ran the parallel version from the example and did not notice that. And thanks for the advice about MPI_Reduce().