Why does my parallel code using OpenMP atomic take longer than the serial code?

Posted 2023-02-22

【Original title】Why my parallel code using openMP atomic takes a longer time than serial code? 【Posted】2020-11-13 15:00:39 【Question】:

A snippet of my serial code is shown below.

 Program main
  use omp_lib
  Implicit None
   
  Integer :: i, my_id
  Real(8) :: t0, t1, t2, t3, a = 0.0d0

  !$ t0 = omp_get_wtime()
  Call CPU_time(t2)
  ! ------------------------------------------ !

    Do i = 1, 100000000
      a = a + Real(i)
    End Do

  ! ------------------------------------------ !
  Call CPU_time(t3)
  !$ t1 = omp_get_wtime()
  ! ------------------------------------------ !

  Write (*,*) "a = ", a
  Write (*,*) "The wall time is ", t1-t0, "s"
  Write (*,*) "The CPU time is ", t3-t2, "s"
End Program main

Elapsed time:

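Using the OpenMP directives do and atomic, I converted the serial code into parallel code. However, the parallel program is slower than the serial one, and I do not understand why. My parallel code snippet follows: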

Program main
  use omp_lib
  Implicit None
    
  Integer, Parameter :: n_threads = 8
  Integer :: i, my_id
  Real(8) :: t0, t1, t2, t3, a = 0.0d0
 
  !$ t0 = omp_get_wtime()
  Call CPU_time(t2)
  ! ------------------------------------------ !

  !$OMP Parallel Num_threads(n_threads) shared(a)
  
   !$OMP Do 
     Do i = 1, 100000000
       !$OMP Atomic
       a = a + Real(i)
     End Do
   !$OMP End Do
  
  !$OMP End Parallel
  
  ! ------------------------------------------ !
  Call CPU_time(t3)
  !$ t1 = omp_get_wtime()
  ! ------------------------------------------ !

  Write (*,*) "a = ", a
  Write (*,*) "The wall time is ", t1-t0, "s"
  Write (*,*) "The CPU time is ", t3-t2, "s"
End Program main

Elapsed time:

So my question is: why does my parallel code using OpenMP atomic take longer than the serial code?


【Answer 1】:

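You apply an atomic operation to the same variable in every loop iteration, and that variable carries a dependency from one iteration to the next. Compared with the sequential version, this naturally adds overhead (e.g., synchronization, serialization cost, and extra CPU cycles). You will also likely suffer many cache misses, because the threads keep invalidating each other's cached copy of the variable.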

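Your code is a textbook case for a reduction on the variable a (i.e., !$omp parallel do reduction(+:a)) rather than an atomic operation. With a reduction, each thread gets a private copy of a, and at the end of the parallel region the threads combine their copies (using the + operator) into a single value that is propagated to the a of the main thread.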
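As a minimal sketch of that suggestion, the parallel program from the question could be rewritten as follows. The program name main_reduction is made up for illustration, the timing is reduced to omp_get_wtime only, and the code assumes it is compiled with OpenMP enabled (for example gfortran -fopenmp).

```fortran
Program main_reduction
  use omp_lib
  Implicit None

  Integer, Parameter :: n_threads = 8
  Integer :: i
  Real(8) :: t0, t1, a

  a = 0.0d0
  t0 = omp_get_wtime()

  ! Each thread accumulates into its own private copy of a.
  ! The private copies are combined with + into the shared a
  ! when the loop finishes, so no per-iteration synchronization is needed.
  !$OMP Parallel Do Num_threads(n_threads) Reduction(+:a)
  Do i = 1, 100000000
    a = a + Real(i)
  End Do
  !$OMP End Parallel Do

  t1 = omp_get_wtime()

  Write (*,*) "a = ", a
  Write (*,*) "The wall time is ", t1-t0, "s"
End Program main_reduction
```

The only change to the loop body is that the Atomic directive is gone. The Reduction clause replaces both it and the explicit shared(a).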

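You can find a more detailed answer on the differences between atomic and reduction in this SO thread. That thread even includes a code whose atomic version, like yours, is far slower than its sequential counterpart (about 20x slower). That case is even worse than yours (20x vs. 10x).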
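To illustrate the difference, here is a hand-written sketch of roughly what a reduction amounts to conceptually: each thread sums into a private variable and touches the shared variable only once. The names main_partial and a_local are hypothetical, and this is not a claim about how any particular OpenMP runtime implements Reduction.

```fortran
Program main_partial
  use omp_lib
  Implicit None

  Integer, Parameter :: n_threads = 8
  Integer :: i
  Real(8) :: a, a_local   ! a_local: hypothetical per-thread partial sum

  a = 0.0d0

  !$OMP Parallel Num_threads(n_threads) Shared(a) Private(a_local)
  a_local = 0.0d0

  ! No synchronization inside the loop: every thread works on its own a_local.
  !$OMP Do
  Do i = 1, 100000000
    a_local = a_local + Real(i)
  End Do
  !$OMP End Do Nowait

  ! One atomic update per thread instead of one per iteration.
  !$OMP Atomic
  a = a + a_local
  !$OMP End Parallel

  Write (*,*) "a = ", a
End Program main_partial
```

With 8 threads, the shared variable is updated 8 times instead of 100000000 times. This suggests the slowdown in the question comes from how often the atomic runs, not from the atomic operation as such.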
