CUDA 并行扫描算法共享内存竞争条件

Posted

技术标签:

【中文标题】CUDA 并行扫描算法共享内存竞争条件【英文标题】:CUDA parallel scan algorithm shared memory race condition 【发布时间】:2022-01-18 04:17:32 【问题描述】:

我正在阅读“大规模并行处理器编程”(第 3 版)一书,其中介绍了 Kogge-Stone 并行扫描算法的实现。 该算法旨在由单个块运行(这只是初步简化),以下是实现。

// X is the input array, Y is the output array, InputSize is the size of the input array
__global__ void Kogge_Stone_scan_kernel(float* X, float* Y, int InputSize) 
    __shared__ float XY[SECTION_SIZE]; // SECTION_SIZE is basically blockDim.x
    
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < InputSize)
        XY[threadIdx.x] = X[i];

    for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) 
        __syncthreads();
        if (threadIdx.x >= stride)
            XY[threadIdx.x] += XY[threadIdx.x - stride]; // Race condition here?
    

    Y[i] = XY[threadIdx.x];

不管算法的工作方式如何,我对这条线有点困惑 XY[threadIdx.x] += XY[threadIdx.x - stride]。说stride = 1,那么threadIdx.x = 6的线程就会执行XY[6] += XY[5]的操作。但是,同时带有threadIdx.x = 5 的线程将执行XY[5] += XY[4]。问题是:是否可以保证线程6 将读取XY[5] 的原始值而不是XY[5] + XY[4]?。请注意,这不仅限于锁步执行可能会阻止竞态条件的单个 warp。

谢谢

【问题讨论】:

【参考方案1】:

是否保证线程 6 将读取 XY[5] 的原始值而不是 XY[5] + XY[4]

不,CUDA 不保证线程执行顺序(锁步或其他),代码中也没有任何东西可以解决这个问题。

顺便说一句,cuda-memcheckcompute-sanitizer 非常擅长识别共享内存竞争条件:

$ cat t2.cu
const int SECTION_SIZE = 256;
__global__ void Kogge_Stone_scan_kernel(float* X, float* Y, int InputSize) 
    __shared__ float XY[SECTION_SIZE]; // SECTION_SIZE is basically blockDim.x

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < InputSize)
        XY[threadIdx.x] = X[i];

    for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) 
        __syncthreads();
        if (threadIdx.x >= stride)
            XY[threadIdx.x] += XY[threadIdx.x - stride]; // Race condition here?
    

    Y[i] = XY[threadIdx.x];


int main()
  const int nblk = 1;
  const int sz = nblk*SECTION_SIZE;
  const int bsz = sz*sizeof(float);
  float *X, *Y;
  cudaMallocManaged(&X, bsz);
  cudaMallocManaged(&Y, bsz);
  Kogge_Stone_scan_kernel<<<nblk, SECTION_SIZE>>>(X, Y, sz);
  cudaDeviceSynchronize();

$ nvcc -o t2 t2.cu -lineinfo
$ cuda-memcheck ./t2
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors
$ cuda-memcheck --tool racecheck ./t2
========= CUDA-MEMCHECK
========= ERROR: Race reported between Read access at 0x000001a0 in /home/user2/misc/junk/t2.cu:12:Kogge_Stone_scan_kernel(float*, float*, int)
=========     and Write access at 0x000001c0 in /home/user2/misc/junk/t2.cu:12:Kogge_Stone_scan_kernel(float*, float*, int) [6152 hazards]
=========
========= RACECHECK SUMMARY: 1 hazard displayed (1 error, 0 warnings)
$

正如您可能已经猜到的那样,您可以通过分解违规行中的读取和写入操作来解决此问题,中间有一个障碍:

$ cat t2.cu
const int SECTION_SIZE = 256;
__global__ void Kogge_Stone_scan_kernel(float* X, float* Y, int InputSize) 
    __shared__ float XY[SECTION_SIZE]; // SECTION_SIZE is basically blockDim.x

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < InputSize)
        XY[threadIdx.x] = X[i];

    for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) 
        __syncthreads();
        float val;
        if (threadIdx.x >= stride)
            val = XY[threadIdx.x - stride];
        __syncthreads();
        if (threadIdx.x >= stride)
            XY[threadIdx.x] += val;
    

    Y[i] = XY[threadIdx.x];


int main()
  const int nblk = 1;
  const int sz = nblk*SECTION_SIZE;
  const int bsz = sz*sizeof(float);
  float *X, *Y;
  cudaMallocManaged(&X, bsz);
  cudaMallocManaged(&Y, bsz);
  Kogge_Stone_scan_kernel<<<nblk, SECTION_SIZE>>>(X, Y, sz);
  cudaDeviceSynchronize();

$ nvcc -o t2 t2.cu -lineinfo
$ cuda-memcheck --tool racecheck ./t2
========= CUDA-MEMCHECK
========= RACECHECK SUMMARY: 0 hazards displayed (0 errors, 0 warnings)
$

【讨论】:

以上是关于CUDA 并行扫描算法共享内存竞争条件的主要内容,如果未能解决你的问题,请参考以下文章

银行冲突CUDA共享内存?

nvidia cuda访问gpu共享内存

CUDA:啥时候使用共享内存,啥时候依赖 L1 缓存?

cuda 共享内存和块执行调度

cuda学习3-共享内存和同步

两个进程可以共享相同的 GPU 内存吗? (CUDA)