Several Parallel Computing Patterns in CUDA

Posted by ithiker


Background

In CUDA programming, the thread space and the input space are usually different:

  • The thread space can be up to 5D as used here: a 2D grid of 3D blocks (modern GPUs also allow a 3D grid, but this article assumes a 2D grid throughout)
  • The input space is usually 1D, and sometimes 2D (e.g. a matrix)

This means we have to map inputs onto threads. For example, in 《CUDA中线程与数据的对应关系》 I mainly discussed this mapping for different combinations of thread count and data size.

Given a 1D input array, the correspondence between thread index and input index under different thread spaces is as follows:

  • 1 block, N threads (1D): the input index is threadIdx.x
  • 1 block, M x N threads (2D): the input index is threadIdx.y * blockDim.x + threadIdx.x
  • N blocks (1D), M threads (1D): the input index is blockIdx.x * blockDim.x + threadIdx.x

Only a few cases are listed here; the short sketch below shows how these index formulas look in kernel code.
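
As a quick illustration, here is a minimal sketch (the kernel names and placeholder bodies are mine, purely for illustration) of how each launch configuration above turns into a global input index:

// Illustrative kernels for a 1D input array of length `count`;
// only the index computation matters here.
__global__ void oneBlock1D(float* in, int count)      // launched as <<<1, N>>>
{
    int i = threadIdx.x;
    if (i < count) { /* use in[i] */ }
}

__global__ void oneBlock2D(float* in, int count)      // launched as <<<1, dim3(N, M)>>>
{
    int i = threadIdx.y * blockDim.x + threadIdx.x;
    if (i < count) { /* use in[i] */ }
}

__global__ void manyBlocks1D(float* in, int count)    // launched as <<<N, M>>>
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count) { /* use in[i] */ }
}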

The parallel computing patterns in CUDA build on the correspondences above. Common patterns include:

  • Map
  • Gather
  • Scatter
  • Transpose
  • Reduce
  • Scan

The following sections introduce each of these patterns.

Map

Map applies a function to every element of an input array and then updates the array with the results:

For example, to add 10 to every element of an array, we can write code like this:

#include <curand.h>
#include <cuda.h>
#include <iostream>
#include <stdio.h>
#include <time.h>

__global__ void addTen(float* d, int count)
{
    int threadsPerBlock = blockDim.x * blockDim.y * blockDim.z;
    int threadPosInBlock = threadIdx.x + blockDim.x * threadIdx.y + blockDim.x * blockDim.y * threadIdx.z;
    int blockPosInGrid = blockIdx.x + gridDim.x * blockIdx.y; // the grid is 2D in this example

    int threadIndex = blockPosInGrid * threadsPerBlock + threadPosInBlock;
    if (threadIndex < count)
        d[threadIndex] += 10;
}

int main()
{
    curandGenerator_t gen;
    curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_MTGP32);
    curandSetPseudoRandomGeneratorSeed(gen, time(0));

    const int count = 123456;
    const int size = count * sizeof(float);
    float *d;
    float h[count];
    cudaMalloc(&d, size);
    curandGenerateUniform(gen, d, count);   // fill d with uniform random numbers in (0, 1]

    dim3 block(8, 8, 8);                    // 512 threads per block
    dim3 grid(16, 16);                      // 256 blocks, more than enough for count elements

    addTen<<<grid, block>>>(d, count);

    cudaMemcpy(h, d, size, cudaMemcpyDeviceToHost);

    cudaFree(d);
    curandDestroyGenerator(gen);

    for (int i = 0; i < 100; ++i)
        std::cout << h[i] << std::endl;

    getchar();

    return 0;
}

The code above generates uniformly distributed random numbers between 0 and 1, and the kernel then adds 10 to every element of the array. Note that it uses CUDA's cuRAND library, so you need to link against curand when compiling:

nvcc -o Map Map.cu -lcurand
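
For comparison (this is my own sketch, not part of the original example), the same map pattern can be written with Thrust's thrust::transform, which hides the index mapping entirely:

#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/functional.h>
#include <iostream>

int main()
{
    using namespace thrust::placeholders;

    thrust::device_vector<float> d(8, 1.0f);           // eight elements, all 1.0f
    // Map: add 10 to every element, writing the results back in place
    thrust::transform(d.begin(), d.end(), d.begin(), _1 + 10.0f);
    for (int i = 0; i < 8; ++i)
        std::cout << d[i] << std::endl;                 // prints 11 eight times
    return 0;
}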

Gather

Gather “gathers” information from other parts of memory. Scatter “scatters” information to other parts of memory.[1]

Gather extracts information from data already in memory, for example producing one output element from several elements of a given array:

A typical example is a stencil computation:

With radius = 3, each output element is the sum of 7 neighboring elements. An implementation looks like this:

#include <thrust/device_vector.h>
#include <thrust/host_vector.h>

#define RADIUS      3
#define BLOCKSIZE   32

/*******************/
/* iDivUp FUNCTION */
/*******************/
int iDivUp(int a, int b) { return ((a % b) != 0) ? (a / b + 1) : (a / b); }

/********************/
/* CUDA ERROR CHECK */
/********************/
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
   if (code != cudaSuccess)
   {
      fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
      if (abort) exit(code);
   }
}

/**********/
/* KERNEL */
/**********/
__global__ void moving_average(unsigned int *in, unsigned int *out, int N)
{
    __shared__ unsigned int temp[BLOCKSIZE + 2 * RADIUS];

    int gindexx = threadIdx.x + blockIdx.x * blockDim.x;

    int lindexx = threadIdx.x + RADIUS;

    // --- Read input elements into shared memory
    temp[lindexx] = (gindexx < N) ? in[gindexx] : 0;
    if (threadIdx.x < RADIUS)
    {
        // --- Load the halo elements on both sides of the block
        temp[threadIdx.x] = (((gindexx - RADIUS) >= 0) && (gindexx < N)) ? in[gindexx - RADIUS] : 0;
        temp[threadIdx.x + (RADIUS + BLOCKSIZE)] = ((gindexx + BLOCKSIZE) < N) ? in[gindexx + BLOCKSIZE] : 0;
    }

    __syncthreads();

    // --- Apply the stencil: sum of the 2 * RADIUS + 1 neighboring elements
    unsigned int result = 0;
    for (int offset = -RADIUS; offset <= RADIUS; offset++)
        result += temp[lindexx + offset];

    // --- Store the result (guard threads past the end of the array)
    if (gindexx < N)
        out[gindexx] = result;
}

/********/
/* MAIN */
/********/
int main()
{
    const int N = 55 + 2 * RADIUS;

    const unsigned int constant = 4;

    thrust::device_vector<unsigned int> d_in(N, constant);
    thrust::device_vector<unsigned int> d_out(N);

    moving_average<<<iDivUp(N, BLOCKSIZE), BLOCKSIZE>>>(thrust::raw_pointer_cast(d_in.data()), thrust::raw_pointer_cast(d_out.data()), N);
    gpuErrchk(cudaPeekAtLastError());
    gpuErrchk(cudaDeviceSynchronize());

    thrust::host_vector<unsigned int> h_out = d_out;

    for (int i = 0; i < N; i++)
        printf("Element i = %i; h_out = %u\n", i, h_out[i]);

    return 0;
}
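
Thrust also exposes this pattern directly as thrust::gather, where output[i] = input[map[i]]. A minimal sketch (the data and map here are made up for illustration):

#include <thrust/gather.h>
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <iostream>

int main()
{
    int input[8] = {10, 20, 30, 40, 50, 60, 70, 80};
    int map[8]   = {7, 6, 5, 4, 3, 2, 1, 0};           // gather in reverse order
    thrust::device_vector<int> d_input(input, input + 8);
    thrust::device_vector<int> d_map(map, map + 8);
    thrust::device_vector<int> d_output(8);
    // output[i] = input[map[i]]
    thrust::gather(thrust::device, d_map.begin(), d_map.end(), d_input.begin(), d_output.begin());
    for (int i = 0; i < 8; ++i)
        std::cout << d_output[i] << std::endl;          // prints 80 70 ... 10
    return 0;
}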


Scatter

Scatter is exactly the opposite of Gather. Here is a good explanation of scatter:

When each thread needs to write its output in a different or multiple places we call this the scatter operation.

That is, when each element needs to be written to a different location, or to multiple locations, the operation is a scatter.
Following [3], here is a scatter example implemented with Thrust:

#include <thrust/scatter.h>
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <vector>
#include <iostream>

int main()
{
    int values[10] = {1, 0, 1, 0, 1, 0, 1, 0, 1, 0};
    thrust::device_vector<int> d_values(values, values + 10);
    // scatter all even indices into the first half of the range, and odd indices vice versa
    int map[10] = {0, 5, 1, 6, 2, 7, 3, 8, 4, 9};
    thrust::device_vector<int> d_map(map, map + 10);
    thrust::device_vector<int> d_output(10);
    thrust::scatter(thrust::device, d_values.begin(), d_values.end(), d_map.begin(), d_output.begin());
    std::vector<int> result(d_output.size());
    thrust::copy(d_output.begin(), d_output.end(), result.begin());
    for (auto& e : result)
        std::cout << e << std::endl;
}



Thrust ships with CUDA, so no extra library needs to be linked at compile time; the following is enough:

nvcc -std=c++11 -o Scatter Scatter.cu
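
Written by hand instead of with Thrust, a scatter kernel is essentially out[map[i]] = values[i]. The following is a sketch of my own (kernel and parameter names are illustrative); it assumes the map is a permutation, so no two threads write the same location:

// Each thread writes its value to the position given by the map.
__global__ void scatterKernel(const int* values, const int* map, int* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[map[i]] = values[i];
}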

Transpose

Transpose usually involves transposing a matrix.

For example, suppose we have an image whose pixel values are stored row by row:

When we place it in GPU memory, the layout looks like this:

However, when using the image we may read it column by column: first column 1, 5; then column 2, 6; ...; and finally 4, 8.

According to the memory access optimization rule that consecutively accessed elements should be adjacent in memory for best performance, the GPU memory layout we actually want is:

Transpose is a special kind of scatter; implementing it only requires a small change to the code above:

#include <thrust/scatter.h>
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <vector>
#include <iostream>

int main()
{
    int values[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    thrust::device_vector<int> d_values(values, values + 8);

    // scatter the first half to the even destination indices and the second half to the odd destination indices
    int map[8] = {0, 2, 4, 6, 1, 3, 5, 7};
    thrust::device_vector<int> d_map(map, map + 8);
    thrust::device_vector<int> d_output(8);
    thrust::scatter(thrust::device, d_values.begin(), d_values.end(), d_map.begin(), d_output.begin());
    std::vector<int> result(d_output.size());
    thrust::copy(d_output.begin(), d_output.end(), result.begin());
    for (auto& e : result)
        std::cout << e << std::endl;
}
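
More generally, transposing an R x C row-major matrix is the scatter out[c * R + r] = in[r * C + c]. A hand-written sketch of this (again, the names are mine and purely illustrative):

// Each thread moves one element to its transposed position.
__global__ void transposeScatter(const int* in, int* out, int rows, int cols)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < rows * cols)
    {
        int r = idx / cols;
        int c = idx % cols;
        out[c * rows + r] = in[r * cols + c];   // scatter into column-major order
    }
}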
    



Reduce

Reduce uses a divide-and-conquer approach; take the following summation as an example:

In CUDA it can be implemented with the following parallel algorithm:

Initially the number of threads is half the data size, and in each iteration every active thread adds two numbers; after each iteration the remaining data size is halved again, so the number of threads needed is also halved.
The corresponding kernel code is:

#include <cuda.h>
#include <iostream>
#include <stdio.h>

__global__ void sum(int* d)
{
    int tid = threadIdx.x;

    for (int threadCount = blockDim.x, stepSize = 1; threadCount > 0; threadCount /= 2, stepSize *= 2)
    {
        if (tid < threadCount)
        {
            int lhsIndex = tid * stepSize * 2;
            int rhsIndex = lhsIndex + stepSize;
            d[lhsIndex] += d[rhsIndex];
        }
        __syncthreads();   // make the partial sums of this round visible before the next round
    }
}

int main()
{
    const int count = 8;
    const int size = count * sizeof(int);
    int *d;
    int h[count] = {1, 2, 3, 4, 5, 6, 7, 8};
    cudaMalloc(&d, size);
    cudaMemcpy(d, h, size, cudaMemcpyHostToDevice);

    sum<<<1, count / 2>>>(d);

    cudaMemcpy(h, d, size, cudaMemcpyDeviceToHost);

    cudaFree(d);

    for (int i = 0; i < count; ++i)
        std::cout << h[i] << std::endl;

    getchar();

    return 0;
}

The program output is:

36
2
7
4
26
6
15
8

The first number is the sum we want. Comparing the figure above with the program, we can see that thread 0 does the most work: tid starts at 0 while threadCount keeps halving, so the final result ends up in d[0].
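
The kernel above is only for illustration (a single block and a power-of-two input). In practice the same sum can be obtained with Thrust's thrust::reduce, which handles arbitrary sizes; a minimal sketch with the same data:

#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>
#include <iostream>

int main()
{
    int h[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    thrust::device_vector<int> d(h, h + 8);
    // Sum-reduce the whole array on the GPU
    int total = thrust::reduce(d.begin(), d.end(), 0, thrust::plus<int>());
    std::cout << total << std::endl;   // prints 36
    return 0;
}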

Scan

When the n-th element of the output is determined by the first n elements of the input, the computation pattern is called a scan.

When the operation f is summation:

input:   1   2   3   4   5
output:  1   3   6  10  15

Here we still use a 1D grid, this time with N-1 threads; the thread with index 0 again does the most work:

#include <cuda.h>
#include <iostream>
#include <stdio.h>

__global__ void scanSum(int* d)
{
    int tid = threadIdx.x;

    for (int threadCount = blockDim.x, stepSize = 1; threadCount > 0; stepSize *= 2)
    {
        // Read d[tid] into a register before any thread writes in this round,
        // otherwise the thread writing d[tid] would race with the thread reading it.
        int val = 0;
        if (tid < threadCount)
            val = d[tid];
        __syncthreads();

        if (tid < threadCount)
            d[tid + stepSize] += val;
        __syncthreads();

        threadCount -= stepSize;
    }
}

int main()
{
    const int count = 5;
    const int size = count * sizeof(int);
    int *d;
    int h[count] = {1, 2, 3, 4, 5};
    cudaMalloc(&d, size);
    cudaMemcpy(d, h, size, cudaMemcpyHostToDevice);

    scanSum<<<1, count - 1>>>(d);

    cudaMemcpy(h, d, size, cudaMemcpyDeviceToHost);

    cudaFree(d);

    for (int i = 0; i < count; ++i)
        std::cout << h[i] << std::endl;

    getchar();

    return 0;
}
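
As with reduce, Thrust provides this pattern as thrust::inclusive_scan; a minimal sketch with the same input (my own example, not from the original post):

#include <thrust/device_vector.h>
#include <thrust/scan.h>
#include <iostream>

int main()
{
    int h[5] = {1, 2, 3, 4, 5};
    thrust::device_vector<int> d(h, h + 5);
    // Inclusive prefix sum: element n of the output is the sum of inputs 0..n
    thrust::inclusive_scan(d.begin(), d.end(), d.begin());
    for (int i = 0; i < 5; ++i)
        std::cout << d[i] << std::endl;   // prints 1 3 6 10 15
    return 0;
}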




Reference:

  1. https://developer.nvidia.com/gpugems/gpugems2/part-iv-general-purpose-computation-gpus-primer/chapter-31-mapping-computational
  2. https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-computing/chapter-39-parallel-prefix-sum-scan-cuda
  3. https://blog.csdn.net/seamanj/article/details/82976687
  4. https://thrust.github.io/doc/group__scattering_gac1b33a02dd67cf9c8b3b3aa05c881a37.html
  5. https://docs.nvidia.com/cuda/thrust/index.html
