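Accessing 2D data in a CUDA kernel [duplicate]
【Posted】2021-08-12 17:59:09
【Question】
I am working on a university assignment whose main idea is to compare CUDA data parallelism with CUDA task parallelism. I decided to parallelize Conway's Game of Life. The problem is that I cannot figure out how to navigate a 2D array in several directions in CUDA, i.e. up/down/left/right and the diagonal corners around the cell that a thread is evaluating.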
So far I have come up with the following.
First kernel:
// Counts the live cells around each cell and stores the count in newBoard.
__global__ void numAliveAround(int *oldBoard, int *newBoard, int xSize, int ySize, size_t pitchOld, size_t pitchNew)
{
    int x = (blockIdx.x * blockDim.x) + threadIdx.x;
    int y = (blockIdx.y * blockDim.y) + threadIdx.y;
    if (x < xSize && y < ySize)
    {
        int outputNumber = 0;
        // cell above
        // xMod makes sure the coordinate wraps when it overflows the board
        int xMod = ((x + 1) % xSize + xSize) % xSize;
        // idx calculation (as posted; pitchOld is not used here)
        int idx = xMod * xSize + y;
        outputNumber += oldBoard[idx];
        // more of the same code, just for the cells below, left, right, and the corners
        newBoard[x * xSize + y] = outputNumber;
    }
}
Second kernel:
// Sets the new state of each cell from the number of live cells around it.
__global__ void determineNextState(int *board, int *newBoard, int xSize, int ySize, size_t pitchOld, size_t pitchNew)
{
    // global thread coordinates
    int x = (blockIdx.x * blockDim.x) + threadIdx.x;
    int y = (blockIdx.y * blockDim.y) + threadIdx.y;
    if (x < xSize && y < ySize)
    {
        int idxNew = x * xSize + y;
        int idxOld = x * xSize + y;
        int state = board[idxOld];
        // ALIVE = 1, DEAD = 0
        int output = DEAD;
        // check whether any alive condition is met
        if (state == ALIVE)
        {
            if (newBoard[idxNew] == 2 || newBoard[idxNew] == 3)
                output = ALIVE;
        }
        else
        {
            if (newBoard[idxNew] == 3)
                output = ALIVE;
        }
        newBoard[idxNew] = output;
    }
}
Function that launches the kernels:
void SendToCUDA(int oldBoard[COLUMNS][ROWS], int newBoard[COLUMNS][ROWS])
{
    // CUDA pointers
    int *d_oldBoard;
    int *d_newBoard;
    size_t pitchOld;
    size_t pitchNew;
    cudaMallocPitch(&d_oldBoard, &pitchOld, COLUMNS * sizeof(int), ROWS);
    cudaMallocPitch(&d_newBoard, &pitchNew, COLUMNS * sizeof(int), ROWS);
    cudaMemcpy2D(d_oldBoard, pitchOld, oldBoard, COLUMNS * sizeof(int), COLUMNS * sizeof(int), ROWS, cudaMemcpyHostToDevice);
    dim3 grid(divideAndRound(COLUMNS, BLOCKSIZE_X), divideAndRound(ROWS, BLOCKSIZE_Y));
    dim3 block(BLOCKSIZE_Y, BLOCKSIZE_X);
    printf("counting \n");
    // launch arguments are kept as posted; note that CUDA expects <<<gridDim, blockDim>>>
    numAliveAround <<<block, grid>>> (d_oldBoard, d_newBoard, COLUMNS, ROWS, pitchOld, pitchNew);
    cudaDeviceSynchronize();
    printf("determining \n");
    determineNextState <<<block, grid>>> (d_oldBoard, d_newBoard, COLUMNS, ROWS, pitchOld, pitchNew);
    cudaDeviceSynchronize();
    // newBoard is used later (outside this function) to display the board
    cudaMemcpy2D(newBoard, COLUMNS * sizeof(int), d_newBoard, pitchNew, COLUMNS * sizeof(int), ROWS, cudaMemcpyDeviceToHost);
    cudaFree(d_oldBoard);
    cudaFree(d_newBoard);
}
I have found several ways of accessing a flattened 2D array, some of which contradict each other, for example:
// what is usually given as an explanation
idx = x * width + y;
// sometimes x and y are swapped
idx = y * width + x;
// what works with simple access
int *value = (int *)((char *)(d_matrix + y * pitch)) + x;
// or
idx = x * xDim + y + pitch;
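Interestingly, the last two work when I only touch a single point of the array at a time (for example, incrementing every value by 1), but they do not work at all for more complex navigation. I have been stuck on this problem for quite a while now, so any insight would be very helpful.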
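The two flattened-index formulas in the question are not contradictory in themselves: either one is valid as long as the same convention is used for the allocation, the copies, and every access. Below is a minimal host-side sketch, assuming a row-major board without padding in which x is the column and y is the row. The helper names `wrap`, `cellIndex` and `countNeighbors` are made up for this illustration and do not come from the question.

```cpp
#include <cassert>
#include <vector>

// Wrap a coordinate onto [0, size), so -1 maps to size - 1 and size maps to 0.
inline int wrap(int v, int size) { return ((v % size) + size) % size; }

// Row-major index: x is the column (fastest varying), y is the row.
inline int cellIndex(int x, int y, int width) { return y * width + x; }

// Count the live cells among the eight neighbours of (x, y) on a wrapped board.
inline int countNeighbors(const std::vector<int>& board, int x, int y, int width, int height)
{
    int count = 0;
    for (int dy = -1; dy <= 1; ++dy) {
        for (int dx = -1; dx <= 1; ++dx) {
            if (dx == 0 && dy == 0) continue;  // skip the cell itself
            count += board[cellIndex(wrap(x + dx, width), wrap(y + dy, height), width)];
        }
    }
    return count;
}
```

Inside a kernel, x and y would come from the block and thread indices as in the question, and the same arithmetic applies once the row stride accounts for any padding.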
【Discussion of the question】
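【Answer 1】
I did figure out the answer: the correct way to access a 2D array after cudaMalloc2D is:
board[y * (pitch / sizeof(int)) + x]
Because pitch is a length in bytes, it must first be converted into a number of elements of the data type before it is used to index the array with the [] operator:
pitch / sizeof(datatype)
I later found more problems with this code, so please do not just copy it.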
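The formula in this answer can be written as a small helper. This is only a sketch under the assumption that the pitch is a whole multiple of `sizeof(int)`, which the sketch checks explicitly rather than takes for granted. The names `HOST_DEVICE`, `pitchIsElementAligned` and `pitchedIndex` are invented for this example. The macro lets the same functions compile as plain C++ and, under nvcc, be called from a kernel.

```cpp
#include <cassert>
#include <cstddef>

#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// True when the byte pitch is a whole number of ints, which element indexing requires.
HOST_DEVICE inline bool pitchIsElementAligned(std::size_t pitchBytes)
{
    return pitchBytes % sizeof(int) == 0;
}

// Element index of column x in row y for a pitched int array (pitch given in bytes).
HOST_DEVICE inline std::size_t pitchedIndex(int x, int y, std::size_t pitchBytes)
{
    std::size_t rowStride = pitchBytes / sizeof(int);  // ints per padded row
    return static_cast<std::size_t>(y) * rowStride + static_cast<std::size_t>(x);
}
```

For example, with a pitch of 512 bytes and 4-byte ints, each padded row holds 128 elements, so column 3 of row 2 sits at element 2 * 128 + 3.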
【Discussion】
This is incorrect: it assumes that pitch is divisible by the element size, and CUDA makes no such guarantee. The correct method is given in the documentation for cudaMallocPitch. (There is no cudaMalloc2D in CUDA.)
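The addressing pattern described in the cudaMallocPitch documentation offsets the base pointer in bytes first and only then casts back to the element type, so it does not depend on the pitch being divisible by the element size. The following is a minimal sketch of that pattern applied to the wrapped neighbour count from the question. `rowPointer` and `countNeighborsPitched` are hypothetical helpers written for illustration, not code from the thread or from CUDA.

```cpp
#include <cassert>
#include <cstddef>

#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// Pointer to the first element of row y: apply the pitch in bytes, then cast to int.
HOST_DEVICE inline const int* rowPointer(const int* base, std::size_t pitchBytes, int y)
{
    const char* bytes = reinterpret_cast<const char*>(base);
    return reinterpret_cast<const int*>(bytes + static_cast<std::size_t>(y) * pitchBytes);
}

// Count the live neighbours of (x, y) on a wrapped board stored with a byte pitch.
HOST_DEVICE inline int countNeighborsPitched(const int* board, std::size_t pitchBytes,
                                             int x, int y, int width, int height)
{
    int count = 0;
    for (int dy = -1; dy <= 1; ++dy) {
        const int* row = rowPointer(board, pitchBytes, (y + dy + height) % height);
        for (int dx = -1; dx <= 1; ++dx) {
            if (dx == 0 && dy == 0) continue;  // skip the cell itself
            count += row[(x + dx + width) % width];
        }
    }
    return count;
}
```

In a kernel, x and y would be computed from the block and thread indices, and `pitchBytes` would be the value returned by cudaMallocPitch for that buffer. Because columns are wrapped with `width` rather than with the padded row length, the padding bytes are never read.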