Cuda for 循环疑问

Posted

技术标签:

【中文标题】Cuda for 循环疑问【英文标题】:Cuda for loop doubts 【发布时间】:2013-01-24 03:30:57 【问题描述】:

我研究了 cuda 内核函数内部的 for 循环。我将数据分成行和列。我不明白为什么“fillFirstCells_kernel”不能为我提供正确的结果。 “fillFirstCells_kernel1”函数提供了预期的结果。

任何cmets。

下面是代码:

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define ROWS    10
#define COLS    4

__global__ void fillData_kernel(int len,
                                unsigned int* data)

    int offset = blockIdx.x + blockDim.x * threadIdx.x;

    if (offset < len)
    
        data[offset] = offset;

        offset += blockDim.x * gridDim.x;
    


__global__ void fillFirstCells_kernel(unsigned int *data,
                                      unsigned int *result)

    int offset = blockIdx.x + blockDim.x * threadIdx.x;

    while (offset < ROWS)
    
        for (int i=0; i<4; i++)
        
            result[offset] += data[offset*COLS+i];      
            //result[offset] += data[offset*COLS];
            //atomicAdd(&result[offset], 1);
        

        offset += blockDim.x * gridDim.x;
    


__global__ void fillFirstCells_kernel1(unsigned int *data,
                                      unsigned int *result)

    int offset = blockIdx.x + blockDim.x * threadIdx.x;

    if (offset < ROWS)
    
        result[offset] = data[offset*COLS] + 
                         data[offset*COLS+1] +
                         data[offset*COLS+2] +
                         data[offset*COLS+3];

        //offset += blockDim.x * gridDim.x;
    


void displayIntOutput(unsigned int* data)

    for (unsigned int i=0; i<ROWS; i++)
    
        for (unsigned int j=0; j<COLS; j++)
        
            printf("%08x ", data[i*COLS+j]);
        
        printf("\n");
    

    printf("\n");



void main()

    unsigned int *h_data, *h_filled_data;
    unsigned int *d_filled_data;

    int size = ROWS * COLS * sizeof(unsigned int);

    h_data = (unsigned int*)malloc(size);
    h_filled_data = (unsigned int*)malloc(size);

    cudaMalloc((void**)&d_filled_data, size);
    cudaMemset(d_filled_data, 0, size);

    dim3 threads(8, 1);
    dim3 blocks(ceil((float)size/(float)threads.x), 1);

    fillData_kernel<<<blocks,threads>>>(size, d_filled_data);

    cudaMemcpy(h_filled_data, d_filled_data, size, cudaMemcpyDeviceToHost);

    displayIntOutput(h_filled_data);


    unsigned int *h_first_item_in_col;
    unsigned int *d_first_item_in_col;

    h_first_item_in_col = (unsigned int*)malloc(ROWS*sizeof(int));
    memset(h_first_item_in_col, 0, ROWS*sizeof(int));

    cudaMalloc((void**)&d_first_item_in_col, ROWS*sizeof(int));
    cudaMemset(d_first_item_in_col, 0, ROWS*sizeof(int));

    dim3 threads_first(8, 1);
    dim3 blocks_first(ceil((float)(ROWS*sizeof(int))/(float)threads_first.x), 1);

    fillFirstCells_kernel<<<blocks, threads>>>(d_filled_data, d_first_item_in_col);

    cudaMemcpy(h_first_item_in_col, d_first_item_in_col, ROWS*sizeof(int), cudaMemcpyDeviceToHost);

    for (int i=0; i<ROWS; i++)
        printf("%d. %08x\n", (i+1), h_first_item_in_col[i]);

    cudaFree(d_filled_data);
    cudaFree(d_first_item_in_col);

    free(h_data);
    free(h_filled_data);
    free(h_first_item_in_col);

    system("pause");

内核“fillFirstCells_kernel”的输出:

00000000 00000001 00000002 00000003 00000004 00000005 00000006 00000007 00000008 00000009 0000000a 0000000b 0000000c 0000000d 0000000e 0000000f 00000010 00000011 00000012 00000013 00000014 00000015 00000016 00000017 00000018 00000019 0000001a 0000001b 0000001c 0000001d 0000001e 0000001f 00000020 00000021 00000022 00000023 00000024 00000025 00000026 00000027

    00000006 00000016 00000026 00000036 00000046 00000056 00000066 00000076 0000010c 0000012c

内核“fillFirstCells_kernel”的输出:

00000000 00000001 00000002 00000003 00000004 00000005 00000006 00000007 00000008 00000009 0000000a 0000000b 0000000c 0000000d 0000000e 0000000f 00000010 00000011 00000012 00000013 00000014 00000015 00000016 00000017 00000018 00000019 0000001a 0000001b 0000001c 0000001d 0000001e 0000001f 00000020 00000021 00000022 00000023 00000024 00000025 00000026 00000027

    00000006 00000016 00000026 00000036 00000046 00000056 00000066 00000076 00000086 00000096

【问题讨论】:

【参考方案1】:

您犯了几个错误。

int offset = blockIdx.x + blockDim.x * threadIdx.x;

应该是:

int offset = blockIdx.x * blockDim.x + threadIdx.x;

在启动内核时,您应该为每个元素而不是字节启动一个线程。

dim3 blocks(ceil((float)size/(float)threads.x), 1);

应该是

dim3 blocks(ceil((float)ROWS*COLS/(float)threads.x), 1);

核函数的长度参数错误:

fillData_kernel<<<blocks,threads>>>(size, d_filled_data);

应该是:

fillData_kernel<<<blocks,threads>>>(ROWS*COLS, d_filled_data);

我想就是这样。

【讨论】:

以上是关于Cuda for 循环疑问的主要内容,如果未能解决你的问题,请参考以下文章

mysql prepare疑问

为啥 torch.version.cuda 和 deviceQuery 报告不同的版本?

带有 for 循环的日期数组

菜鸟关于Windows消息循环的疑问

For循环在创建文件时出错[重复]

使用带有两个变量的 for 循环