CUDA result returns garbage using very large array, but reports no error

Posted 2023-04-15

【Posted】: 2012-11-11 23:54:22 【Question】:

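I am writing a test program that creates a device array and a host array of size n, then launches a kernel with n threads, each of which writes the constant value 0.95f to one element of the device array. When the kernel finishes, the device array is copied to the host array, all entries are summed, and the final total is displayed.

The program below seems to work fine for arrays of up to roughly 60 million floats and returns the correct result very quickly. At 70 million, however, it appears to hang for a while and eventually returns NaN as the total. Inspecting the host array after a 60 million run shows it correctly filled with 0.95f, but after a 70 million run it is filled with NaN. As far as I can tell, none of the CUDA calls return an error.

I am using a 2GB GT640m (Compute 3.0), with a maximum block size of 1024 and a maximum grid dimension of 2147483647.

I am sure there are better ways to achieve something similar, and I would like to hear suggestions. But I also want to understand what went wrong here so that I can learn from it.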

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <stdio.h>
#include <string.h>   // memset
#include <fstream>

void cudaErrorHandler(cudaError_t status)
{
    // Cuda call returned an error, just print error for now
    if(status != cudaSuccess)
    {
        printf("Error");
    }
}

__global__ void addKernel(float* _Results, int _TotalCombinations)
{
    // Get thread Id
    unsigned int Id = (blockDim.x * blockDim.y * blockIdx.x) + (blockDim.x * threadIdx.y) + threadIdx.x;

    //If the Id is within simulation range, log it
    if(Id < _TotalCombinations)
    {
        _Results[Id] = 0.95f;
    }
}

#define BLOCK_DIM_X 32
#define BLOCK_DIM_Y 32
#define BLOCK_SIZE (BLOCK_DIM_X * BLOCK_DIM_Y) // Static block size of 32*32 (1024)
#define CUDA_CALL(x) cudaErrorHandler(x)

int main()
{
    // The number of simulations to run
    unsigned int totalCombinations = 45000000;

    int gridsize = 1;

    // Work out how many blocks of size 1024 are required to perform all of totalCombinations
    for(unsigned int totalsize = gridsize * BLOCK_SIZE; totalsize < totalCombinations; 
        gridsize++, totalsize = gridsize * BLOCK_SIZE)
        ;

    // Allocate host memory
    float* host_results = new float[totalCombinations];
    memset(host_results, 0, sizeof(float) * totalCombinations);
    float *dev_results = 0;

    cudaSetDevice(0);

    // Allocate device memory
    CUDA_CALL(cudaMalloc((void**)&dev_results, totalCombinations * sizeof(float)));

    dim3 grid, block;

    block = dim3(BLOCK_DIM_X, BLOCK_DIM_Y);

    grid = dim3(gridsize);

    // Launch kernel (note: the launch itself is not wrapped in CUDA_CALL)
    addKernel<<<gridsize, block>>>(dev_results, totalCombinations);

    // Wait for synchronize
    CUDA_CALL(cudaDeviceSynchronize());

    // Copy device data back to host
    CUDA_CALL(cudaMemcpy(host_results, dev_results, totalCombinations * sizeof(float), cudaMemcpyDeviceToHost));

    double total = 0.0;

    // Total the results in the host array
    for(unsigned int i = 0; i < totalCombinations; i++)
        total+=host_results[i];

    // Print results to screen
    printf("Total %f\n", total);

    delete[] host_results;

    return 0;
}

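【Comments】:

Your error-handling approach does not work. To demonstrate this, change your block dim x and y to 50 (which yields 2500 threads, which is illegal) and no error will be printed. If you fix your error handling, you will find the problem. The reason things fail at your failure point is that your grid size (you are launching a 1D grid) exceeds the maximum grid size in the X dimension (65535 by default). If you want to take advantage of the larger grid size, you need to compile with the -arch=sm_30 switch. Also note that your block X dimension is 22, which is not recommended.

@RobertCrovella I corrected my error checking (and the block X, which was a typo on my part), then added the switch to the command line. That solved my problem and everything runs fine. If you would like to submit this as an answer, I will accept it.

【Answer 1】: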

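As you have discovered, your error-handling approach does not work. Below I have pasted a version of your code with an error-checking method that I use often. The reason things fail at your failure point is that your grid size (you are launching a 1D grid) exceeds the maximum grid size in the X dimension (65535 by default, i.e. for compute capability up to 2.x). If you want to take advantage of the larger grid size (2^31 - 1 is the limit for compute capability 3.0), you need to compile with the -arch=sm_30 switch.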
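To see why the threshold falls between 60 and 70 million elements, the arithmetic can be checked on the host. The following is a minimal sketch in plain C++ (no CUDA calls). The helper names `blocksNeeded` and `fitsInGridX` and the two constants are made up for illustration; on a real device the limit would normally be read from `cudaDeviceProp::maxGridSize[0]` rather than hard-coded.

```cpp
#include <cstdint>

// Illustrative limits for the X dimension of a grid (hypothetical names).
constexpr std::uint64_t kMaxGridX_cc2x = 65535;       // compute capability up to 2.x
constexpr std::uint64_t kMaxGridX_cc30 = 2147483647;  // compute capability 3.0 (2^31 - 1)

// Number of blocks needed to cover n elements: ceiling division.
constexpr std::uint64_t blocksNeeded(std::uint64_t n, std::uint64_t threadsPerBlock) {
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// True if a 1D grid with this many blocks stays within the given X limit.
constexpr bool fitsInGridX(std::uint64_t blocks, std::uint64_t maxGridX) {
    return blocks >= 1 && blocks <= maxGridX;
}

// 60 million elements at 1024 threads per block: 58594 blocks, within 65535.
static_assert(blocksNeeded(60000000, 1024) == 58594, "60M block count");
static_assert(fitsInGridX(blocksNeeded(60000000, 1024), kMaxGridX_cc2x), "60M fits");

// 70 million elements: 68360 blocks, which exceeds 65535 but not 2^31 - 1.
static_assert(blocksNeeded(70000000, 1024) == 68360, "70M block count");
static_assert(!fitsInGridX(blocksNeeded(70000000, 1024), kMaxGridX_cc2x), "70M too large");
static_assert(fitsInGridX(blocksNeeded(70000000, 1024), kMaxGridX_cc30), "70M fits on cc 3.0");
```

With 1024 threads per block, a grid of 65535 blocks covers at most 65535 × 1024 = 67,107,840 elements, so 60 million fits and 70 million does not whenever the 65535 limit applies.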

For reference, here is a version of your code that shows the error-checking method I often use.

#include <stdio.h>
#include <stdlib.h>   // exit
#include <string.h>   // memset
#include <fstream>


// Checks the most recent CUDA runtime error and aborts with a message if one is set
#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                msg, cudaGetErrorString(__err), \
                __FILE__, __LINE__); \
            fprintf(stderr, "*** FAILED - ABORTING\n"); \
            exit(1); \
        } \
    } while (0)

__global__ void addKernel(float* _Results, int _TotalCombinations)
{
    // Get thread Id
    unsigned int Id = (blockDim.x * blockDim.y * blockIdx.x) + (blockDim.x * threadIdx.y) + threadIdx.x;

    //If the Id is within simulation range, log it
    if(Id < _TotalCombinations)
    {
        _Results[Id] = 0.95f;
    }
}

#define BLOCK_DIM_X 32
#define BLOCK_DIM_Y 32
#define BLOCK_SIZE (BLOCK_DIM_X * BLOCK_DIM_Y) // Static block size of 32*32 (1024)

int main()
{
    // The number of simulations to run
    unsigned int totalCombinations = 65000000;

    int gridsize = 1;

    // Work out how many blocks of size 1024 are required to perform all of totalCombinations
    for(unsigned int totalsize = gridsize * BLOCK_SIZE; totalsize < totalCombinations;
        gridsize++, totalsize = gridsize * BLOCK_SIZE)
        ;
    printf("gridsize = %d, blocksize = %d\n", gridsize, BLOCK_SIZE);
    // Allocate host memory
    float* host_results = new float[totalCombinations];
    memset(host_results, 0, sizeof(float) * totalCombinations);
    float *dev_results = 0;

    cudaSetDevice(0);

    // Allocate device memory
    cudaMalloc((void**)&dev_results, totalCombinations * sizeof(float));
    cudaCheckErrors("cudaMalloc fail");

    dim3 grid, block;

    block = dim3(BLOCK_DIM_X, BLOCK_DIM_Y);

    grid = dim3(gridsize);

    // Launch kernel
    addKernel<<<gridsize, block>>>(dev_results, totalCombinations);
    cudaCheckErrors("kernel fail");
    // Wait for synchronize
    cudaDeviceSynchronize();
    cudaCheckErrors("sync fail");

    // Copy device data back to host
    cudaMemcpy(host_results, dev_results, totalCombinations * sizeof(float), cudaMemcpyDeviceToHost);
    cudaCheckErrors("cudaMemcpy 2 fail");

    double total = 0.0;

    // Total the results in the host array
    for(unsigned int i = 0; i < totalCombinations; i++)
        total+=host_results[i];

    // Print results to screen
    printf("Total %f\n", total);

    delete[] host_results;

    return 0;
}
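
A side note on the grid-size computation in the listing: the counting loop can be replaced by a closed-form ceiling division, but only if the block-size macro expands to a parenthesized expression. The host-only sketch below uses hypothetical macro and function names (`BLOCK_SIZE_BARE`, `BLOCK_SIZE_PAREN`, `gridSizeByLoop`, and so on) to show the difference. It is an illustration, not part of the original answer.

```cpp
#define BLOCK_DIM_X 32
#define BLOCK_DIM_Y 32
#define BLOCK_SIZE_BARE  BLOCK_DIM_X * BLOCK_DIM_Y    // expands to 32 * 32
#define BLOCK_SIZE_PAREN (BLOCK_DIM_X * BLOCK_DIM_Y)  // expands to (32 * 32)

// Grid size found by counting blocks until n elements are covered.
unsigned int gridSizeByLoop(unsigned int n) {
    unsigned int gridsize = 1;
    while (gridsize * BLOCK_SIZE_PAREN < n) {
        ++gridsize;
    }
    return gridsize;
}

// Closed-form ceiling division; matches the loop for n >= 1.
unsigned int gridSizeClosedForm(unsigned int n) {
    return (n + BLOCK_SIZE_PAREN - 1) / BLOCK_SIZE_PAREN;
}

// Same formula with the bare macro. It expands to
// (n + 32 * 32 - 1) / 32 * 32, which divides by 32 and then
// multiplies by 32 instead of dividing by 1024.
unsigned int gridSizeBareMacro(unsigned int n) {
    return (n + BLOCK_SIZE_BARE - 1) / BLOCK_SIZE_BARE;
}
```

For n = 65000000 the loop and the parenthesized closed form both give 63477 blocks, while the bare-macro version returns 65000992, a value close to n itself rather than a block count.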

【Discussion】:

Thanks for your help, I will use this error-checking method from now on.

Great answer. It helped me even today. Thanks!
