使用 CUDA 在 C 中并行化基数排序的问题

Posted 2023-02-22

技术标签:

【中文标题】使用 CUDA 在 C 中并行化基数排序的问题【英文标题】：Problems in parallelizing radix sort in C with CUDA 【发布时间】：2021-12-30 17:16:46 【问题描述】：

我正在尝试使用 C 中的 CUDA 实现一个基数排序算法，以便能够并行化它；代码如下：

#include <stdlib.h>
#include <string.h>
#include <math.h>
#include <time.h>
#include <cuda_profiler_api.h>

int* populateArray(int * arr, int n)
    int j=0;
    for (int i=n; i>0 ; i--)
        arr[j]= i;
        j=j+1;
    
    return arr;
    

void printArray(int * array, int size)
    printf("Ordered array:\n");
    for (int j=0; j<size; j++)
        printf("%d ", array[j]);
    
    printf("\n");


__device__  int getMax(int* array, int n) 
  int max = array[0];
  for (int i = 1; i < n; i++)
    if (array[i] > max)
      max = array[i];
  return max;


__device__ void countingSort(int* array,int size,int digit, int index, int* output) 
  int count[10]=0;

  for (int i = 0; i < size; i++)
    count[(array[i] / digit) % 10]++;

  for (int i = 1; i < 10; i++)
    count[i] += count[i - 1];

  for (int i = size - 1; i >= 0; i--) 
    output[count[(array[i] /digit) % 10] - 1] = array[i];
    count[(array[i] / digit) % 10]--;
    

  for (int i = 0; i < size; i++)
    array[i]= output[i];
  


__global__ void radixsort(int* array, int size, int* output) 
  // define index
  int index = blockIdx.x * blockDim.x + threadIdx.x;
  
  // check that the thread is not out of the vector boundary
  if (index >= size) return;

  int max = getMax(array, size);
  for (int digit = 1; max / digit > 0; digit *= 10)
    countingSort(array, size, digit, index, output);
    
 


int main(int argc, char *argv[]) 
    // Init array
    int n = 1000;
    int* array_h = (int*) malloc(sizeof(int) * n);
    populateArray(array_h, n);
    
    // allocate memory on device
    int* array_dev;
    cudaMalloc((void**)&array_dev, n*sizeof(int));
    int* output;
    cudaMalloc((void**)&output, n*sizeof(int));

    // copy data from host to device
    cudaMemcpy(array_dev, array_h, n*sizeof(int), cudaMemcpyHostToDevice);

    dim3 block(32);
    dim3 grid((n-1)/block.x + 1);
    printf("Number of threads for each block: %d\n", block.x);
    printf("Number of blocks in the grid: %d\n", grid.x);

    // Create start and stop CUDA events 
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);

    cudaError_t mycudaerror;
    mycudaerror = cudaGetLastError();
    // define the execution configuration
    radixsort<<<grid,block>>>(array_dev, n, output);

    // device synchronization and cudaGetLastError call
    mycudaerror = cudaGetLastError();
    if(mycudaerror != cudaSuccess)  
      fprintf(stderr,"%s\n",cudaGetErrorString(mycudaerror)) ;
      //printf("Error in kernel!");
      exit(1);
    

    // event record, synchronization, elapsed time and destruction
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float elapsed;
    cudaEventElapsedTime(&elapsed, start, stop);
    elapsed = elapsed/1000.f; // convert to seconds
    cudaEventDestroy(start);
    cudaEventDestroy(stop);

    printf("Number of elements in the array: %d\n", n);
    printf("Kernel elapsed time: %.5f s\n", elapsed);

    // copy back results from device
    cudaMemcpy(array_h, array_dev, n*sizeof(int), cudaMemcpyDeviceToHost);

    // print ordered array
    printArray(array_h, n);
    
    // free resources on device
    cudaFree(array_dev);
    cudaFree(output);

    // free resources on host
    free(array_h);

    cudaProfilerStop();
    return 0;

为了运行它，我使用的是 Google Colab。每个块的最大线程数固定为 32（网格变量），而使用的块数在 main 中根据需要排序的元素数量（块变量）计算。

当我开始更改要排序的数组中的元素数量（main 中存在变量“n”）时，就会出现问题，因为一旦超过某个阈值，排序将不再正确执行。

为了获取更多关于这个错误执行的信息，我也使用了命令cuda-memcheck和nvcc -lineinfo，发现错误是由于内核中的这行代码造成的：

output[count[(array[i] /digit) % 10] - 1] = array[i];

多次尝试，错误似乎主要在于计算我试图写入的“输出”数组的索引；但是，例如，当我尝试对 32 或 64 个元素进行排序时，不会发生此错误。因此，我想知道我是否在代码中做错了什么，或者简单地说，基数排序在我尝试的方式中是不可并行的。我知道让每个线程在不使用相对于每个线程的索引的情况下进行排序从计算上来说是非常繁重的，但首先我想尝试解决这个问题，然后尝试总体上优化代码。

各种经过验证的方法包括：

使用原子指令；在内核中声明“输出”数组，不要将其作为 main 的指针传递；仅使用一个线程块来计算已排序的数组（这种方法似乎很有效，但考虑到块数的动态分配，这是无用的）

感谢您的任何回复。

【问题讨论】：

很遗憾我学习CUDA的时间很短，所以我对所有的基础知识都不是很了解，谢谢你的回答 【参考方案1】：

你似乎已经明白这不是解决这个问题的方法：

我知道让每个线程在不使用相对于每个线程的索引的情况下进行排序在计算上是非常繁重的，但首先我想尝试解决这个问题，然后尝试总体上优化代码。

几乎任何本质上是串行的代码都可以放在 CUDA 内核中，在单线程中运行，并且应该产生相同的结果。然而，这不是编写 CUDA 代码的方式；表现会很惨淡。

在许多情况下，本质上是串行的算法不容易适应并行化。您的排序方法就是其中之一。一种（有效）并行化基数排序的典型方法是here。

尽管如此，将串行算法放入单个线程“应该”工作。当您在多个线程中（不必要地）运行相同的串行代码时，就会出现问题。

如果所有线程都完美地步调一致，那么它们都会在相同的指令或时钟周期内做完全相同的事情。但是 GPU 不是这样工作的。您可以看到的最大“锁步”是在经线级别（线程块中有 32 个线程）。一旦你去多个warp，无论它们是在同一个线程块中还是不同的线程块中，那么你的串行代码中的不同点都会有线程，有效地相互踩踏。

一个简单的证明点是将您的网格配置更改为单个线程：

dim3 block(1);
dim3 grid(1);

然后错误消失，即使对于n = 1000，数组也正确排序。由于单一经线的典型锁步性质，这也应该有效：

dim3 block(32);
dim3 grid(1);

某些较大的配置在某些情况下可能会起作用，但如果块数量足够多，最终所有线程不会同时启动，这会导致麻烦。

【讨论】：

感谢罗伯特的回答。实际上，我的目标是实现一个可以处理大量元素（例如：1000 万个元素）的基数排序算法。我还查看了您推荐的实现，我注意到它特别适用于 32 个元素：您知道是否有一个实现允许我处理更多元素（因此，能够使用更多线程/块）？我也通过 CUDA Thrust 做了一个实现，但我想在没有它的情况下实现一个。我不确定你是否理解我在那里写的内容。该答案中描述的方法不限于特定的数据集大小。是的，我给出的特定示例代码仅限于 32 个元素，但它是用于教学目的。该方法（如 udacity 所教导的）可扩展到任意数据集大小，并选择适当的并行扫描算法。如果您想实现自己的，那是 udacity 为教学目的提供的方法。如果你想要一个已经编写好的并且可以运行的，thrust 和 cub 都是开源的。如果您认为我的回答令人困惑，请假装它不存在并阅读另一个。您也可以研究 udacity 材料。再次感谢您的回答。应该如何修改扫描以允许算法使用超过 32 个元素？我尝试自己修改代码，但没有成功这不是小事。您可以通过在cuda 标签上研究有关它的问题来了解如何编写大规模扫描操作，或者您可以使用来自例如的大规模扫描操作。 cub 或 thrust.【参考方案2】：

另一种解决方案是使用具有最高有效位基数排序的单个线程将数组分成多个 bin，例如，如果使用 base 256，则为 256 个 bin 在此初始步骤之后，每个 bin 都可以独立排序并在并行使用传统的最低有效位第一基数排序。为了使其正常工作，数据需要合理统一（至少对于最高有效数字），以便多个 bin 的大小有点相似。

尝试并行执行初始步骤很复杂。

【讨论】：

以上是关于使用 CUDA 在 C 中并行化基数排序的问题的主要内容，如果未能解决你的问题，请参考以下文章

基数排序及其并行化

C++ 并行就地基数排序

基数排序的并行版本未按预期运行（Java）

C中基数排序的不同基础

c中的基数排序算法[关闭]

使用C实现基数排序