CUDA indexing does not work as expected

Posted 2023-04-15


[Asked]: 2016-07-01 19:52:56 [Question]:

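I am trying to process a 2D array with PyCUDA, and I need the x,y coordinates of each thread.

This has been asked and answered here and here, but the linked solutions do not work for 2D data that exceeds my block size. Why?

This is the SourceModule I am using to investigate the problem: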

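import numpy
import pycuda.autoinit  # creates a CUDA context on import
import pycuda.gpuarray as gpuarray
from pycuda.compiler import SourceModule

mod = SourceModule("""
  __global__ void kIndexTest(float *M, float *X, float *Y)
  {
    /* linear block index, then a thread index that is linear within each block */
    int bIdx = blockIdx.x + blockIdx.y * gridDim.x;
    int idx = bIdx * (blockDim.x * blockDim.y) + (threadIdx.y * blockDim.x) + threadIdx.x;

    /* this array shows me the unique thread indices */
    M[idx] = idx;

    /* these arrays should capture x, y for each unique index */
    X[idx] = (blockDim.x * blockIdx.x) + threadIdx.x;
    Y[idx] = (blockDim.y * blockIdx.y) + threadIdx.y;
  }
  """)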

I am launching the kernel like this:

gIndexTest = mod.get_function("kIndexTest")

dims = (8, 8)

M = gpuarray.to_gpu(numpy.zeros(dims, dtype=numpy.float32))
X = gpuarray.to_gpu(numpy.zeros(dims, dtype=numpy.float32))
Y = gpuarray.to_gpu(numpy.zeros(dims, dtype=numpy.float32))

gIndexTest(M, X, Y, block=(4, 4, 1), grid=(2, 2, 1))

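M returns the correct indices for every array size and every block/grid configuration I have tested. X and Y return the correct coordinate values only when the dimensions of X and Y equal the block dimensions; otherwise they do not contain the values I expect. For example, the configuration above produces: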

M:
[[  0.   1.   2.   3.   4.   5.   6.   7.]
 [  8.   9.  10.  11.  12.  13.  14.  15.]
 [ 16.  17.  18.  19.  20.  21.  22.  23.]
 [ 24.  25.  26.  27.  28.  29.  30.  31.]
 [ 32.  33.  34.  35.  36.  37.  38.  39.]
 [ 40.  41.  42.  43.  44.  45.  46.  47.]
 [ 48.  49.  50.  51.  52.  53.  54.  55.]
 [ 56.  57.  58.  59.  60.  61.  62.  63.]] (correct)

X:
[[ 0.  1.  2.  3.  0.  1.  2.  3.]
 [ 0.  1.  2.  3.  0.  1.  2.  3.]
 [ 4.  5.  6.  7.  4.  5.  6.  7.]
 [ 4.  5.  6.  7.  4.  5.  6.  7.]
 [ 0.  1.  2.  3.  0.  1.  2.  3.]
 [ 0.  1.  2.  3.  0.  1.  2.  3.]
 [ 4.  5.  6.  7.  4.  5.  6.  7.]
 [ 4.  5.  6.  7.  4.  5.  6.  7.]] (not what I expect)

Y:
[[ 0.  0.  0.  0.  1.  1.  1.  1.]
 [ 2.  2.  2.  2.  3.  3.  3.  3.]
 [ 0.  0.  0.  0.  1.  1.  1.  1.]
 [ 2.  2.  2.  2.  3.  3.  3.  3.]
 [ 4.  4.  4.  4.  5.  5.  5.  5.]
 [ 6.  6.  6.  6.  7.  7.  7.  7.]
 [ 4.  4.  4.  4.  5.  5.  5.  5.]
 [ 6.  6.  6.  6.  7.  7.  7.  7.]] (not what I expect)

This is what I actually expect for X and Y:

X:
[[ 0.  1.  2.  3.  4.  5.  6.  7.]
 [ 0.  1.  2.  3.  4.  5.  6.  7.]
 [ 0.  1.  2.  3.  4.  5.  6.  7.]
 [ 0.  1.  2.  3.  4.  5.  6.  7.]
 [ 0.  1.  2.  3.  4.  5.  6.  7.]
 [ 0.  1.  2.  3.  4.  5.  6.  7.]
 [ 0.  1.  2.  3.  4.  5.  6.  7.]
 [ 0.  1.  2.  3.  4.  5.  6.  7.]] (only works when X dims = block dims)

Y:
[[ 0.  0.  0.  0.  0.  0.  0.  0.]
 [ 1.  1.  1.  1.  1.  1.  1.  1.]
 [ 2.  2.  2.  2.  2.  2.  2.  2.]
 [ 3.  3.  3.  3.  3.  3.  3.  3.]
 [ 4.  4.  4.  4.  4.  4.  4.  4.]
 [ 5.  5.  5.  5.  5.  5.  5.  5.]
 [ 6.  6.  6.  6.  6.  6.  6.  6.]
 [ 7.  7.  7.  7.  7.  7.  7.  7.]] (only works when Y dims = block dims)

What am I misunderstanding?

Here is my device query:

Device 0: "GeForce GT 755M"
  CUDA Driver Version / Runtime Version          7.5 / 6.5
  CUDA Capability Major/Minor version number:    3.0
  Total amount of global memory:                 1024 MBytes (1073283072 bytes)
  ( 2) Multiprocessors, (192) CUDA Cores/MP:     384 CUDA Cores
  GPU Clock rate:                                1085 MHz (1.09 GHz)
  Memory Clock rate:                             2500 Mhz
  Memory Bus Width:                              128-bit
  L2 Cache Size:                                 262144 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Bus ID / PCI location ID:           1 / 0


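[Answer 1]:

Everything is working "as advertised". The problem is that you are mixing incompatible indexing schemes, which produces inconsistent results.

If you want X and Y to come out as you expect, you need to compute idx differently: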

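  __global__ void kIndexTest(float *M, float *X, float *Y)
  {
    /* global x and y coordinates of this thread within the whole grid */
    int xidx = (blockDim.x * blockIdx.x) + threadIdx.x;
    int yidx = (blockDim.y * blockIdx.y) + threadIdx.y;
    /* global linear index: (grid width in threads) * y + x */
    int idx = (gridDim.x * blockDim.x * yidx) + xidx;

    X[idx] = xidx;
    Y[idx] = yidx;
    M[idx] = idx;
  }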

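In this scheme, xidx and yidx are the grid x and y coordinates and idx is the global index, all assuming column-major ordering with respect to (x, y), i.e. x is the fastest-varying dimension. For the (8, 8) NumPy arrays in the question, which are stored in C order, this means idx = 8 * yidx + xidx, so yidx selects the row and xidx the column.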
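The difference between the two schemes can be reproduced on the host without a GPU. The following is a minimal pure-Python sketch, not PyCUDA code: the names `block_linear_index`, `row_major_index`, `to_rows` and `simulate` are hypothetical, and the launch configuration (4x4 blocks on a 2x2 grid) is the one from the question.

```python
# Host-side emulation of the two indexing schemes (no GPU required).
# All function names are illustrative; none of them belong to CUDA or PyCUDA.

BLOCK = (4, 4)  # blockDim.x, blockDim.y, as in the question
GRID = (2, 2)   # gridDim.x, gridDim.y, as in the question
WIDTH = GRID[0] * BLOCK[0]
HEIGHT = GRID[1] * BLOCK[1]


def block_linear_index(bx, by, tx, ty):
    """Index from the question: threads are numbered block by block."""
    b_idx = bx + by * GRID[0]
    return b_idx * (BLOCK[0] * BLOCK[1]) + ty * BLOCK[0] + tx


def row_major_index(bx, by, tx, ty):
    """Index from the answer: (grid width in threads) * y + x."""
    x = BLOCK[0] * bx + tx
    y = BLOCK[1] * by + ty
    return WIDTH * y + x


def to_rows(flat):
    """View a flat list as HEIGHT rows of WIDTH elements, like an (8, 8) array."""
    return [flat[r * WIDTH:(r + 1) * WIDTH] for r in range(HEIGHT)]


def simulate(index_fn):
    """Visit every (block, thread) pair and store x and y at index_fn(...)."""
    X = [None] * (WIDTH * HEIGHT)
    Y = [None] * (WIDTH * HEIGHT)
    for by in range(GRID[1]):
        for bx in range(GRID[0]):
            for ty in range(BLOCK[1]):
                for tx in range(BLOCK[0]):
                    idx = index_fn(bx, by, tx, ty)
                    X[idx] = BLOCK[0] * bx + tx
                    Y[idx] = BLOCK[1] * by + ty
    return to_rows(X), to_rows(Y)


X_q, Y_q = simulate(block_linear_index)  # scheme from the question
X_a, Y_a = simulate(row_major_index)     # scheme from the answer

print(X_q[0], Y_q[0])  # first row under the question's scheme
print(X_a[0], Y_a[0])  # first row under the answer's scheme
```

Under the block-linear index, the first 16 flat positions are all filled by block (0, 0), so the first two rows of the output hold that block's coordinates rather than the coordinates of rows 0 and 1 of the array. With the row-major index, the storage position and the stored coordinates are derived from the same (x, y) pair, so they agree.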

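[Comments]:

Aha! Thank you. To answer my own question, what I did not understand is that my indexing scheme does not always place x and y at the proper x,y position within X and Y. Not because the x,y calculation is flawed, but because the index calculation is.

@DarienCrane: No, that is not the correct interpretation. Your idx calculation is not "flawed"; it is just a different way of computing a unique index within the grid. There are (at least) four different ways to compute a unique index in a 2D grid, and you simply chose to mix two incompatible ones.

For this application it is a flawed approach. My assumption was that I should first generate a unique 'idx' and then use it to map x and y into X and Y. I did not know I was using two different indexing schemes, so the incompatibility you mention never crossed my mind.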
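The workflow described in the last comment (compute a unique idx first, then derive x and y from it) can also be made consistent, provided the coordinates are derived from the same index that is used for the store. A minimal sketch in plain Python, assuming a row-major layout of known width; the function name `coords_from_index` is hypothetical. Inside a kernel the same arithmetic would be `idx % width` and integer `idx / width`.

```python
def coords_from_index(idx, width):
    """Invert idx = width * y + x for a row-major layout of the given width."""
    y, x = divmod(idx, width)
    return x, y


width = 8  # array width from the question's dims = (8, 8)
for idx in (0, 7, 8, 63):
    print(idx, coords_from_index(idx, width))
```

Because every thread in the question's kernel already gets a unique idx in the range 0 to 63, storing these derived coordinates at `X[idx]` and `Y[idx]` would fill the arrays in the expected pattern no matter which thread handles which element.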
