在 Numpy 数组上使用 Pycuda 的 GPU 数组乘法

Posted 2023-02-16

技术标签:

【中文标题】在 Numpy 数组上使用 Pycuda 的 GPU 数组乘法【英文标题】：GPU Array multiplications using Pycuda on Numpy arrays 【发布时间】：2019-06-27 13:56:49 【问题描述】：

我试图通过制作类似的 GPU 数组并执行操作来实现两个 numpy 数组的元素乘法。 但是，生成的执行时间比原始的 numpy 逐点乘法要慢得多。我希望使用 GPU 获得良好的加速。 zz0 是 complex128 类型，(64,256,16) 形状 numpy 数组，xx0 是 float64 类型，(16,151) 形状 numpy 数组。有人可以帮我弄清楚我在实施方面做错了什么：

import sys
import numpy as np
import matplotlib.pyplot as plt
import pdb
import time

import pycuda.driver as drv
import pycuda.autoinit
from pycuda.compiler import SourceModule
from pycuda.elementwise import ElementwiseKernel
import pycuda.gpuarray as gpuarray
import pycuda.cumath
import skcuda.linalg as linalg

linalg.init()

# Function for doing a point-wise multiplication using GPU
def calc_Hyp(zz,xx):
    zz_stretch = np.tile(zz, (1,1,1,xx.shape[3]))
    xx_stretch = np.tile(xx, (zz.shape[0],zz.shape[1],1,1))
    zzg = gpuarray.to_gpu(zz_stretch)
    xxg = gpuarray.to_gpu(xx_stretch)
    zz_Hypg = linalg.multiply(zzg,xxg)
    zz_Hyp = zz_Hypg.get()
    return zz_Hyp


zz0 = np.random.uniform(10.0/5000, 20000.0/5000, (64,256,16)).astype('complex128')
xx0 = np.random.uniform(10.0/5000, 20000.0/5000, (16,151)).astype('float64')

xx0_exp = np.exp(-1j*xx0)

t1 = time.time()

#Using GPU for the calculation
zz0_Hyp = calc_Hyp(zz0[:,:,:,None],xx0_exp[None,None,:,:])
#np.save('zz0_Hyp',zz0_Hyp)

t2 = time.time()
print('Time taken with GPU:'.format(t2-t1))

#Original calculation
zz0_Hyp_actual = zz0[:,:,:,None]*xx0_exp[None,None,:,:]
#np.save('zz0_Hyp_actual',zz0_Hyp_actual)

t3 = time.time()
print('Time taken without GPU:'.format(t3-t2))

【问题讨论】：

MCVE，你为什么不给我们看看 zz0？我们无法运行它，也没有 x0。你应该很容易提供一个小例子，不需要我们在头脑中编造东西。理想情况下，我应该能够获取您拥有的代码，并且假设我安装了 pycuda 和 skcuda，它应该可以正常工作（此处不需要 matplot lib）。如果你没有想出我们可以复制的东西，你的问题就会被删除。感谢您的回复。我已经用我的工作测试代码更新了这个问题。我在运行你的代码后得到interruped by signal 9: SIGKILL。您为此使用了太多的内存。你能做一个不这样做的版本吗？我不认为有必要分配几 GB 的 RAM 来测试本质上是元素明智的乘法。我只是缩小了尺寸，至少能够重现结果。 【参考方案1】：

第一个问题是您的计时指标不准确。

Linalg 即时编译 cuda 模块，您可能会看到代码在运行时编译。我对您的代码进行了一些细微的修改，以减少被相乘的数组的大小，但无论如何，经过两次没有其他改进的运行后，我看到了性能的巨大提升，例如：

Time taken with GPU:2.5476348400115967
Time taken without GPU:0.16627931594848633

对

Time taken with GPU:0.8741757869720459
Time taken without GPU:0.15836167335510254

但是，这仍然比 CPU 版本慢得多。我做的下一件事是根据实际计算发生的位置给出更准确的时间。你没有在你的 numpy 版本中平铺，所以不要在你的 cuda 版本中计时：

REAL Time taken with GPU:0.6461708545684814

您还可以复制到 GPU，并将其包含在计算中，但这本身需要相当长的时间，所以让我们删除它：

t1 = time.time()
zz_Hypg = linalg.multiply(zzg,xxg)
t2 = time.time()
...
REAL Time taken with GPU:0.3689603805541992

哇，贡献很大。但是我们仍然比 numpy 版本慢？为什么？

还记得我说过 numpy 不平铺吗？它根本不会复制内存以进行广播。要获得真正的速度，您必须：

不是平铺广播尺寸在内核中实现它。

Pycuda 提供用于内核实现的实用程序，但其 GPU 阵列不提供广播。基本上你要做的就是这个（免责声明：我没有测试过这个，可能有错误，这只是为了大致展示内核应该是什么样子）：

#include <pycuda-complex.hpp>
//KERNEL CODE
constexpr unsigned work_tile_dim = 32
//instruction level parallelism factor, how much extra work to do per thread, may be changed but effects the launch dimensions. thread group size should be (tile_factor, tile_factor/ilp_factor)
constexpr unsigned ilp_factor = 4
//assuming c order: 
//    x axis contiguous out, 
//    y axis contiguous in zz, 
//    x axis contiguous in xx
// using restrict because we know that all pointers will refer to different parts of memory. 
__global__ 
void element_wise_multiplication(
    pycuda::complex<double>* __restrict__ array_zz, 
    pycuda::complex<double>* __restrict__ array_xx,
    pycuda::complex<double>* __restrict__ out_array,
    unsigned array_zz_w, /*size of w,z,y, dimensions used in zz*/
    unsigned array_zz_z,
    unsigned array_zz_xx_y,/*size of y,x, dimensions used in xx, but both have same y*/
    unsigned array_xx_x)


    // z dimensions in blocks often have restrictions on size that can be fairly small, and sometimes can cause performance issues on older cards, we are going to derive x,y,z,w index from just the x and y indicies instead. 
    unsigned x_idx = blockIdx.x * (work_tile_dim) + threadIdx.x
    unsigned y_idx = blockIdx.y * (work_tile_dim) + threadIdx.y
    //blockIdx.z stores both z and w and should not over shoot, and aren't used
    //shown for the sake of how to get these dimensions. 
    unsigned z_idx = blockIdx.z % array_zz_z;
    unsigned w_idx = blockIdx.z / array_zz_z;
    //we already know this part of the indexing calculation. 
    unsigned out_idx_zw = blockIdx.z * (array_zz_xx_y * array_xx_z);
    // since our input array is actually 3D, this is a different calcualation
    unsigned array_zz_zw = blockIdx.z * (array_zz_xx_y)
    //ensures if our launch dimensions don't exactly match our input size, we don't 
    //accidently access out of bound memory, while branching can be bad, this isn't 
    // because 99.999% of the time no branch will occur and our instruction pointer 
    //will be the same per warp, meaning virtually zero cost. 
    if(x_idx < array_xx_x)
        //moving over y axis to coalesce memory accesses in the x dimension per warp. 
        for(int i = 0; i < ilp_factor; ++i)
            //need to also check y, these checks are virtually cost-less 
            // because memory access dominates time in such simple calculations, 
            // and arithmetic will be hidden by overlapping execution 
            if((y_idx+i) < array_zz_xx_y)
                //splitting up calculation for simplicity sake
                out_array_idx = out_idx_zw+(y_idx+i)*array_xx_x + x_idx;
                array_zz_idx = array_zz_zw + (y_idx+i);
                array_xx_idx = ((y_idx+i) * array_xx_x) + x_idx;
                //actual final output. 
                out_array[out_array_idx] = array_zz[array_zz_idx] * array_xx[array_xx_idx];

您必须将启动尺寸设置为：

thread_dim = (work_tile_dim, work_tile_dim/ilp_factor) # (32,8)
y_dim = xx0.shape[0]
x_dim = xx0.shape[1]
wz_dim = zz0.shape[0] * zz0.shape[1]
block_dim = (x_dim/work_tile_dim, y_dim/work_tile_dim, wz_dim)

还有一些您可以利用的进一步优化：

将全局内存访问存储在内核内部共享内存中的工作 tile 中，这样可以确保访问 zz0s "y"，但实际上 x 维度在放入共享内存时会合并，提高性能，然后从共享内存中访问（合并无关紧要，但银行冲突确实如此）。请参阅here，了解如何处理这种银行冲突。

不是计算欧拉公式并将双精度数扩展为复数双精度数，而是在内核本身内部扩展它，使用sincos(-x, &out_sin, &out_cos) 来获得相同的结果，但使用的内存带宽更少（请参阅here） .

但请注意，即使这样做也可能不会为您提供您想要的性能（尽管仍然可能会更快），除非您使用的是具有完整双精度单元的高端 GPU，而大多数 GPU（大多数它被模拟的时间）。双精度浮点单元占用大量空间，而且由于 gpus 用于图形，因此双精度并没有太多用处。如果您想要比浮点更高的精度，但又想利用浮点硬件而不是双倍吞吐量的 1/8 到 1/32，您可以使用this answer 中使用的技术在 gpu 上实现这一点，让您接近 1/2 到 1/3 的吞吐量。

【讨论】：

以上是关于在 Numpy 数组上使用 Pycuda 的 GPU 数组乘法的主要内容，如果未能解决你的问题，请参考以下文章