NUMBA CUDA slower than parallel CPU even for giant matrices

Posted: 2020-11-13 19:01:02

Question:

There are only a handful of examples online of using numba with CUDA, and every one I have found is slower than the parallel-CPU approach. Vectorize with a CUDA target, and stencils, were even worse, so I tried creating a custom kernel. The one blog post you see everywhere is https://gist.github.com/mrocklin/9272bf84a8faffdbbe2cd44b4bc4ce3c. The example is a simple blur filter:

import numpy as np
import time
from numba import njit, prange,cuda
import timeit
import numba.cuda
import math


@numba.cuda.jit
def smooth_gpu(x, out):
    i, j = cuda.grid(2)
    n, m = x.shape

    if 1 <= i < n - 1 and 1 <= j < m - 1:
        out[i, j] = (x[i - 1, j - 1] + x[i - 1, j] + x[i - 1, j + 1] +
                    x[i    , j - 1] + x[i    , j] + x[i    , j + 1] +
                    x[i + 1, j - 1] + x[i + 1, j] + x[i + 1, j + 1]) / 9

x_gpu = np.ones((10000, 10000), dtype='float32')
out_gpu = np.zeros((10000, 10000), dtype='float32')

threadsperblock = (16, 16)
blockspergrid_x = math.ceil(x_gpu.shape[0] / threadsperblock[0])
blockspergrid_y = math.ceil(x_gpu.shape[1] / threadsperblock[1])
blockspergrid = (blockspergrid_x, blockspergrid_y)

# run on gpu
smooth_gpu[blockspergrid, threadsperblock](x_gpu, out_gpu) # compile before measuring time
start_time = time.time()
smooth_gpu[blockspergrid, threadsperblock](x_gpu, out_gpu)
print("GPU Time: {0:1.6f}s".format(time.time() - start_time))

And the CPU version:

x_cpu = np.ones((10000, 10000), dtype='float32')
out_cpu = np.zeros((10000, 10000), dtype='float32')


@njit(nopython=True,parallel=True)
def smooth_cpu(x, out_cpu):

    for i in prange(1,np.shape(x)[0]-1):
        for j in range(1,np.shape(x)[1]-1):
            out_cpu[i, j] =  (x[i - 1, j - 1] + x[i - 1, j] + x[i - 1, j + 1] + x[i    , j - 1] + x[i    , j] + x[i    , j + 1] +x[i + 1, j - 1] + x[i + 1, j] + x[i + 1, j + 1]) / 9

# run on cpu
smooth_cpu(x_cpu, out_cpu) # compile before measuring time
start_time = time.time()
smooth_cpu(x_cpu, out_cpu)
print("CPU Time: {0:1.6f}s".format(time.time() - start_time))

The GPU version takes about 500 ms and the CPU version about 50 ms. What is going on?

Comments:

I think if you wrote this comment, then you may be confusing the difference between cupy and numpy arrays. A cupy array lives on the GPU. Therefore, when it is used in a kernel (cupy or numba), no copying is needed or implied. But since you converted that example to numpy arrays, you have moved into a copy scenario. Therefore, performance-wise, the two implementations are not equivalent. The answer below shows how to make them roughly "equivalent" (cupy vs. numpy) when starting from numpy arrays.

You are right. I think it is also still missing some cuda.synchronize() calls, isn't it? Comparing directly now, the answer below is actually faster. I'll stay with numpy + numba for now :)

Yes, I suspect the example as written is simply measuring GPU kernel-launch overhead, but I haven't studied it carefully. I also don't know what GPU is being used there; it doesn't seem to be mentioned, though perhaps it is in an earlier blog post somewhere. Anyway, I haven't studied it closely. timeit may actually be smearing together several different behaviors. To carefully time only the kernel execution in cupy or numba, I suggest the approach I point out below: use device-resident arrays, and carefully bracket the timed region with cuda.synchronize().

Answer 1:

I would point out two things:

    1. You are including, in your timing of the GPU version, the time needed to transfer the input array from host to device and the result from device to host. If that is the point of your comparison, so be it; the conclusion then is that the GPU is not well suited to this task (in any interesting way).

    2. The GPU code, while giving a correct result, is not organized for good performance. The problem lies here:

    i, j = cuda.grid(2)
    

    combined with the order in which those indices are used to access the data:

    out[i, j] = (x[i - 1, j - 1] ...
    

    This results in inefficient memory access by the GPU. We can fix it by reversing either one of the two orderings described above.

Taking both of the above issues into account, here is your code slightly adjusted:

$ cat t29a.py
import numpy as np
import time
from numba import njit, prange,cuda
import timeit
import numba.cuda


x_cpu = np.ones((10000, 10000), dtype='float32')
out_cpu = np.zeros((10000, 10000), dtype='float32')


@njit(parallel=True)
def smooth_cpu(x, out_cpu):

    for i in prange(1,x.shape[0]-1):
        for j in range(1,x.shape[1]-1):
            out_cpu[i, j] =  (x[i - 1, j - 1] + x[i - 1, j] + x[i - 1, j + 1] + x[i    , j - 1] + x[i    , j] + x[i    , j + 1] +x[i + 1, j - 1] + x[i + 1, j] + x[i + 1, j + 1]) / 9

# run on cpu
smooth_cpu(x_cpu, out_cpu) # compile before measuring time
start_time = time.time()
smooth_cpu(x_cpu, out_cpu)
print("CPU Time: {0:1.6f}s".format(time.time() - start_time))
$ python t29a.py
CPU Time: 0.161944s

$ cat t29.py
import numpy as np
import time
from numba import njit, prange,cuda
import timeit
import numba.cuda
import math

@numba.cuda.jit
def smooth_gpu(x, out):
    j, i = cuda.grid(2)
    m, n = x.shape

    if 1 <= i < n - 1 and 1 <= j < m - 1:
        out[i, j] = (x[i - 1, j - 1] + x[i - 1, j] + x[i - 1, j + 1] +
                    x[i    , j - 1] + x[i    , j] + x[i    , j + 1] +
                    x[i + 1, j - 1] + x[i + 1, j] + x[i + 1, j + 1]) / 9

x = np.ones((10000, 10000), dtype='float32')
out = np.zeros((10000, 10000), dtype='float32')
x_gpu = cuda.to_device(x)
out_gpu = cuda.device_array_like(out)
threadsperblock = (16, 16)
blockspergrid_x = math.ceil(x_gpu.shape[0] / threadsperblock[0])
blockspergrid_y = math.ceil(x_gpu.shape[1] / threadsperblock[1])
blockspergrid = (blockspergrid_x, blockspergrid_y)

# run on gpu
smooth_gpu[blockspergrid, threadsperblock](x_gpu, out_gpu) # compile before measuring time
cuda.synchronize()
start_time = time.time()
smooth_gpu[blockspergrid, threadsperblock](x_gpu, out_gpu)
cuda.synchronize()
print("GPU Time: {0:1.6f}s".format(time.time() - start_time))
$ python t29.py
GPU Time: 0.021776s
$

So we see that if we adjust for the two issues pointed out, the GPU (a GTX 960 in my case) is about 8x faster than the CPU. Such measurements depend to some degree on the CPU and GPU used for the comparison; you shouldn't assume my measurements are comparable to yours, and you are better off running these modified codes yourself for the comparison. However, the data transfer time certainly dwarfs the GPU computation time, and in my case it also exceeds the CPU computation time. This means (at least in my case, and mine is not a particularly fast system in any respect) that even if we reduced the GPU computation time to zero, the cost of transferring the data would still exceed the CPU computation time.
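To see where the time goes, the copies can be timed separately from the kernel; here is a minimal sketch along those lines (not part of the timed code above), using the same 10000x10000 float32 array:

import time
import numpy as np
from numba import cuda

x = np.ones((10000, 10000), dtype='float32')

t0 = time.time()
x_gpu = cuda.to_device(x)               # host -> device copy
cuda.synchronize()
t1 = time.time()

out_gpu = cuda.device_array_like(x)
# ... launch smooth_gpu[blockspergrid, threadsperblock](x_gpu, out_gpu) here,
#     bracketed by cuda.synchronize() calls, to time just the kernel ...

t2 = time.time()
out = out_gpu.copy_to_host()            # device -> host copy
cuda.synchronize()
t3 = time.time()

print("H2D: {0:1.6f}s  D2H: {1:1.6f}s".format(t1 - t0, t3 - t2))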

So when that is your situation, you can't win. The only advice that can be given at that point is "don't do that", i.e. find a more interesting and more involved problem for the GPU to solve. If we make the problem computationally very simple, like this one, or like vector add, and that is the only thing you want to do on the GPU, the comparison against doing it on the CPU is almost never an interesting one. Hopefully you can see that making the matrices bigger doesn't help much here, because it also drives up the data transfer time/cost.

If we exclude the data transfer cost (and don't make gross performance mistakes in our GPU code), then according to my testing the GPU is faster than the CPU. If we include the data transfer cost, for this very simple problem it is quite possible that the GPU cannot be faster than the CPU (even if the GPU computation time were reduced to zero).

No doubt more could be done to improve the GPU case a bit (e.g. changing the block shape, using shared memory, etc.), but personally I don't wish to spend my time polishing something uninteresting.
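As a rough illustration of the shared-memory idea (a sketch only; untested, and assuming the same 16x16 block shape as above), each block could stage an 18x18 tile of the input, the block plus a one-element halo, in shared memory, so that every input element is read from global memory once per block instead of up to nine times:

import math
import numpy as np
from numba import cuda, float32

TPB = 16            # threads per block in each axis
TILE = TPB + 2      # block plus a one-element halo on every side

@cuda.jit
def smooth_gpu_shared(x, out):
    tile = cuda.shared.array(shape=(TILE, TILE), dtype=float32)

    j, i = cuda.grid(2)                 # same reversed index order as above
    tj = cuda.threadIdx.x
    ti = cuda.threadIdx.y
    n, m = x.shape

    # global coordinates of the tile's top-left (halo) corner
    i0 = cuda.blockIdx.y * cuda.blockDim.y - 1
    j0 = cuda.blockIdx.x * cuda.blockDim.x - 1

    # cooperatively fill the tile; threads on the block edges load twice
    for ii in range(ti, TILE, cuda.blockDim.y):
        for jj in range(tj, TILE, cuda.blockDim.x):
            gi = i0 + ii
            gj = j0 + jj
            if 0 <= gi < n and 0 <= gj < m:
                tile[ii, jj] = x[gi, gj]
            else:
                tile[ii, jj] = 0.0
    cuda.syncthreads()

    if 1 <= i < n - 1 and 1 <= j < m - 1:
        # this thread's own element sits at tile[ti + 1, tj + 1]
        out[i, j] = (tile[ti,     tj] + tile[ti,     tj + 1] + tile[ti,     tj + 2] +
                     tile[ti + 1, tj] + tile[ti + 1, tj + 1] + tile[ti + 1, tj + 2] +
                     tile[ti + 2, tj] + tile[ti + 2, tj + 1] + tile[ti + 2, tj + 2]) / 9

x = np.ones((10000, 10000), dtype='float32')
x_gpu = cuda.to_device(x)
out_gpu = cuda.device_array_like(x)
threadsperblock = (TPB, TPB)
blockspergrid = (math.ceil(x.shape[1] / TPB), math.ceil(x.shape[0] / TPB))
smooth_gpu_shared[blockspergrid, threadsperblock](x_gpu, out_gpu)
cuda.synchronize()

Whether this actually beats the simple kernel would have to be measured; for a 3x3 stencil the GPU caches may already absorb much of the redundant traffic.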

You can get some further description of Numba GPU memory management here.

A general description of the memory-efficiency issue related to index ordering is here.

Comments:

Reading the inline comments, I'm a bit confused about why "compile before measuring time" makes sense, and why that is actually any different from just calling the function. Is the cost of compiling the function a one-time cost, on the expectation that the function will be called many times?

@MackieMesser Compilation is an operation that happens the first time a JIT-decorated function (cpu or gpu) is called; it gets compiled, and the compiled code is cached in memory for the lifetime of the program. I'm not sure whether this cached compilation is specific to the data types used in the function signature at compile time. That could mean that if you run the function once on int8 and then run it on int16, it may have to compile for each flavor of signature (I assume both would still be cached).

@RobertCrovella A quick question about where you call smooth_gpu[blockspergrid, threadsperblock](x_gpu, out_gpu) to pre-compile the kernel before timing the execution. Could you equivalently call something like compiled = smooth_gpu[blockspergrid, threadsperblock] to take care of the compilation without having to wait for a full function execution? Then for timing you should be able to simply call compiled(x_gpu, out_gpu), still with the same grid and block dimensions.

Nominally that appears to work, but I doubt it actually does the compilation ahead of time. Your construction suggests there is no way to infer the argument types at "compile time", so I doubt it is doing what you want. I don't want to hash it out further here in the comments; if you have a new question, please ask a new question.
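For what it's worth, numba also supports eager compilation: if an explicit signature is passed to the decorator, the kernel is compiled when the decorator runs rather than at the first call. A minimal sketch (assuming float32 2-D arrays; this is not code from the thread):

import numpy as np
from numba import cuda

# the explicit signature makes numba compile the kernel at decoration time,
# so the first launch no longer includes JIT compilation
@cuda.jit('void(float32[:,:], float32[:,:])')
def smooth_gpu_eager(x, out):
    j, i = cuda.grid(2)
    n, m = x.shape
    if 1 <= i < n - 1 and 1 <= j < m - 1:
        out[i, j] = (x[i - 1, j - 1] + x[i - 1, j] + x[i - 1, j + 1] +
                     x[i,     j - 1] + x[i,     j] + x[i,     j + 1] +
                     x[i + 1, j - 1] + x[i + 1, j] + x[i + 1, j + 1]) / 9

x_gpu = cuda.to_device(np.ones((10000, 10000), dtype='float32'))
out_gpu = cuda.device_array((10000, 10000), dtype=np.float32)
smooth_gpu_eager[(625, 625), (16, 16)](x_gpu, out_gpu)   # no compile pause here
cuda.synchronize()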

Answer 2:

I found this comparison interesting, and I wanted to look into the impact of reusing the compiled kernel, of cuda streams, and of randomized data, to make sure no fancy compiler optimization was distorting what we see.

I modified the code sample posted by Robert Crovella and ran the script on a run-of-the-mill ML rig at school:

Code:

import numpy as np
from time import perf_counter
from numba import njit, prange,cuda

# cpuinfo is a third party package from here:
#   https://github.com/workhorsy/py-cpuinfo
# or you can just install it using pip with:
#   python -m pip install -U py-cpuinfo
from cpuinfo import get_cpu_info

print("Some diagnostic info for the system running this script:")
# prints information about the cuda GPU
cuda.detect()
print()
# Prints a json string describing the cpu
s = get_cpu_info()
print("Cpu info")
for k,v in s.items():
    print(f"\t{k}: {v}")
print()

cpu_s1 = "CPU execution time:"
cpu_s2 = "CPU full setup/execution time:"
gpu_s1 = "GPU kernel execution time:"
gpu_s2 = "GPU full kernel setup/execution time:"
l = len(gpu_s2) + 1
# using randomized floats to ensure there isn't some compiler optimization that
# recognizes that all values of the x array are constant 1's and does something
# goofy under the hood. Each timing scenario will then use a copy of this array.
common_x = np.random.random((10000, 10000)).astype(np.float32)

def time_njit(n_loops=2):
    start_time_full_function = perf_counter()

    @njit(parallel=True,nogil=True)
    def smooth_cpu(x, out):
        h,w = x.shape
        for i in prange(1,h-1):
            for j in range(1,w-1):
                out[i, j] =  (x[i - 1, j - 1] + x[i - 1, j] +
                                  x[i - 1, j + 1] + x[i    , j - 1] +
                                  x[i    , j]     + x[i    , j + 1] +
                                  x[i + 1, j - 1] + x[i + 1, j] +
                                  x[i + 1, j + 1]) / 9


    pre_x = np.ones((10,10),dtype=common_x.dtype)
    pre_out = np.ones((10,10),dtype=common_x.dtype)
    _x = common_x.copy()
    _out = np.zeros_like(_x)
    # run on cpu
    smooth_cpu(pre_x, pre_out) # compile before measuring time
    start_time = perf_counter()
    for _ in range(n_loops):
        # realistically, we wouldn't typically run just a single blurring pass
        smooth_cpu(_x, _out)
        smooth_cpu(_out,_x)
    end_time = perf_counter()
    end_time_full_function = perf_counter()
    print(f"{cpu_s1:<{l}} {end_time - start_time:1.6f}s running {n_loops} loops"
          f"\n{cpu_s2:<{l}} {end_time_full_function - start_time_full_function:1.6f}s")
    return _x



def time_cuda(n_loops=2):
    """There is room for optimization in how we use cuda.shared.array memory on the GPU
    -- where I'm not aware of any analogues tricks for the cpu function -- that would
    allow us to minimize the number of times each thread-block needs to access data in
    the GPU's global memory. But such an implementation would take us deeper into the
    weeds than this toy problem calls for.

    Maybe if I need to take a break from my other work later I'll come back to this
    and flesh out an example of what I mean.
    """
    start_time_full_function = perf_counter()
    @cuda.jit
    def smooth_gpu(x, out):
        """slight change to the cuda kernel. This version uses **striding** to reduce
        processor overhead spent allocating and deallocating a lot of thread blocks
        that ultimately have each thread compute a single calculation before being
        disposed of.

        This way we offset some of the overhead cost spent on block allocation by
         making each block do a bit more work.

        Note: For this to work right, we have to allocate fewer blocks with
              our `blockspergrid_j` and `blockspergrid_i` variables.
        """
        jstart, istart = cuda.grid(2)
        jstep, istep = cuda.gridsize(2)
        rows,cols = x.shape
        # note that for strided kernels, thread indices
        # are completely independent of the data size/shape
        for i in range(istart+1,rows-1,istep):
            for j in range(jstart+1,cols-1,jstep):
                # Because x and out use numpy's default (C-contiguous) row-major
                # memory ordering, we want to make sure the most frequently
                # changing index (j) is iterating through the last dimension
                # of the array.
                out[i,j] = (x[i - 1, j - 1] + x[i - 1, j] + x[i - 1, j + 1] +
                            x[i    , j - 1] + x[i    , j] + x[i    , j + 1] +
                            x[i + 1, j - 1] + x[i + 1, j] + x[i + 1, j + 1]) / 9

    _x = common_x.copy()
    _out = np.zeros_like(_x)
    stream = cuda.stream()
    x_gpu = cuda.to_device(_x,stream)
    out_gpu = cuda.to_device(_out,stream)
    tpbj = 16
    tpbi = 16
    threadsperblock = tpbj,tpbi
    blockspergrid_j = (_x.shape[0]+tpbj-1) // tpbj
    blockspergrid_i = (_x.shape[1]+tpbi-1) // tpbi
    # reduce the number of blocks in each axis
    # by a quarter to give room for striding
    blockspergrid = (blockspergrid_j//4, blockspergrid_i//4)
    # run on gpu
    compiled = smooth_gpu[blockspergrid, threadsperblock, stream] # compile before measuring time

    start_time = perf_counter()
    for _ in range(n_loops):
        # realistically, we wouldn't typically run just a single blurring pass
        compiled(x_gpu, out_gpu)
        compiled(out_gpu,x_gpu)
    x_gpu.copy_to_host(_out,stream)
    stream.synchronize()
    end_time = perf_counter()
    end_time_full_function = perf_counter()
    print(f"{gpu_s1:<{l}} {end_time - start_time:1.6f}s running {n_loops} loops"
          f"\n{gpu_s2:<{l}} {end_time_full_function - start_time_full_function:1.6f}s")
    return _out

if __name__ == '__main__':
    a = time_njit(1)
    b = time_cuda(1)
    assert np.allclose(a,b),"The two functions didn't actually compute the same results"
    print(f"{'    ' * 4}Outputs are equivalent")
    a = time_njit(5)
    b = time_cuda(5)
    assert np.allclose(a,b),"The two functions didn't actually compute the same results"
    print(f"{'    ' * 4}Results are equivalent")
    a = time_njit(10)
    b = time_cuda(10)
    assert np.allclose(a,b),"The two functions didn't actually compute the same results"
    print(f"{'    ' * 4}Results are equivalent")
    a = time_njit(20)
    b = time_cuda(20)
    assert np.allclose(a,b),"The two functions didn't actually compute the same results"
    print(f"{'    ' * 4}Results are equivalent")

Output:

Some diagnostic info for the system running this script:

Found 1 CUDA devices
id 0    b'GeForce RTX 2080 Ti'                              [SUPPORTED]
                      compute capability: 7.5
                           pci device id: 0
                              pci bus id: 1
Summary:
    1/1 devices are supported

Cpu info:
    python_version: 3.8.8.final.0 (64 bit)
    cpuinfo_version: [7, 0, 0]
    cpuinfo_version_string: 7.0.0
    arch: X86_64
    bits: 64
    count: 8
    arch_string_raw: AMD64
    vendor_id_raw: GenuineIntel
    brand_raw: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
    hz_advertised_friendly: 4.0000 GHz
    hz_actual_friendly: 4.0010 GHz
    hz_advertised: [4000000000, 0]
    hz_actual: [4001000000, 0]
    l2_cache_size: 1048576
    stepping: 3
    model: 60
    family: 6
    l3_cache_size: 8388608
    flags: ['3dnow', 'abm', 'acpi', 'aes', 'apic', 'avx', 'avx2', 'bmi1', 'bmi2', 'clflush', 'cmov', 'cx16', 'cx8', 'de', 'dts', 'erms', 'est', 'f16c', 'fma', 'fpu', 'fxsr', 'ht', 'hypervisor', 'ia64', 'invpcid', 'lahf_lm', 'mca', 'mce', 'mmx', 'movbe', 'msr', 'mtrr', 'osxsave', 'pae', 'pat', 'pbe', 'pcid', 'pclmulqdq', 'pdcm', 'pge', 'pni', 'popcnt', 'pse', 'pse36', 'rdrnd', 'sep', 'serial', 'smep', 'ss', 'sse', 'sse2', 'sse4_1', 'sse4_2', 'ssse3', 'tm', 'tm2', 'tsc', 'vme', 'xsave', 'xtpr']
    l2_cache_line_size: 256
    l2_cache_associativity: 6

                Time comparisons for CPU vs GPU implementations:

CPU execution time:                    0.327143s running 1 loops
CPU full setup/execution time:         0.980959s
GPU kernel execution time:             0.088015s running 1 loops
GPU full kernel setup/execution time:  0.868085s
                Outputs are equivalent
CPU execution time:                    1.539007s running 5 loops
CPU full setup/execution time:         2.134781s
GPU kernel execution time:             0.097627s running 5 loops
GPU full kernel setup/execution time:  0.695104s
                Outputs are equivalent
CPU execution time:                    3.463488s running 10 loops
CPU full setup/execution time:         4.310506s
GPU kernel execution time:             0.122363s running 10 loops
GPU full kernel setup/execution time:  0.655500s
                Outputs are equivalent
CPU execution time:                    6.416840s running 20 loops
CPU full setup/execution time:         7.011254s
GPU kernel execution time:             0.158903s running 20 loops
GPU full kernel setup/execution time:  0.723226s
                Outputs are equivalent
CPU execution time:                    9.285086s running 30 loops
CPU full setup/execution time:         9.890282s
GPU kernel execution time:             0.209807s running 30 loops
GPU full kernel setup/execution time:  0.728618s
                Outputs are equivalent
CPU execution time:                    12.610949s running 40 loops
CPU full setup/execution time:         13.177427s
GPU kernel execution time:             0.253696s running 40 loops
GPU full kernel setup/execution time:  0.836536s
                Outputs are equivalent
CPU execution time:                    15.376767s running 50 loops
CPU full setup/execution time:         15.976361s
GPU kernel execution time:             0.289626s running 50 loops
GPU full kernel setup/execution time:  0.841918s
                Outputs are equivalent

Process finished with exit code 0

Honestly, these results both match and defy my expectations. I had expected that, at least for the single-loop call, the CPU implementation would beat the GPU, but that wasn't the case. :v/ That the CPU's time cost appears to grow linearly with the number of loops was expected, though.

As for the GPU performance, I don't really know why the time cost of increasing the loop count appears to grow logarithmically (I'd have to plot the data points to see it more clearly).

In any case, the results you see will vary with your machine, but I'm curious how far down the CUDA compute-capability ladder the GPU results would have to drop before they merely matched the CPU.

Comments:
