Numba 无法使用完整的 GPU

Posted 2023-04-15

技术标签:

【中文标题】Numba 无法使用完整的 GPU【英文标题】：Numba failing to use the full GPU 【发布时间】：2021-12-19 22:23:06 【问题描述】：

我最近开始使用 Numba 作为大学作业的一部分，目的是比较 Numba 和 Cuda-C 之间特定的 GPU 并行代码的性能。我已经用 Cuda-C 编写了代码，它运行得很好，我什至用 Nsight 检查了 GPU 占用率。但是当我将它移到 Numba 并进行适当的调整时，我的任务管理器显示代码只使用了可用 GPU 的一小部分 (20~30%)。

我开始测试一些标准代码，例如矩阵乘法，它在 Numba 上运行良好。更有趣的是，将我的代码中的数据大小加倍不会使 GPU 占用率加倍。

有人知道如何解决这个问题吗？

附加信息：

我在 Lenovo Ideapad GAMING 3i 上使用 GTX 1650 (4GB)；通过 Anaconda 在 Spyder 5.1.5 上运行代码； Python 3.8 版； Numba 版本 0.5.4.1；我已经重新安装了 anaconda,python,numba 但结果还是一样；

下面我介绍了不起作用的内核，这是一个简单的 Reduce 算法，带有一个额外的步骤，在求和之前对所有项进行平方：

from numba.cuda.random import create_xoroshiro128p_states, xoroshiro128p_uniform_float32
import numpy as np
from numpy import float32
import random as rnd
import sys
import time

N = 1024;

n = 32;

stride = [];

for i in range(5):
    a = n//2**(i+1);   
    stride.append(a);

stride = np.array(stride,dtype = np.int32);    
#stride = np.array(stride,dtype=np.int32);


n_particles = 8*1024;
n = 32;

@cuda.jit
def sphere(d_pos,cost,n,stride):
    
    index = cuda.threadIdx.y;

    i = cuda.threadIdx.x + cuda.blockDim.x * cuda.blockIdx.x;
    
    p = cuda.blockDim.y * cuda.threadIdx.x;

    #definindo o vetor de memoria compartilhada
    #memoria máxima disponível: 48 kBytes/SM
    sharray = cuda.shared.array(N,float32);

    #por segurança, garantir que o indice está dentro dos valores permitidos
    if (index < n): 

        #movendo os dados da memoria global para a memoria compartilhada
        sharray[index + p] = d_pos[index + p + N* cuda.blockIdx.x];
        sharray[index + p] *= sharray[index + p];
        
        cuda.syncthreads();

        #algoritimo de REDUCE para calcular "cost"!
        for std in range(len(stride)): 
            
                if (index < stride[std]): 
                        sharray[index + p] += sharray[index + p + stride[std]];
            
        
        
        cuda.syncthreads();

        #retornando o valor de "cost"
        if (index % n == 0):
            cost[i] = sharray[cuda.threadIdx.x* n];
            

d_pos = cuda.to_device(np.ones((n_particles*n),dtype = np.float32));
d_vel = cuda.to_device(np.ones((n_particles*n),dtype = np.float32));

cost = cuda.to_device(np.zeros(n_particles,dtype = np.float32));

B =  int(((n*n_particles-1)/1024 +1));

t0=time.time();

for i in range(2000):     
    
    sphere[(B,1),(32,32)](d_pos,cost,n,stride);
    cuda.synchronize();
    
print(cost.copy_to_host());
print(time.time()-t0);

【问题讨论】：

因为 Numba JIT 是动态类型的，所以函数调用的参数决定了发出和运行的代码。在任何人回答您的问题之前，您至少需要提供一个完整的内核调用示例感谢@talonmies 的提示，现在可以编译了我很困惑。每个块有 8192 个粒子和 1024 个线程。那只是8个街区。少量的工作不可能充分利用您正在使用的 GPU。对于可能有 16 个 SM 并且每个 SM 能够运行 2 个块的 GPU，20-30% 听起来完全合理。那将是 8/32 = 25% 的利用率。对我来说听起来很合理任务管理器不是占用率的衡量标准。你在比较苹果和橘子。在您的 numba 案例上运行 nsys 分析器并检查占用率，就像您在 CUDA C++ 案例中所做的那样。 @talonmies 实际上块的数量是 B = 256，是的，有 8192 个粒子，但每个粒子都有 32 个分量，总共有 262,144 个 float32 元素的数据大小 【参考方案1】：

首先，让我们整理一些术语。

occupancy（在 CUDA 中）指的是内核使用 GPU 计算资源的程度。它没有“时间”的概念，可以静态检查（使用the CUDA occupancy calculator）。它不用于指代“GPU 利用率”。

utilization（在 CUDA 中）是指在某个采样间隔内，CUDA 内核在 GPU 上运行的时间百分比。它不会告诉您正在使用哪些资源

（内存利用率当然是指一个CUDA内核在特定时间使用了多少内存）

分析器可以报告占用情况。据我所知，various gpu measurements in the windows task manager 主要基于利用率。

在我看来，将来自分析器的占用率报告与利用率测量值进行比较几乎没有意义。希望现在很清楚。

现在您已经澄清，您用 CUDA C++ 编写的代码（您尚未显示）在 Windows 任务管理器显示中实现了很高的数字（我们真的不知道您指的是哪个度量，但让我们离开除此之外），而您的“等效” numba 代码没有，我们可以问为什么？

要准备测试用例，您的代码缺少导入语句：

from numba import cuda

当我在 GTX 960 GPU 上按原样分析您的代码时，我们会看到以下内容：

nvprof --print-gpu-trace python t78.py
... (some items clipped out)
5.74555s  155.52us            (256 1 1)       (32 32 1)        27  4.0000KB        0B         -           -           -           -  NVIDIA GeForce          1         7  cudapy::__main__::sphere$241(Array<float, int=1, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>, __int64, Array<int, int=1, C, mutable, aligned>) [89992]
5.74571s  1.1840us                    -               -         -         -         -       20B  16.109MB/s      Device    Pageable  NVIDIA GeForce          1         7  [CUDA memcpy DtoH]
5.74647s  1.0240us                    -               -         -         -         -       20B  18.626MB/s    Pageable      Device  NVIDIA GeForce          1         7  [CUDA memcpy HtoD]
5.74655s  157.28us            (256 1 1)       (32 32 1)        27  4.0000KB        0B         -           -           -           -  NVIDIA GeForce          1         7  cudapy::__main__::sphere$241(Array<float, int=1, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>, __int64, Array<int, int=1, C, mutable, aligned>) [90009]
5.74672s  1.1840us                    -               -         -         -         -       20B  16.109MB/s      Device    Pageable  NVIDIA GeForce          1         7  [CUDA memcpy DtoH]
5.74748s  1.0240us                    -               -         -         -         -       20B  18.626MB/s    Pageable      Device  NVIDIA GeForce          1         7  [CUDA memcpy HtoD]
5.74756s  155.01us            (256 1 1)       (32 32 1)        27  4.0000KB        0B         -           -           -           -  NVIDIA GeForce          1         7  cudapy::__main__::sphere$241(Array<float, int=1, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>, __int64, Array<int, int=1, C, mutable, aligned>) [90026]
5.74772s  1.1840us                    -               -         -         -         -       20B  16.109MB/s      Device    Pageable  NVIDIA GeForce          1         7  [CUDA memcpy DtoH]
5.74848s  1.0240us                    -               -         -         -         -       20B  18.626MB/s    Pageable      Device  NVIDIA GeForce          1         7  [CUDA memcpy HtoD]
5.74856s  156.35us            (256 1 1)       (32 32 1)        27  4.0000KB        0B         -           -           -           -  NVIDIA GeForce          1         7  cudapy::__main__::sphere$241(Array<float, int=1, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>, __int64, Array<int, int=1, C, mutable, aligned>) [90043]
5.74872s  1.1830us                    -               -         -         -         -       20B  16.123MB/s      Device    Pageable  NVIDIA GeForce          1         7  [CUDA memcpy DtoH]
5.74890s  5.6640us                    -               -         -         -         -  32.000KB  5.3880GB/s      Device    Pageable  NVIDIA GeForce          1         7  [CUDA memcpy DtoH]

Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy

以上是trace输出的尾部，我们可以做一些观察。

nvidia-smi

利用率计算为内核实际运行的时间百分比。如果内核相隔 1000us 启动，每个内核需要 150us 运行，那么利用率应该是 150/1000 = 15%，这与 nvidia-smi 报告接近。

如果我想提高利用率，我可能会做 3 件事：

stride

cuda.synchronize()

我们可以通过声明一个设备数组来简单地实现上面的步骤 1：

d_stride = cuda.to_device(stride)

在内核启动之前，修改内核启动以使用d_stride 代替stride。

如果我只执行上面的第 2 步，我看到 nvidia-smi 报告的利用率增加了大约 1%。如果我执行上述步骤 1 和 2，我会看到这个新的分析器输出（尾端）：

nvprof --print-gpu-trace python t78.py
... (some items clipped out)
2.50316s  140.93us            (256 1 1)       (32 32 1)        27  4.0000KB        0B         -           -           -           -  NVIDIA GeForce          1         7  cudapy::__main__::sphere$241(Array<float, int=1, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>, __int64, Array<int, int=1, C, mutable, aligned>) [25025]
2.50352s  141.19us            (256 1 1)       (32 32 1)        27  4.0000KB        0B         -           -           -           -  NVIDIA GeForce          1         7  cudapy::__main__::sphere$241(Array<float, int=1, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>, __int64, Array<int, int=1, C, mutable, aligned>) [25030]
2.50388s  141.95us            (256 1 1)       (32 32 1)        27  4.0000KB        0B         -           -           -           -  NVIDIA GeForce          1         7  cudapy::__main__::sphere$241(Array<float, int=1, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>, __int64, Array<int, int=1, C, mutable, aligned>) [25035]
2.50428s  142.21us            (256 1 1)       (32 32 1)        27  4.0000KB        0B         -           -           -           -  NVIDIA GeForce          1         7  cudapy::__main__::sphere$241(Array<float, int=1, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>, __int64, Array<int, int=1, C, mutable, aligned>) [25040]
2.50464s  141.28us            (256 1 1)       (32 32 1)        27  4.0000KB        0B         -           -           -           -  NVIDIA GeForce          1         7  cudapy::__main__::sphere$241(Array<float, int=1, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>, __int64, Array<int, int=1, C, mutable, aligned>) [25045]
2.50504s  142.82us            (256 1 1)       (32 32 1)        27  4.0000KB        0B         -           -           -           -  NVIDIA GeForce          1         7  cudapy::__main__::sphere$241(Array<float, int=1, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>, __int64, Array<int, int=1, C, mutable, aligned>) [25050]
2.50540s  139.78us            (256 1 1)       (32 32 1)        27  4.0000KB        0B         -           -           -           -  NVIDIA GeForce          1         7  cudapy::__main__::sphere$241(Array<float, int=1, C, mutable, aligned>, Array<float, int=1, C, mutable, aligned>, __int64, Array<int, int=1, C, mutable, aligned>) [25055]
2.50562s  6.0480us                    -               -         -         -         -  32.000KB  5.0459GB/s      Device    Pageable  NVIDIA GeForce          1         7  [CUDA memcpy DtoH]

Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy

内核执行持续时间没有太大变化（我没想到会这样），仍然是 140us 左右。然而，中间的 HtoD 和 DtoH 副本消失了，现在从一个内核启动到下一个内核启动的时间约为 360us。以这种方式计算的利用率是 140/360 = 38.9%，实际上nvidia-smi 报告此代码运行时的利用率为 39%。

对于应该背靠背的内核启动，这仍然不是 100% 的利用率。我尚未创建此代码的 C++ 版本，但根据我的经验，我相当有信心实现 90% 以上的利用率。

我们现在剩下的问题是，根据这个测试案例，我们可以在 numba 中启动两个内核的最接近的时间似乎是大约 360us，而我相信 C++ 中这个数字可能是 40us 或更少。在任何小于~140us 内核持续时间的启动开销下，内核执行可能会变成“背靠背”，转化为大约 100% 的利用率。您未显示的代码数据似乎证实了这一点。

怎么办？

即使在 CUDA C++ 中，如果内核持续时间短于启动开销，那么背靠背启动的非常短的内核仍可能无法达到 100% 的利用率。解决方案？设计足以使 GPU 饱和的内核（这会恢复占用）并且有足够的工作要做，因此内核持续时间明显长于启动开销。

还不满意？确保您使用的是最新版本的 numba 和 CUDA，如果启动开销仍然存在问题，请提交numba issue，但它们当然不能使启动开销完全消失。

（我没有在这里展示它，但是如果我们将分析开关从--print-gpu-trace 更改为--print-api-trace，我们可以收集一些相当有说服力的证据，证明 numba CUDA 正在使用的底层 CUDA API 不应归咎于这个 360us 发射开销的大部分。）

【讨论】：

非常感谢你，你真的帮助了我！这个内核是粒子群优化的一部分，正如我试图比较的那样。当我在 C 上进行实现时，我在小内核中进行了所有操作，以使其更有条理，但它似乎运行良好，所以我认为只需将其重写为 Numba 就足够了，从没想过会有更多的开销。我试图将循环放入内核中，利用率达到 100%，这证明了你的观点。我会将整个代码编写为单个内核，以最大程度地减少开销。问候亚瑟 @Robert 我不确定这里是否没有提到，或者我错过了！如何使用 nvidia-smi 估算利用率？建议：在本页顶部的搜索框中，输入文字nvidia-smi utilization，然后回车。我想你会找到有用的信息。此外，nvidia-smi 实用程序提供命令行帮助。 nvidia-smi --help

以上是关于Numba 无法使用完整的 GPU的主要内容，如果未能解决你的问题，请参考以下文章

为啥同时使用 numba.cuda 和 CuPy 从 GPU 传输数据这么慢？

如何使用 Python 和 Numba 获取 GPU 中的 CUDA 内核数量？

如何使用 python 和 numba 在 RTX GPU 中对 NVIDIA 的张量核心进行编程？

使用 numba 无法获得与 numpy 元素矩阵乘法相同的值

Python numpy：无法将 datetime64[ns] 转换为 datetime64[D]（与 Numba 一起使用）

为啥启动 Numba cuda 内核最多可使用 640 个线程，但在有大量可用 GPU 内存时却因 641 而失败？