为啥这个 numba 代码比 numpy 代码慢 6 倍？

Posted 2023-02-16

技术标签:

【中文标题】为啥这个 numba 代码比 numpy 代码慢 6 倍？【英文标题】：Why this numba code is 6x slower than numpy code?为什么这个 numba 代码比 numpy 代码慢 6 倍？ 【发布时间】：2018-11-12 12:11:56 【问题描述】：

下面的代码运行2s有什么原因吗，

def euclidean_distance_square(x1, x2):
    return -2*np.dot(x1, x2.T) + np.expand_dims(np.sum(np.square(x1), axis=1), axis=1) + np.sum(np.square(x2), axis=1)

而以下 numba 代码在 12 秒内运行？

@jit(nopython=True)
def euclidean_distance_square(x1, x2):
   return -2*np.dot(x1, x2.T) + np.expand_dims(np.sum(np.square(x1), axis=1), axis=1) + np.sum(np.square(x2), axis=1)

我的 x1 是一个维度为 (1, 512) 的矩阵，x2 是一个维度为 (3000000, 512) 的矩阵。 numba 可以慢得多，这很奇怪。是不是我用错了？

我真的需要加快这个速度，因为我需要运行这个函数 300 万次，而 2s 仍然太慢了。

我需要在 CPU 上运行它，因为您可以看到 x2 的尺寸非常大，它无法加载到 GPU（或至少我的 GPU）上，内存不足。

【问题讨论】：

这可能是系统配置的问题（例如，您的 numpy 通过使用 OpenCL 来利用您的 GPGPU） @BasileStarynkevitch 由于内存问题，无法在 GPU 上运行。 numba 不应该在 CPU 上加速吗？ Numba 文档指出它是纯 python 而 numpy 使用大量 C，我猜这是最大的效率差异 @OferSadan 所以 Numba 只加速非 numpy 代码？文档似乎建议它也应该加速 numpy 代码。您对我如何加快速度有什么建议吗？ @MSeifert 好的。我在这里转贴：***.com/questions/50675705/…. 【参考方案1】：

这是对@MSeifert 答案的评论。还有更多的东西可以获得性能。与每个数字代码一样，建议考虑哪种数据类型对您的问题足够精确。通常 float32 也足够了，有时甚至 float64 也不够。

我还想在这里提一下 fastmath 关键字，它可以在这里再提高 1.7 倍的速度。

[编辑]

对于一个简单的求和，我查看了 LLVM 代码，发现求和在矢量化上被拆分为部分求和。（使用 AVX2 的双精度数为 4 个部分总和，浮点数为 8 个）。这需要进一步调查。

代码

import llvmlite.binding as llvm
llvm.set_option('', '--debug-only=loop-vectorize')

@nb.njit
def euclidean_distance_square_numba_v3(x1, x2):
    res = np.empty(x2.shape[0], dtype=x2.dtype)
    for o_idx in range(x2.shape[0]):
        val = 0
        for i_idx in range(x2.shape[1]):
            tmp = x1[0, i_idx] - x2[o_idx, i_idx]
            val += tmp * tmp
        res[o_idx] = val
    return res

@nb.njit(fastmath=True)
def euclidean_distance_square_numba_v4(x1, x2):
    res = np.empty(x2.shape[0], dtype=x2.dtype)
    for o_idx in range(x2.shape[0]):
        val = 0.
        for i_idx in range(x2.shape[1]):
            tmp = x1[0, i_idx] - x2[o_idx, i_idx]
            val += tmp * tmp
        res[o_idx] = val
    return res

@nb.njit(fastmath=True,parallel=True)
def euclidean_distance_square_numba_v5(x1, x2):
    res = np.empty(x2.shape[0], dtype=x2.dtype)
    for o_idx in nb.prange(x2.shape[0]):
        val = 0.
        for i_idx in range(x2.shape[1]):
            tmp = x1[0, i_idx] - x2[o_idx, i_idx]
            val += tmp * tmp
        res[o_idx] = val
    return res

时间

float64
x1 = np.random.random((1, 512))
x2 = np.random.random((1000000, 512))

0.42 v3 @MSeifert
0.25 v4
0.18 v5 parallel-version
0.48 distance.cdist

float32
x1 = np.random.random((1, 512)).astype(np.float32)
x2 = np.random.random((1000000, 512)).astype(np.float32)

0.09 v5

如何显式声明类型

一般来说，我不会推荐这个。您的输入数组可以是 C 连续的（作为测试数据）Fortran 连续的或跨步的。如果您知道您的数据始终是 C-contiguos，您可以编写

@nb.njit('double[:](double[:, ::1],double[:, ::1])',fastmath=True)
def euclidean_distance_square_numba_v6(x1, x2):
    res = np.empty(x2.shape[0], dtype=x2.dtype)
    for o_idx in range(x2.shape[0]):
        val = 0.
        for i_idx in range(x2.shape[1]):
            tmp = x1[0, i_idx] - x2[o_idx, i_idx]
            val += tmp * tmp
        res[o_idx] = val
    return res

这提供了与 v4 版本相同的性能，但如果输入数组不是 C-contigous 或不是 dtype=np.float64，则会失败。

你也可以使用

@nb.njit('double[:](double[:, :],double[:, :])',fastmath=True)
def euclidean_distance_square_numba_v7(x1, x2):
    res = np.empty(x2.shape[0], dtype=x2.dtype)
    for o_idx in range(x2.shape[0]):
        val = 0.
        for i_idx in range(x2.shape[1]):
            tmp = x1[0, i_idx] - x2[o_idx, i_idx]
            val += tmp * tmp
        res[o_idx] = val
    return res

这也适用于跨步数组，但会比上面的 C 连续数组版本慢得多。（0.66s 与 0.25s）。另请注意，您的问题受到内存带宽的限制。 CPU 密集型计算的差异可能更大。

如果您让 Numba 为您完成这项工作，它将自动检测数组是否是连续的（在第一次尝试时提供连续的输入数据，而不是非连续的数据，将导致重新编译）

【讨论】：

你的回答有错字吗？你的 float32 时间比 float64 慢？ Numpy 默认为 float64。所以当你不给它一个dtype时，它是float64而不是32 对不起，我复制代码时出错了……float32版本比float64版本快两倍。关于fastmath 的好点 - 但是我不愿意说它提高了精度。这在很大程度上取决于特定的操作，并且通常会降低精度（至少与 IEEE 754 数学相比）。我还测试了并行，它实际上有点慢（因为它的内存带宽有限）所以我发现它在你的测试中更快真的很有趣。可能是因为 fastmath 或者缓存速度不同？出于好奇：您是如何进行基准测试的？还有%timeit? @max9111 我更新了帖子。我稍微修改了代码，以便它可以处理 (m, n) 维 x1。不确定我是否做得正确。能帮忙验证一下吗？还是有点慢。【参考方案2】：

numba 可以这么慢是很奇怪的。

这并不奇怪。当您在 numba 函数中调用 NumPy 函数时，您将调用这些函数的 numba 版本。这些可以更快、更慢或与 NumPy 版本一样快。您可能很幸运，也可能很不幸（您很不幸！）。但即使在 numba 函数中，您仍然会创建许多临时数组，因为您使用 NumPy 函数（一个临时数组用于点结果，一个用于每个平方和总和，一个用于点加第一个总和），因此您不会利用numba 的可能性。

我是不是用错了？

基本上：是的。

我真的需要加快速度

好的，我试试看。

让我们从沿轴 1 调用展开平方和开始：

import numba as nb

@nb.njit
def sum_squares_2d_array_along_axis1(arr):
    res = np.empty(arr.shape[0], dtype=arr.dtype)
    for o_idx in range(arr.shape[0]):
        sum_ = 0
        for i_idx in range(arr.shape[1]):
            sum_ += arr[o_idx, i_idx] * arr[o_idx, i_idx]
        res[o_idx] = sum_
    return res


@nb.njit
def euclidean_distance_square_numba_v1(x1, x2):
    return -2 * np.dot(x1, x2.T) + np.expand_dims(sum_squares_2d_array_along_axis1(x1), axis=1) + sum_squares_2d_array_along_axis1(x2)

在我的计算机上，它已经比 NumPy 代码快 2 倍，比原始 Numba 代码快近 10 倍。

从经验上来说，让它比 NumPy 快 2 倍通常是极限（至少在 NumPy 版本不是不必要的复杂或低效的情况下），但是您可以通过展开所有内容来挤出更多内容：

import numba as nb

@nb.njit
def euclidean_distance_square_numba_v2(x1, x2):
    f1 = 0.
    for i_idx in range(x1.shape[1]):
        f1 += x1[0, i_idx] * x1[0, i_idx]

    res = np.empty(x2.shape[0], dtype=x2.dtype)
    for o_idx in range(x2.shape[0]):
        val = 0
        for i_idx in range(x2.shape[1]):
            val_from_x2 = x2[o_idx, i_idx]
            val += (-2) * x1[0, i_idx] * val_from_x2 + val_from_x2 * val_from_x2
        val += f1
        res[o_idx] = val
    return res

但这仅比最新方法提高了约 10-20%。

此时您可能会意识到您可以简化代码（即使它可能不会加快速度）：

import numba as nb

@nb.njit
def euclidean_distance_square_numba_v3(x1, x2):
    res = np.empty(x2.shape[0], dtype=x2.dtype)
    for o_idx in range(x2.shape[0]):
        val = 0
        for i_idx in range(x2.shape[1]):
            tmp = x1[0, i_idx] - x2[o_idx, i_idx]
            val += tmp * tmp
        res[o_idx] = val
    return res

是的，这看起来很简单，而且速度并不慢。

然而，在所有的兴奋中，我忘了提及明显解决方案：scipy.spatial.distance.cdist，它有一个 sqeuclidean（平方欧几里得距离）选项：

from scipy.spatial import distance
distance.cdist(x1, x2, metric='sqeuclidean')

它并不比 numba 快，但无需编写自己的函数即可使用...

测试

测试正确性并进行热身：

x1 = np.array([[1.,2,3]])
x2 = np.array([[1.,2,3], [2,3,4], [3,4,5], [4,5,6], [5,6,7]])

res1 = euclidean_distance_square(x1, x2)
res2 = euclidean_distance_square_numba_original(x1, x2)
res3 = euclidean_distance_square_numba_v1(x1, x2)
res4 = euclidean_distance_square_numba_v2(x1, x2)
res5 = euclidean_distance_square_numba_v3(x1, x2)
np.testing.assert_array_equal(res1, res2)
np.testing.assert_array_equal(res1, res3)
np.testing.assert_array_equal(res1[0], res4)
np.testing.assert_array_equal(res1[0], res5)
np.testing.assert_almost_equal(res1, distance.cdist(x1, x2, metric='sqeuclidean'))

时间安排：

x1 = np.random.random((1, 512))
x2 = np.random.random((1000000, 512))

%timeit euclidean_distance_square(x1, x2)
# 2.09 s ± 54.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit euclidean_distance_square_numba_original(x1, x2)
# 10.9 s ± 158 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit euclidean_distance_square_numba_v1(x1, x2)
# 907 ms ± 7.11 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit euclidean_distance_square_numba_v2(x1, x2)
# 715 ms ± 15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit euclidean_distance_square_numba_v3(x1, x2)
# 731 ms ± 34.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit distance.cdist(x1, x2, metric='sqeuclidean')
# 706 ms ± 4.99 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

注意：如果您有整数数组，您可能希望将 numba 函数中的硬编码 0.0 更改为 0。

【讨论】：

嗯...奇怪的是，我的 scipy 距离函数在我的测试中实际上慢了 2 倍，大约 4 秒。我可以知道您是否使用特殊选项编译 scipy？ @user2675516 你的数组有什么数据类型？对于某些 dtypes，scipy 函数可能会慢一些 - 但这只是一个猜测。也可能是您使用的是旧版本的 scipy。我认为你不能（或不应该）重新编译 scipy.这有点棘手......但如果你真的想要这里是the official instructions。我找到了罪魁祸首，我使用的是 float32，但 scipy.distance.cdist 的速度很慢。仅在 float64 上速度很快 @user2675516 是的，我怀疑是这样的。我认为在 scipy 错误跟踪器上打开一个问题可能是值得的。【参考方案3】：

尽管@MSeifert 的答案使这个答案相当过时，但我仍在发布它，因为它更详细地解释了为什么 numba 版本比 numpy 版本慢。

正如我们将看到的，罪魁祸首是 numpy 和 numba 的不同内存访问模式。

我们可以用更简单的函数重现该行为：

import numpy as np
import numba as nb

def just_sum(x2):
    return np.sum(x2, axis=1)

@nb.jit('double[:](double[:, :])', nopython=True)
def nb_just_sum(x2):
    return np.sum(x2, axis=1)

x2=np.random.random((2048,2048))

现在是时间安排：

>>> %timeit just_sum(x)
2.33 ms ± 71.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit nb_just_sum(x)
33.7 ms ± 296 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

这意味着 numpy 大约快 15 倍！

在编译带有注释的 numba 代码时（例如numba --annotate-html sum.html numba_sum.py），我们可以看到，numba 是如何执行求和的（参见附录中求和的完整列表）：

初始化结果列将整个第一列添加到结果列中将整个第二列添加到结果列等等

这种方法有什么问题？内存布局！该数组以行优先顺序存储，因此按列读取它会导致比按行读取更多的缓存未命中（这是 numpy 所做的）。有a great article 解释了可能的缓存效果。

正如我们所见，numba 的 sum-implementation 还不是很成熟。但是，从上述考虑来看，numba 实现对于列主要顺序（即转置矩阵）可能具有竞争力：

>>> %timeit just_sum(x.T)
3.09 ms ± 66.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit nb_just_sum(x.T)
3.58 ms ± 45.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

确实如此。

正如@MSeifert 的代码所示，numba 的主要优点是，在它的帮助下，我们可以减少临时 numpy 数组的数量。然而，一些看起来很容易的事情根本就不容易，一个天真的解决方案可能会很糟糕。建立一个总和就是这样一种操作——人们不应该认为一个简单的循环就足够了——例如参见this question。

列出numba-summation：

 Function name: array_sum_impl_axis
in file: /home/ed/anaconda3/lib/python3.6/site-packages/numba/targets/arraymath.py
with signature: (array(float64, 2d, A), int64) -> array(float64, 1d, C)
show numba IR
194:    def array_sum_impl_axis(arr, axis):
195:        ndim = arr.ndim
196:    
197:        if not is_axis_const:
198:            # Catch where axis is negative or greater than 3.
199:            if axis < 0 or axis > 3:
200:                raise ValueError("Numba does not support sum with axis"
201:                                 "parameter outside the range 0 to 3.")
202:    
203:        # Catch the case where the user misspecifies the axis to be
204:        # more than the number of the array's dimensions.
205:        if axis >= ndim:
206:            raise ValueError("axis is out of bounds for array")
207:    
208:        # Convert the shape of the input array to a list.
209:        ashape = list(arr.shape)
210:        # Get the length of the axis dimension.
211:        axis_len = ashape[axis]
212:        # Remove the axis dimension from the list of dimensional lengths.
213:        ashape.pop(axis)
214:        # Convert this shape list back to a tuple using above intrinsic.
215:        ashape_without_axis = _create_tuple_result_shape(ashape, arr.shape)
216:        # Tuple needed here to create output array with correct size.
217:        result = np.full(ashape_without_axis, zero, type(zero))
218:    
219:        # Iterate through the axis dimension.
220:        for axis_index in range(axis_len):
221:            if is_axis_const:
222:                # constant specialized version works for any valid axis value
223:                index_tuple_generic = _gen_index_tuple(arr.shape, axis_index,
224:                                                       const_axis_val)
225:                result += arr[index_tuple_generic]
226:            else:
227:                # Generate a tuple used to index the input array.
228:                # The tuple is ":" in all dimensions except the axis
229:                # dimension where it is "axis_index".
230:                if axis == 0:
231:                    index_tuple1 = _gen_index_tuple(arr.shape, axis_index, 0)
232:                    result += arr[index_tuple1]
233:                elif axis == 1:
234:                    index_tuple2 = _gen_index_tuple(arr.shape, axis_index, 1)
235:                    result += arr[index_tuple2]
236:                elif axis == 2:
237:                    index_tuple3 = _gen_index_tuple(arr.shape, axis_index, 2)
238:                    result += arr[index_tuple3]
239:                elif axis == 3:
240:                    index_tuple4 = _gen_index_tuple(arr.shape, axis_index, 3)
241:                    result += arr[index_tuple4]
242:    
243:        return result

【讨论】：

我喜欢你提到的天真的实现可能不像库函数那样“正确”。这通常是不必要的——但在极少数情况下，它确实很重要，可能会导致结果出现微妙（并且难以追踪）问题。重要的是要知道 NumPy 也使用不精确的求和，它只是不那么“不正确”，因为它使用成对求和（或至少是展开的部分求和）。如果需要非常高的准确性，可能应该使用Kahan or Neumaier summation 这里可能不那么相关，但使用 @nb.jit('double[:](double[:, :])', nopython=True) （声明可能不连续的数组）通常会破坏 SIMD 向量化。您可以使用自动类型检测或声明 C (double[:,::1]) 或 Fortran (double[::1,:] 连续数组。 @max9111 在这种特殊情况下没有区别，但很高兴知道！

以上是关于为啥这个 numba 代码比 numpy 代码慢 6 倍？的主要内容，如果未能解决你的问题，请参考以下文章

为啥在这里 numba 比 numpy 快？

为啥 blas 比 numpy 慢

Numba 代码比纯 python 慢

为啥同时使用 numba.cuda 和 CuPy 从 GPU 传输数据这么慢？

分配给数组时Numba慢吗？

为啥 TF Keras 推理方式比 Numpy 操作慢？