What is the recommended way of allocating memory for a typed memory view?

Posted 2023-02-21

Asked: 2013-08-30 01:03:31

Question:

The Cython documentation on typed memory views lists three ways of assigning to a typed memory view:

1. from a raw C pointer
2. from a np.ndarray
3. from a cython.view.array

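Assume that no data is passed in to my Cython function from outside; instead I want to allocate memory and return it as a np.ndarray. Which of these options should I choose? Also assume that the size of the buffer is not a compile-time constant, i.e. I cannot allocate on the stack and would need malloc for option 1.

The 3 options would therefore look something like this: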

from libc.stdlib cimport malloc, free
import numpy as np   # run-time module, needed for np.empty below
cimport numpy as np
from cython cimport view

np.import_array()

def memview_malloc(int N):
    cdef int * m = <int *>malloc(N * sizeof(int))
    cdef int[::1] b = <int[:N]>m
    free(<void *>m)

def memview_ndarray(int N):
    cdef int[::1] b = np.empty(N, dtype=np.int32)

def memview_cyarray(int N):
    cdef int[::1] b = view.array(shape=(N,), itemsize=sizeof(int), format="i")

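What surprises me is that in all three cases Cython generates quite a lot of code for the memory allocation, in particular a call to __Pyx_PyObject_to_MemoryviewSlice_dc_int. This suggests (and I might be wrong here, my insight into the inner workings of Cython is very limited) that it first creates a Python object and then "casts" it into a memory view, which seems like unnecessary overhead.

A simple benchmark does not reveal much difference between the three methods, with option 2 being the fastest.

Which of the three methods is recommended? Or is there a different, better option?

Follow-up question: after working with the memory view inside the function, I want to return the result as a np.ndarray. Is a typed memory view the best choice, or should I rather use the old buffer interface, as below, and create an ndarray in the first place?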

cdef np.ndarray[DTYPE_t, ndim=1] b = np.empty(N, dtype=np.int32)
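
The pattern behind the follow-up question (allocate an owning object, work on it through a view, return the owner) can be illustrated without Cython. The following is only a plain-Python analogue, assuming the stdlib array module as the owner and the built-in memoryview as the view; the function name squares is made up for illustration, and it says nothing about the speed of the Cython variants.

```python
from array import array


def squares(n):
    """Allocate an owning buffer, fill it through a view, return the owner."""
    owner = array("i", [0]) * n      # owning object; length known only at run time
    view = memoryview(owner)         # shares memory with owner, no copy is made
    for i in range(n):
        view[i] = i * i              # writes go straight into owner's buffer
    view.release()                   # drop the buffer export before returning
    return owner


result = squares(5)
print(result.tolist())
```

The view never owns the memory here, so the caller only ever sees the owning object, which is the same division of labour as returning a NumPy array that was filled through a typed memory view.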

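Comments:

Good question, I have been wondering about something similar.

Your benchmark is the best answer I know of. To answer the follow-up question: you can declare your NumPy array in the usual way (you do not even have to use the old typed interface) and then do something like cdef int[:] arrview = arr to get a view of the same memory that the NumPy array uses. You can use the view for fast indexing and for passing slices between Cython functions, while still having access to the NumPy functions through the NumPy array. When you are done, you can simply return the NumPy array.

There is a good related question here... where you can see that np.empty can be slow...

Answer 1: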

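As a follow-up to Veedrac's answer: be aware that using the memoryview support of cpython.array with Python 2.7 currently appears to lead to memory leaks. This seems to be a long-standing issue, as it is mentioned on the cython-users mailing list here in a post from November 2012. Running Veedrac's benchmark script with Cython version 0.22 under both Python 2.7.6 and Python 2.7.9 leads to a large memory leak when a cpython.array is initialised through either the buffer or the memoryview interface. No memory leak occurs when the script is run with Python 3.4. I have filed a bug report on this with the Cython developers mailing list.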

Answer 2:

Look here for an answer.

The basic idea is that you want cpython.array.array and cpython.array.clone (not cython.array.*):

from cpython.array cimport array, clone

# This type is what you want and can be cast to things of
# the "double[:]" syntax, so no problems there
cdef array[double] armv, templatemv

templatemv = array('d')

# This is fast
armv = clone(templatemv, L, False)

EDIT

It turns out that the benchmarks in that thread were rubbish. Here is my set, with my timings:

# cython: language_level=3
# cython: boundscheck=False
# cython: wraparound=False

import time
import sys

from cpython.array cimport array, clone
from cython.view cimport array as cvarray
from libc.stdlib cimport malloc, free
import numpy as numpy
cimport numpy as numpy

cdef int loops

def timefunc(name):
    def timedecorator(f):
        cdef int L, i

        print("Running", name)
        for L in [1, 10, 100, 1000, 10000, 100000, 1000000]:
            start = time.perf_counter()  # time.clock() was removed in Python 3.8
            f(L)
            end = time.perf_counter()
            print(format((end-start) / loops * 1e6, "2f"), end=" ")
            sys.stdout.flush()

        print("μs")
    return timedecorator

print()
print("INITIALISATIONS")
loops = 100000

@timefunc("cpython.array buffer")
def _(int L):
    cdef int i
    cdef array[double] arr, template = array('d')

    for i in range(loops):
        arr = clone(template, L, False)

    # Prevents dead code elimination
    str(arr[0])

@timefunc("cpython.array memoryview")
def _(int L):
    cdef int i
    cdef double[::1] arr
    cdef array template = array('d')

    for i in range(loops):
        arr = clone(template, L, False)

    # Prevents dead code elimination
    str(arr[0])

@timefunc("cpython.array raw C type")
def _(int L):
    cdef int i
    cdef array arr, template = array('d')

    for i in range(loops):
        arr = clone(template, L, False)

    # Prevents dead code elimination
    str(arr[0])

@timefunc("numpy.empty_like memoryview")
def _(int L):
    cdef int i
    cdef double[::1] arr
    template = numpy.empty((L,), dtype='double')

    for i in range(loops):
        arr = numpy.empty_like(template)

    # Prevents dead code elimination
    str(arr[0])

@timefunc("malloc")
def _(int L):
    cdef int i
    cdef double* arrptr

    for i in range(loops):
        arrptr = <double*> malloc(sizeof(double) * L)
        free(arrptr)

    # Prevents dead code elimination. Note that arrptr has already been
    # freed, so this read is technically undefined behaviour.
    str(arrptr[0])

@timefunc("malloc memoryview")
def _(int L):
    cdef int i
    cdef double* arrptr
    cdef double[::1] arr

    for i in range(loops):
        arrptr = <double*> malloc(sizeof(double) * L)
        arr = <double[:L]>arrptr
        free(arrptr)

    # Prevents dead code elimination. Note that arr still points at freed
    # memory, so this read is technically undefined behaviour.
    str(arr[0])

@timefunc("cvarray memoryview")
def _(int L):
    cdef int i
    cdef double[::1] arr

    for i in range(loops):
        arr = cvarray((L,),sizeof(double),'d')

    # Prevents dead code elimination
    str(arr[0])



print()
print("ITERATING")
loops = 1000

@timefunc("cpython.array buffer")
def _(int L):
    cdef int i
    cdef array[double] arr = clone(array('d'), L, False)

    cdef double d
    for i in range(loops):
        for i in range(L):
            d = arr[i]

    # Prevents dead-code elimination
    str(d)

@timefunc("cpython.array memoryview")
def _(int L):
    cdef int i
    cdef double[::1] arr = clone(array('d'), L, False)

    cdef double d
    for i in range(loops):
        for i in range(L):
            d = arr[i]

    # Prevents dead-code elimination
    str(d)

@timefunc("cpython.array raw C type")
def _(int L):
    cdef int i
    cdef array arr = clone(array('d'), L, False)

    cdef double d
    for i in range(loops):
        for i in range(L):
            d = arr[i]

    # Prevents dead-code elimination
    str(d)

@timefunc("numpy.empty_like memoryview")
def _(int L):
    cdef int i
    cdef double[::1] arr = numpy.empty((L,), dtype='double')

    cdef double d
    for i in range(loops):
        for i in range(L):
            d = arr[i]

    # Prevents dead-code elimination
    str(d)

@timefunc("malloc")
def _(int L):
    cdef int i
    cdef double* arrptr = <double*> malloc(sizeof(double) * L)

    cdef double d
    for i in range(loops):
        for i in range(L):
            d = arrptr[i]

    free(arrptr)

    # Prevents dead-code elimination
    str(d)

@timefunc("malloc memoryview")
def _(int L):
    cdef int i
    cdef double* arrptr = <double*> malloc(sizeof(double) * L)
    cdef double[::1] arr = <double[:L]>arrptr

    cdef double d
    for i in range(loops):
        for i in range(L):
            d = arr[i]

    free(arrptr)

    # Prevents dead-code elimination
    str(d)

@timefunc("cvarray memoryview")
def _(int L):
    cdef int i
    cdef double[::1] arr = cvarray((L,),sizeof(double),'d')

    cdef double d
    for i in range(loops):
        for i in range(L):
            d = arr[i]

    # Prevents dead-code elimination
    str(d)

Output:

INITIALISATIONS
Running cpython.array buffer
0.100040 0.097140 0.133110 0.121820 0.131630 0.108420 0.112160 μs
Running cpython.array memoryview
0.339480 0.333240 0.378790 0.445720 0.449800 0.414280 0.414060 μs
Running cpython.array raw C type
0.048270 0.049250 0.069770 0.074140 0.076300 0.060980 0.060270 μs
Running numpy.empty_like memoryview
1.006200 1.012160 1.128540 1.212350 1.250270 1.235710 1.241050 μs
Running malloc
0.021850 0.022430 0.037240 0.046260 0.039570 0.043690 0.030720 μs
Running malloc memoryview
1.640200 1.648000 1.681310 1.769610 1.755540 1.804950 1.758150 μs
Running cvarray memoryview
1.332330 1.353910 1.358160 1.481150 1.517690 1.485600 1.490790 μs

ITERATING
Running cpython.array buffer
0.010000 0.027000 0.091000 0.669000 6.314000 64.389000 635.171000 μs
Running cpython.array memoryview
0.013000 0.015000 0.058000 0.354000 3.186000 33.062000 338.300000 μs
Running cpython.array raw C type
0.014000 0.146000 0.979000 9.501000 94.160000 916.073000 9287.079000 μs
Running numpy.empty_like memoryview
0.042000 0.020000 0.057000 0.352000 3.193000 34.474000 333.089000 μs
Running malloc
0.002000 0.004000 0.064000 0.367000 3.599000 32.712000 323.858000 μs
Running malloc memoryview
0.019000 0.032000 0.070000 0.356000 3.194000 32.100000 327.929000 μs
Running cvarray memoryview
0.014000 0.026000 0.063000 0.351000 3.209000 32.013000 327.890000 μs

(The reason for the "iterating" benchmark is that some methods have surprisingly different characteristics in this respect.)
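
Each figure in the output is a per-call average: the elapsed time for one size L, divided by loops, scaled to microseconds. The following plain-Python sketch reproduces only that arithmetic, using time.perf_counter as the timer (time.clock no longer exists as of Python 3.8). The helper per_call_us and the workload alloc_doubles are hypothetical names introduced for illustration; they are not part of the benchmark, and the numbers they produce are not comparable with the Cython timings.

```python
import time
from array import array


def per_call_us(func, loops):
    """Average time of one func() call in microseconds over `loops` calls."""
    start = time.perf_counter()
    for _ in range(loops):
        func()
    end = time.perf_counter()
    return (end - start) / loops * 1e6


def alloc_doubles(n=1000):
    # Hypothetical stand-in workload: allocate n zeroed doubles.
    return array("d", [0.0]) * n


elapsed = per_call_us(alloc_doubles, loops=1000)
print(format(elapsed, "2f"), "μs")
```

Dividing by loops is what makes the columns comparable across methods; the loop overhead itself is included in every figure, which matters most for the fastest entries.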

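In order of initialisation speed:

malloc: It is a harsh world, but it is fast. If you need to allocate a lot of things and want unhindered iteration and indexing performance, this has to be it. But normally you are better off with...

cpython.array raw C type: Well damn, it is fast. And it is safe. Unfortunately, it goes through Python to access its data fields. You can avoid that with a wonderful trick:

arr.data.as_doubles[i]

which brings it up to the standard speed while removing the safety! This makes it a wonderful replacement for malloc, being basically a reference-counted version of it!

cpython.array buffer: Coming in at only three to four times the setup time of malloc, this looks like a good bet. Unfortunately, it has significant iteration overhead (albeit small compared with the boundscheck and wraparound directives). That means it only really competes against the full-safety variants, but it is the fastest of those to initialise. Your choice.

cpython.array memoryview: This is now an order of magnitude slower than malloc to initialise. That is a shame, but it iterates just as fast. This is the standard solution I would suggest unless boundscheck or wraparound are on (in which case cpython.array buffer might be a more compelling trade-off).

The rest. The only one worth mentioning is numpy, because of the many fun methods attached to its objects. That is it, though.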
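
cpython.array is Cython's C-level interface to the stdlib array module, and a typed memoryview binds to such an object through the buffer protocol. The sketch below uses the built-in memoryview in plain Python to inspect the buffer that an array of doubles exposes; it is an illustration of the protocol only, not of Cython's generated code, and it is not a benchmark.

```python
from array import array

arr = array("d", [0.0]) * 4          # four doubles owned by a stdlib array
view = memoryview(arr)               # buffer-protocol view, no copy

# Properties a double[::1] declaration depends on: element format,
# element size, dimensionality and C-contiguity.
print(view.format, view.itemsize, view.ndim, view.c_contiguous)

view[2] = 1.5                        # write through the view...
print(arr[2])                        # ...and read it back from the owner
```

Because the view and the array share one block of memory, the owner stays alive for as long as a view references it, which is the reference-counted safety the ranking above contrasts with malloc.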

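Comments:

Thanks for the comprehensive survey and for backing it up with numbers!

Great answer! Am I right in assuming that only the pure malloc solution completely avoids the need to acquire the GIL? I am interested in ways of allocating multi-dimensional arrays in parallel worker threads.

Try it and report back!

People do seem to find this useful. I will see if I can add it.

cpython.array is already described at docs.cython.org/src/tutorial/array.html.

The code should be changed to include the "arr.data.as_doubles[i]" trick in the "raw C type" benchmarks, because indexing without it is definitely not raw (the current version could be called "plain cpython.array" indexing, but that is not an interesting data point).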
