Poor performance of C++ function in Cython

Posted 2023-02-23

Asked: 2017-09-29 20:11:49

[Question]:

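I have this C++ function, which I can call from Python with the code below. The performance is only half of what I get when running pure C++. Is there a way to bring the two to the same level? I compiled both codes with the -Ofast -march=native flags. I do not understand where I could be losing 50%, because most of the time should be spent in the C++ kernel. Is Cython making a memory copy that I can avoid?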

namespace diff
{
    void diff_cpp(double* __restrict__ at, const double* __restrict__ a, const double visc,
                  const double dxidxi, const double dyidyi, const double dzidzi,
                  const int itot, const int jtot, const int ktot)
    {
        const int ii = 1;
        const int jj = itot;
        const int kk = itot*jtot;

        for (int k=1; k<ktot-1; k++)
            for (int j=1; j<jtot-1; j++)
                for (int i=1; i<itot-1; i++)
                {
                    const int ijk = i + j*jj + k*kk;
                    at[ijk] += visc * (
                            + ( (a[ijk+ii] - a[ijk   ])
                              - (a[ijk   ] - a[ijk-ii]) ) * dxidxi
                            + ( (a[ijk+jj] - a[ijk   ])
                              - (a[ijk   ] - a[ijk-jj]) ) * dyidyi
                            + ( (a[ijk+kk] - a[ijk   ])
                              - (a[ijk   ] - a[ijk-kk]) ) * dzidzi
                            );
                }
    }
}

I have this .pyx file:

# import both numpy and the Cython declarations for numpy
import cython
import numpy as np
cimport numpy as np

# declare the interface to the C code
cdef extern from "diff_cpp.cpp" namespace "diff":
    void diff_cpp(double* at, double* a, double visc, double dxidxi, double dyidyi, double dzidzi, int itot, int jtot, int ktot)

@cython.boundscheck(False)
@cython.wraparound(False)
def diff(np.ndarray[double, ndim=3, mode="c"] at not None,
         np.ndarray[double, ndim=3, mode="c"] a not None,
         double visc, double dxidxi, double dyidyi, double dzidzi):
    cdef int ktot, jtot, itot
    ktot, jtot, itot = at.shape[0], at.shape[1], at.shape[2]
    diff_cpp(&at[0,0,0], &a[0,0,0], visc, dxidxi, dyidyi, dzidzi, itot, jtot, ktot)
    return None
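
On the question of hidden copies: as far as I know, a buffer argument declared as np.ndarray[double, ndim=3, mode="c"] is not converted on the way in, and an array that is not C-contiguous is rejected with an error rather than copied. A minimal sketch of a check from the Python side, assuming NumPy is available; the helper name is_zero_copy_ready is hypothetical and not part of the original code:

```python
import numpy as np

def is_zero_copy_ready(arr):
    """Hypothetical helper: True if arr already has the layout the wrapper expects."""
    return (isinstance(arr, np.ndarray)
            and arr.dtype == np.float64      # matches the 'double' element type
            and arr.ndim == 3                # matches ndim=3
            and arr.flags["C_CONTIGUOUS"])   # matches mode="c"

a = np.zeros((4, 5, 6))
print(is_zero_copy_ready(a))                     # freshly allocated float64 array
print(is_zero_copy_ready(a[:, :, ::2]))          # strided view of the same data
print(is_zero_copy_ready(a.astype(np.float32)))  # wrong element type
```

If the check passes for both at and a, the pointers passed to diff_cpp should refer directly to the NumPy buffers.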

I call this function from Python:

import numpy as np
import diff
import time

nloop = 20
itot = 256
jtot = 256
ktot = 256
ncells = itot*jtot*ktot

at = np.zeros((ktot, jtot, itot))

index = np.arange(ncells)
a = (index/(index+1))**2
a.shape = (ktot, jtot, itot)

# Check results
diff.diff(at, a, 0.1, 0.1, 0.1, 0.1)
print("at={0}".format(at.flatten()[itot*jtot+itot+itot//2]))

# Time the loop
start = time.perf_counter()
for i in range(nloop):
    diff.diff(at, a, 0.1, 0.1, 0.1, 0.1)
end = time.perf_counter()

print("Time/iter: {0} s ({1} iters)".format((end-start)/nloop, nloop))
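
To check the result independently of the compiled extension, the same stencil can be written with NumPy slicing. This is only a sketch for validation on small arrays, not a replacement for the kernel; the function name diff_reference and the small test sizes are assumptions made for illustration:

```python
import numpy as np

def diff_reference(at, a, visc, dxidxi, dyidyi, dzidzi):
    """Apply the same 7-point stencil as diff_cpp to the interior points, in place.

    Arrays are indexed as [k, j, i], so the last axis is the i direction.
    """
    c = a[1:-1, 1:-1, 1:-1]  # centre points a[ijk]
    at[1:-1, 1:-1, 1:-1] += visc * (
        ((a[1:-1, 1:-1, 2:] - c) - (c - a[1:-1, 1:-1, :-2])) * dxidxi
        + ((a[1:-1, 2:, 1:-1] - c) - (c - a[1:-1, :-2, 1:-1])) * dyidyi
        + ((a[2:, 1:-1, 1:-1] - c) - (c - a[:-2, 1:-1, 1:-1])) * dzidzi
    )

# Small example: a holds the squared flat index, so every second difference is constant.
ktot, jtot, itot = 6, 5, 4
a_small = np.arange(ktot * jtot * itot, dtype=np.float64).reshape(ktot, jtot, itot) ** 2
at_small = np.zeros_like(a_small)
diff_reference(at_small, a_small, 0.1, 0.1, 0.1, 0.1)
print(at_small[1, 1, 1])
```

Comparing at from diff.diff against this reference with np.allclose on a small grid gives a quick correctness check before timing anything.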

Here is the setup.py:

from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext

import numpy

setup(
    cmdclass = {'build_ext': build_ext},
    ext_modules = [Extension("diff",
                             sources=["diff.pyx"],
                             language="c++",
                             extra_compile_args=["-Ofast", "-march=native"],
                             include_dirs=[numpy.get_include()])],
)

Here is the C++ reference file that achieves twice the performance:

#include <iostream>
#include <iomanip>
#include <cstdlib>
#include <stdlib.h>
#include <cstdio>
#include <ctime>
#include "math.h"

void init(double* const __restrict__ a, double* const __restrict__ at, const int ncells)
{
    for (int i=0; i<ncells; ++i)
    {
        a[i]  = pow(i,2)/pow(i+1,2);
        at[i] = 0.;
    }
}


void diff(double* const __restrict__ at, const double* const __restrict__ a, const double visc,
          const double dxidxi, const double dyidyi, const double dzidzi,
          const int itot, const int jtot, const int ktot)
{
    const int ii = 1;
    const int jj = itot;
    const int kk = itot*jtot;

    for (int k=1; k<ktot-1; k++)
        for (int j=1; j<jtot-1; j++)
            for (int i=1; i<itot-1; i++)
            {
                const int ijk = i + j*jj + k*kk;
                at[ijk] += visc * (
                        + ( (a[ijk+ii] - a[ijk   ])
                          - (a[ijk   ] - a[ijk-ii]) ) * dxidxi
                        + ( (a[ijk+jj] - a[ijk   ])
                          - (a[ijk   ] - a[ijk-jj]) ) * dyidyi
                        + ( (a[ijk+kk] - a[ijk   ])
                          - (a[ijk   ] - a[ijk-kk]) ) * dzidzi
                        );
            }
}


int main()
{
    const int nloop = 20;
    const int itot = 256;
    const int jtot = 256;
    const int ktot = 256;
    const int ncells = itot*jtot*ktot;

    double *a  = new double[ncells];
    double *at = new double[ncells];

    init(a, at, ncells);

    // Check results
    diff(at, a, 0.1, 0.1, 0.1, 0.1, itot, jtot, ktot);
    printf("at=%.20f\n",at[itot*jtot+itot+itot/2]);

    // Time performance
    std::clock_t start = std::clock();

    for (int i=0; i<nloop; ++i)
        diff(at, a, 0.1, 0.1, 0.1, 0.1, itot, jtot, ktot);

    double duration = (std::clock() - start ) / (double)CLOCKS_PER_SEC;

    printf("time/iter = %f s (%i iters)\n",duration/(double)nloop, nloop);

    return 0;
}

[Comments]:

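What does your C++ test code look like? And perhaps the setup file?

@ead. I have added both.

This may be a typo, but nloop in your C++ source is half of that in the Python source, i.e. 10 versus 20. That would certainly explain a factor-of-two performance difference.

@bnaecker. You are right, but I divide the benchmark time by nloop, so that does not explain the difference. I changed it anyway, thanks!

@Chiel Yes. What about #pragma ivdep? In the pure C++ case the compiler can vectorize. Have you tried removing it and comparing?

[Answer 1]: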

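The problem here is not what happens at run time, but which optimizations happen during compilation.

Which optimizations are performed depends on the compiler (and even on its version), and there is no guarantee that every optimization that could be done will be done.

In fact, there are two different reasons why Cython is slower, depending on whether you use g++ or clang++:

- g++ is unable to optimize because of the flag -fwrapv in the Cython build.
- clang++ is unable to optimize in the first place (read on to see what happens).

First issue (g++): Cython compiles with different flags than your pure C++ program, and as a result some optimizations cannot be done.

If you look at the setup log, you will see: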

 x86_64-linux-gnu-gcc ... -O2 ..-fwrapv .. -c diff.cpp ... -Ofast -march=native

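As you said, -Ofast wins against -O2 because it comes last. The problem is -fwrapv, which seems to prevent some optimizations, because signed integer overflow can no longer be treated as UB and exploited for optimization.

So you have the following options:

- Add -fno-wrapv to extra_compile_args. The disadvantage is that all files are now compiled with the changed flags, which might be unwanted.
- Build a library from the cpp file with only the flags you like and link it to your Cython module. This solution has some overhead, but it has the advantage of being robust: as you pointed out, different Cython flags could be the problem for different compilers, so the first solution might be too brittle.
- I am not sure whether the default flags can be disabled, but there may be some information in the docs.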
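
The first two options can be sketched as keyword arguments for the Extension in setup.py. This is an illustration under assumptions, not a tested build: the helper extension_options and the library name diffkernel are hypothetical, and the second variant assumes the .pyx declares diff_cpp from a header instead of including diff_cpp.cpp directly.

```python
def extension_options(prebuilt_kernel=False):
    """Hypothetical helper returning keyword arguments for Extension(...)."""
    opts = {
        "name": "diff",
        "sources": ["diff.pyx"],
        "language": "c++",
        # Option 1: one list element per flag. According to the setup log, the
        # extra arguments are appended after the defaults, so -fno-wrapv
        # follows the default -fwrapv on the command line.
        "extra_compile_args": ["-Ofast", "-march=native", "-fno-wrapv"],
    }
    if prebuilt_kernel:
        # Option 2: the kernel was compiled separately with its own flags into
        # a library (hypothetical name: libdiffkernel) and is only linked here.
        opts["extra_compile_args"] = []
        opts["libraries"] = ["diffkernel"]
        opts["library_dirs"] = ["."]
    return opts

print(extension_options()["extra_compile_args"])
print(extension_options(prebuilt_kernel=True)["libraries"])
```

In setup.py these would be passed on as Extension(include_dirs=[numpy.get_include()], **extension_options()).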

Second issue (clang++): inlining in the test cpp program.

When I compile your cpp program with my rather old g++ version 5.4:

 g++ test.cpp -o test -Ofast -march=native -fwrapv

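it becomes almost 3 times slower than when compiled without -fwrapv. This is, however, a weakness of the optimizer: after inlining, it should see that no signed integer overflow is possible (all dimensions are about 256), so the flag -fwrapv should not have any impact.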
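
The arithmetic behind "no signed integer overflow is possible" can be checked in a few lines. The sketch below assumes a 32-bit int, which is typical but not guaranteed; wrap32 and max_flat_index are hypothetical helper names, and wrap32 only imitates the two's-complement wraparound that -fwrapv makes well defined.

```python
def wrap32(x):
    """Reduce x to a signed 32-bit value (two's-complement wraparound)."""
    return (x + 2**31) % 2**32 - 2**31

def max_flat_index(itot, jtot, ktot):
    """Largest flat index of an itot*jtot*ktot array, as an unbounded Python int."""
    return itot * jtot * ktot - 1

small = max_flat_index(256, 256, 256)
large = max_flat_index(2048, 2048, 2048)

# For 256^3 the index fits easily, so wrapping changes nothing.
print(small, wrap32(small) == small)
# For 2048^3 the index no longer fits into 32 bits and would wrap.
print(large, wrap32(large) == large)
```

With the sizes known at compile time, the optimizer can in principle prove the first case. For a function compiled on its own with arbitrary itot, jtot and ktot it cannot, which is where -fwrapv makes a difference.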

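My old clang++ version (3.8) seems to do a better job here: with the flags above I cannot see any degradation in performance. I need to disable inlining via -fno-inline to get slower code, but then it is slower even without -fwrapv, i.e.: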

 clang++ test.cpp -o test -Ofast -march=native -fno-inline

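So there is a systematic bias in favor of your C++ program: after inlining, the optimizer can optimize the code for the known values, which is something Cython cannot do.

So we can see: clang++ was not able to optimize the function diff for arbitrary sizes, but it was able to optimize it for size=256. Cython, however, can only use the unoptimized version of diff. That is why -fno-wrapv has no positive impact.

My take-away: disallow inlining of the function of interest in the cpp tester (e.g. compile it in its own object file) to ensure a level playing field with Cython. Otherwise you measure the performance of a program that was specially optimized for this one input.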
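
The comparisons described above (with and without -fwrapv, with and without -fno-inline) can be generated systematically. The following sketch only builds the command lines and does not run them; the helper build_commands and the output names are assumptions, while the flags and the file name test.cpp come from the commands shown earlier.

```python
import itertools

BASE = ["-Ofast", "-march=native"]
TOGGLES = ["-fwrapv", "-fno-inline"]

def build_commands(compiler="g++", source="test.cpp"):
    """Return one compile command per on/off combination of the toggled flags."""
    commands = []
    for mask in itertools.product([False, True], repeat=len(TOGGLES)):
        extra = [flag for flag, on in zip(TOGGLES, mask) if on]
        # Derive a distinct binary name such as test_wrapv_no-inline.
        suffix = "".join(flag.replace("-f", "_") for flag in extra) or "_plain"
        commands.append([compiler, source, "-o", "test" + suffix] + BASE + extra)
    return commands

for cmd in build_commands():
    print(" ".join(cmd))
```

Each command could then be passed to subprocess.run and the resulting binaries timed, for g++ and for clang++ alike.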

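NB: Interestingly, if all ints are replaced by unsigned ints, then -fwrapv naturally plays no role, but the version with unsigned int is as slow as the int version with -fwrapv. This is only logical, as there is no undefined behavior to be exploited.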

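[Discussion]:

As far as I know, the last flag wins; otherwise it would be impossible to override a flag that has already been set.

@Chiel OK, you are right, but something in the other flags interferes with the optimization: the assembly is completely different for the file compiled via Cython and the one compiled directly with g++.

This is not true. If I compile the fast C++ code with all the flags that Cython compiles with, I still reproduce the performance difference. Can you reproduce the difference?

@Chiel With

g++ test.cpp -o test -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fno-strict-aliasing -Wdate-time -D_FORTIFY_SOURCE=2 -g -fstack-protector-strong -Wformat -Werror=format-security -fPIC -ftrapv -Ofast -march=native

- 0.20 seconds

@Chiel -ftrapv overrides -ftrapw, but now -ftrapv prevents the optimization.

My setup line is extra_compile_args=['-std=c++11', '-Ofast', '-march=native', '-fno-wrapv'] and I get exactly the same speed as with the cpp program.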
