使用内在函数提高数组仿射变换的速度

Posted 2023-02-16

技术标签:

【中文标题】使用内在函数提高数组仿射变换的速度【英文标题】：Improving speed of affine transformation of an array using intrinsics 【发布时间】：2021-04-15 09:40:05 【问题描述】：

在性能敏感的代码中，我必须执行向量的仿射变换： Y=a*X+b 其中Y 和X 是向量，a 和b 是标量。作为提高计算速度的一种快速而简单的方法，我将并行化委托给 openMP #pragma omp simd 指令。最近有一些空闲时间，我尝试直接使用内在函数来实现它，获得与omp 解决方案大致相同的性能。有没有办法打败 OMP 矢量化？我可以用完 AVX2 指令。

以下代码在windows 10下测试，用VS 2019编译。

#include <iostream>
#include <armadillo>
#include <chrono>
#include <immintrin.h>

///Computes y=alpha*x+beta
inline void SumAndSetOmp(
    arma::Col<double>& y        /**< Result*/,
    const arma::Col<double>& x  /**< Input*/,
    const double& alpha         /**< Coefficient*/,
    const double& beta          /**< Offset*/)

    auto* __restrict lhs = y.memptr();
    const auto* __restrict add_rhs = x.memptr();
    const auto& n = x.n_elem;
#pragma omp simd
    for (arma::uword i = 0; i < n; ++i)
    
        lhs[i] = add_rhs[i] * alpha + beta;
    


inline void SumAndSetSerial(
    arma::Col<double>& y        /**< Result*/,
    const arma::Col<double>& x  /**< Input*/,
    const double& alpha         /**< Coefficient*/,
    const double& beta          /**< Offset*/)

    auto* lhs = y.memptr();
    const auto* add_rhs = x.memptr();
    const auto& n = x.n_elem;
    for (arma::uword i = 0; i < n; ++i)
    
        lhs[i] = add_rhs[i] * alpha + beta;
    


inline void SumAndSetAVX(arma::Col<double>& y       /**< Result*/,
    const arma::Col<double>& x              /**< Input*/,
    const double& alpha                     /**< Coefficient*/,
    const double& beta                      /**< Offset*/)

    //Allocate  coefficients
    const auto alphas = _mm256_set1_pd(alpha);
    const auto betas = _mm256_set1_pd(beta);

    //Extracting memory addresses
    auto* __restrict pos_lhs = y.memptr();
    const auto* __restrict pos_rhs = x.memptr();

    //Computing sizes
    const unsigned int length_array = 4;
    const unsigned long long n_aligned = x.n_elem / length_array;
    const unsigned int remainder = x.n_elem % length_array;

    //Performing AVX instruction
    for (unsigned long long i = 0; i < n_aligned; i++) 
        const __m256d x_avx = _mm256_loadu_pd(pos_rhs);
        const __m256d y_avx = _mm256_fmadd_pd(x_avx, alphas, betas);
        _mm256_storeu_pd(pos_lhs, y_avx);
        pos_rhs += length_array;
        pos_lhs += length_array;
    

    //Process the rest serially
    for (unsigned int i = 0; i < remainder; i++) 
        pos_lhs[i] = alpha * pos_rhs[i] + beta;
    


enum method

    serial,
    omp,
    avx
;

arma::vec perform_test(const arma::vec& x, const method mtd, int trials = 100, const double alpha = 3.0, const double beta = 5.0)

    arma::Col<double> res(x.n_elem);

    const auto beg = std::chrono::steady_clock::now();
    switch (mtd) 
    case serial:
        for (int i = 0; i < trials; i++)
            SumAndSetSerial(res, x, alpha, beta);
        break;
    case omp:
        for (int i = 0; i < trials; i++)
            SumAndSetOmp(res, x, alpha, beta);
        break;
    case avx:
        for (int i = 0; i < trials; i++)
            SumAndSetAVX(res, x, alpha, beta);
        break;
    
    std::cout << "time:" << std::chrono::duration<double>(std::chrono::steady_clock::now() - beg).count() << "s\n";
    return res;



//Benchmarking
double test_fun(long long int n,int trials=100, const double alpha = 3.0, const double beta = 5.0)

    const arma::Col<double> x(n, arma::fill::randn);
    
    const arma::Col<double> reference = alpha*x + beta;
    std::cout << "Serial: ";
    const auto res_serial = perform_test(x, method::serial, trials, alpha, beta);
    std::cout << "OMP: ";
    const auto res_omp = perform_test(x, method::omp, trials, alpha, beta);
    std::cout << "AVX: ";
    const auto res_avx = perform_test(x, method::avx, trials, alpha, beta);
    // errors wrt the reference
    const double err_serial = arma::max(arma::abs(reference - res_serial));
    const double err_avx = arma::max(arma::abs(reference - res_avx));
    const double err_omp = arma::max(arma::abs(reference - res_omp));

    //Largest error
    const double error = std::max(std::max(err_serial, err_avx), err_omp);
    if (error> 1e-6)
    
        throw std::runtime_error("Something is wrong!");
    
    return error;


int main()

    test_fun(10000000);

【问题讨论】：

犰狳不应该在本地做这样的操作吗？您是否也只有 AVX2 或 FMA？ n 有多大？我用 Eigen 进行了测试（我知道它确实在幕后做了很多优化），它生成的 asm 与您在此处使用的内在函数版本非常相似，因此您可能非常接近最优你可以做什么。我第二个想知道n 的问题（因为如果在编译时就知道了，你可以进一步优化）。可能值得在组装时达到峰值以确保基准测试确实有效（即它没有优化所有代码），如果这对性能至关重要（肯定是-O3），那么值得使用编译标志 Armadillo 将这些委托给编译器，例如它甚至不使用 ?axpy 进行此类操作。 n 的值是动态的，通常在 1000-10000 范围内。指令方面，我假设一个相当新的处理器，除了 AVX512 之外的所有指令 【参考方案1】：

我尝试改进您的解决方案，但我认为它们是最佳的，因此您最好坚持使用 omp。

以下内容是推测性的，超出了您的问题范围：

如果您可以尝试 omp 的多线程，我总是对您获得合理提升的元素数量如此之低感到惊讶，尽管我希望使用简单仿射变换的 1000 个元素也会如此低的。如果你的算法的其他部分是可并行的，那么这更有可能是有帮助的。

如果你有能力改变你的问题并且不需要双精度，你可以使用浮点数。

【讨论】：

我在并行部分中使用了这段代码，这就是为什么我通过内在路线，试图充分利用它。

以上是关于使用内在函数提高数组仿射变换的速度的主要内容，如果未能解决你的问题，请参考以下文章

仿射变换

opencv中的仿射变换

OpenGL基础仿射变换原理解析

python-opencv几何变换--仿射变换透视变换

【转】仿射变换及其变换矩阵的理解

利用OpenCV的仿射变换函数warpAffine()实现图像的亚像素级平移