在简单的逐行计算任务中，为啥犰狳与 C 风格的数组相比如此缓慢

Posted 2023-02-16

技术标签:

【中文标题】在简单的逐行计算任务中，为啥犰狳与 C 风格的数组相比如此缓慢【英文标题】：Why is Armadillo so slow compared to a C-style array in a simple row-wise computationnal task在简单的逐行计算任务中，为什么犰狳与 C 风格的数组相比如此缓慢 【发布时间】：2018-11-23 15:04:07 【问题描述】：

我目前正在为大矩阵的每个值（数百万行，列数

更准确地说，对于每一行i中的每个值M(i,j)，列j，数量就是 [ M(i,j) - mean(i ,s) ] / std(i,s) 其中 s 是子集 s in M(i,:) - j 换句话说，s 是行 i 的所有值的子集，没有值 j。

我比较了两种实现，一种是 C 风格的数组，另一种是 Armadillo，就执行时间而言，Armadillo 大约慢了两倍。我预计执行时间会相似或稍慢，但纯 C 数组似乎可以显着提高性能。

有什么特别的原因或某事我错过了某处吗？这是一个使用以下代码编译的示例：-O2 -lstdc++ -DARMA_DONT_USE_WRAPPER -lopenblas -llapack -lm。还尝试使用ARMA_NO_DEBUG，但没有成功。

#include <string>
#include <vector>
#include <iostream>
#include <fstream>
#include <algorithm>
#include <armadillo>
#include <chrono>

using namespace std::chrono;

/***************************
 * main()
 ***************************/
int main( int argc, char *argv[] )

    unsigned nrows = 2000000; //number of rows
    unsigned ncols = 100; //number of cols

    const arma::mat huge_mat = arma::randn(nrows, ncols); //create huge matrix

    const arma::uvec vec = arma::linspace<arma::uvec>( 0, huge_mat.n_cols-1, huge_mat.n_cols); //create a vector of [0,...,n]
    arma::rowvec inds = arma::zeros<arma::rowvec>( huge_mat.n_cols-1 ); //-1 since we remove only one value at each step.
    arma::colvec simuT = arma::zeros<arma::colvec>( ncols ); //let's store the results in this simuT vector.

    high_resolution_clock::time_point t1 = high_resolution_clock::now();

    //compute some normalization over each value of line of this huge matrix:
    for(unsigned i=0; i < nrows; i++) 
        const arma::rowvec current_line = huge_mat.row(i); //extract current line

        //for each observation in current_line:
        for(unsigned j=0; j < ncols; j++) 

            //Take care of side effects first:
            if( j == 0 )
                inds = current_line(arma::span(1, ncols-1));
            else
                if( j == 1 ) 
                    inds(0) = current_line(0);
                    inds(arma::span(1, ncols-2)) = current_line( arma::span(2, ncols-1) );
                 else
                    inds(arma::span(0, j-1)) = current_line( arma::span(0, j-1) );

            //Let's do some computation: huge_mat(i,j) - mean[huge_mat(i,:)] / std([huge_mat(i,:)]) //can compute the mean and std first... for each line.
            simuT(j) = (current_line(j) - arma::mean(inds))  / ( std::sqrt( 1+1/((double) ncols-1) ) * arma::stddev(inds) );
        
    

    high_resolution_clock::time_point t2 = high_resolution_clock::now();
    auto duration = duration_cast<seconds>( t2 - t1 ).count();
    std::cout << "ARMADILLO: " << duration << " secs\n";

    //------------------PLAIN C Array
    double *Mat_full;
    double *output;
    unsigned int i,j,k;
    double mean=0, stdd=0;
    double sq_diff_sum = 0, sum=0;
    double diff = 0;

    Mat_full = (double *) malloc(ncols * nrows * sizeof(double));
    output = (double *) malloc(nrows * ncols * sizeof(double));

    std::vector< std::vector<double> > V(huge_mat.n_rows);

    //Some UGLY copy from arma::mat to double* using a vector:
    for (size_t i = 0; i < huge_mat.n_rows; ++i)
        V[i] = arma::conv_to< std::vector<double> >::from(huge_mat.row(i));

    //then dump to Mat_full array:
    for (i=0; i < V.size(); i++)
        for (j=0; j < V[i].size(); j++)
            Mat_full[i + huge_mat.n_rows * j] = V[i][j];

    t1 = high_resolution_clock::now();

    for(i=0; i < nrows; i++)
        for(j=0; j < ncols; j++)
        
            //compute mean of subset-------------------
            sum = 0;
            for(k = 0; k < ncols; k++)
                if(k!=j)
                
                    sum = sum + Mat_full[i+k*nrows];
                
            mean = sum / (ncols-1);

            //compute standard deviation of subset-----
            sq_diff_sum = 0;
            for(k = 0; k < ncols; k++)
                if(k!=j)
                
                    diff = Mat_full[i+k*nrows] - mean;
                    sq_diff_sum += diff * diff;
                
            stdd = sqrt(sq_diff_sum / (ncols-2));

            //export to plain C array:
            output[i*ncols+j] = (Mat_full[i+j*nrows] - mean) / (sqrt(1+1/(((double) ncols)-1))*stdd);
        

    t2 = high_resolution_clock::now();
    duration = duration_cast<seconds>( t2 - t1 ).count();
    std::cout << "C ARRAY: " << duration << " secs\n";

在比较执行时间时，特别是对 arma::mean 和 arma::stddev 的调用似乎表现不佳。我没有对大小对性能的影响进行任何深入分析，但似乎对于 nrows 的小值，普通 C 往往（非常）快。对于使用这个的简单测试我得到的设置：

ARMADILLO: 111 secs
C ARRAY: 79 secs

在执行时间。

编辑正如@rubenvb 和@mtall 所建议的那样，这是我们按列而不是按行工作并独立处理每一列的修改。由此产生的执行时间略有减少（ARMADILLO: 104 secs now），因此显示出比按行工作的一些改进：

#include <string>
#include <vector>
#include <iostream>
#include <fstream>
#include <algorithm>
#include <armadillo>
#include <chrono>

using namespace std::chrono;

/***************************
 * main()
 ***************************/
int main( int argc, char *argv[] )

    unsigned nrows = 100; //number of rows
    unsigned ncols = 2000000; //number of cols

    const arma::mat huge_mat = arma::randn(nrows, ncols); //create huge matrix

    const arma::uvec vec = arma::linspace<arma::uvec>( 0, huge_mat.n_rows-1, huge_mat.n_rows); //create a vector of [0,...,n]
    arma::colvec inds = arma::zeros<arma::colvec>( huge_mat.n_rows-1 ); //-1 since we remove only one value at each step.
    arma::rowvec simuT = arma::zeros<arma::rowvec>( nrows ); //let's store the results in this simuT vector.

    high_resolution_clock::time_point t1 = high_resolution_clock::now();

    //compute some normalization over each value of line of this huge matrix:
    for(unsigned i=0; i < ncols; i++) 
        const arma::colvec current_line = huge_mat.col(i); //extract current line

        //for each observation in current_line:
        for(unsigned j=0; j < nrows; j++) 

            //Take care of side effects first:
            if( j == 0 )
                inds = current_line(arma::span(1, nrows-1));
            else
                if( j == 1 ) 
                    inds(0) = current_line(0);
                    inds(arma::span(1, nrows-2)) = current_line( arma::span(2, nrows-1) );
                 else
                    inds(arma::span(0, j-1)) = current_line( arma::span(0, j-1) );

            //Let's do some computation: huge_mat(i,j) - mean[huge_mat(i,:)] / std([huge_mat(i,:)]) //can compute the mean and std first... for each line.
            simuT(j) = (current_line(j) - arma::mean(inds))  / ( std::sqrt( 1+1/((double) nrows-1) ) * arma::stddev(inds) );
        
    

    high_resolution_clock::time_point t2 = high_resolution_clock::now();
    auto duration = duration_cast<seconds>( t2 - t1 ).count();
    std::cout << "ARMADILLO: " << duration << " secs\n";

【问题讨论】：

我会尝试的第一件事是尽可能将您的计算重新表述为矩阵运算。我要尝试的第二件事是交换行和列，这取决于基础数据可能会让你在这里不断地缓存丢失。不相关：对两个代码都进行了一些优化。可以避免测试k != j，然后减去j对应的词。小心，@Damien。计算机算术，尤其是浮点算术，不是代数的。操作顺序可能很重要，因此您建议的优化可能会导致计算出不同的结果。这似乎是所谓的abstraction penalty 的一个例子，在这种情况下，您通过稍微降低速度来“支付”良好的语法和易于使用的功能。在您的代码中，无需通过 std::vector 中介将数据洗牌到Mat_full。您可以通过.memptr() 成员函数直接访问犰狳矩阵中的矩阵数据。示例：double* data = huge_mat.memptr()。这意味着您可以在需要时直接使用类似 C 的访问权限。 @Grasshoper 还要考虑到犰狳以column major 格式存储数据。看起来您的代码逐行访问数据。这不是一个很好的匹配。 【参考方案1】：

原因是犰狳使用column-major ordering in mat，而您的 C 数组使用行优先排序。这很重要，因为您的处理器可以使用 instruction vectorization 一次处理多个元素，这需要连续的内存块。

要验证这是否是原因，请执行相同的计算，但针对列而不是行，并检查差异。

【讨论】：

谢谢，我已经在原帖的编辑中做了推荐的修改，提高了效率。

以上是关于在简单的逐行计算任务中，为啥犰狳与 C 风格的数组相比如此缓慢的主要内容，如果未能解决你的问题，请参考以下文章