Armadillo inplace_plus significantly slower than "normal" plus operation

Posted 2023-02-17

Asked: 2014-11-05 20:50:09

Question:

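I'm writing a program with Armadillo 4.500.0, and I'm finding that in-place computations such as s += v * v.t() * q; are much slower than the equivalent s = s + v * v.t() * q;, where s, v and q are vectors of appropriate size.

When I run the following code, the in-place version turns out to be many times slower than the other one. For 500 elements it is about 480 times slower (5.13 s versus 0.011 s), even with aggressive optimization (-O3 or -Ofast; Apple LLVM version 6.0 (clang-600.0.54)).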

#include <iostream>
#include <armadillo>
#include <sys/time.h>

using namespace arma;
using namespace std;

#define N_ELEM 500
#define REP 10000

int main(int argc, const char * argv[]) {
    timeval start;
    timeval end;
    double tInplace, tNormal;
    vec s = randu<vec>(N_ELEM);
    vec v = randu<vec>(N_ELEM);
    vec q = randu<vec>(N_ELEM);

    // Time the in-place version.
    gettimeofday(&start, NULL);

    for(int i = 0; i < REP; ++i) {
        s += v * v.t() * q;
    }

    gettimeofday(&end, NULL);

    tInplace = (end.tv_sec - start.tv_sec + ((end.tv_usec - start.tv_usec) / 1e6));

    // Time the "normal" version.
    gettimeofday(&start, NULL);

    for(int i = 0; i < REP; ++i) {
        s = s + v * v.t() * q;
    }

    gettimeofday(&end, NULL);

    tNormal = (end.tv_sec - start.tv_sec + ((end.tv_usec - start.tv_usec) / 1e6));

    cout << "Inplace: " << tInplace << "; Normal: " << tNormal << " --> " << "Normal is " << tInplace / tNormal << " times faster" << endl;

    return 0;
}

Can anyone explain why the in-place operator performs so badly, even though it can use memory that is already available and so doesn't need to copy anything?

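Comments:

Testing the performance of code compiled without optimization gives silly results. Test with optimization and the mystery may disappear. Some functions are hundreds of times slower when compiled without optimization.

@Schwartz: Thanks for the hint, but even with aggressive optimization (-O3 or -Ofast) the performance difference is about 480x (5.13 s versus 0.011 s). I updated the question accordingly.

@DavidK - maybe send a bug report to the Armadillo developers?

Answer 1: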

Putting parentheses around v.t() * q solves the problem:

for(int i = 0; i < REP; ++i) {
    s += v * (v.t() * q);
}

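The parentheses force the order of evaluation. The expression (v.t() * q) evaluates to a scalar (technically a 1x1 matrix), which is then used to multiply the vector v. The parentheses also prevent v * v.t() from being turned into an explicit outer product.

Armadillo can figure this out automatically when the expression s = s + v * v.t() * q is used, but it (currently) needs more hints when the in-place operator += is used.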
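
To make the cost difference between the two evaluation orders concrete, here is a minimal sketch in plain C++. It uses std::vector as a stand-in for arma::vec, and the helper names outer_then_multiply and dot_then_scale are hypothetical. It is not how Armadillo is implemented internally; it only illustrates the arithmetic the answer describes. For n = 500, the first function fills a 500 x 500 temporary (250,000 entries) while the second needs a single scalar.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helpers; std::vector stands in for arma::vec.
using Vec = std::vector<double>;

// Left-to-right order: build the n x n outer product v * v^T first,
// then multiply that matrix by q. O(n^2) time and memory.
Vec outer_then_multiply(const Vec& v, const Vec& q) {
    assert(v.size() == q.size());
    const std::size_t n = v.size();
    std::vector<double> outer(n * n);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            outer[i * n + j] = v[i] * v[j];

    Vec result(n, 0.0);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            result[i] += outer[i * n + j] * q[j];
    return result;
}

// Parenthesized order: compute the scalar v^T * q first,
// then scale v by it. O(n) time, no temporary matrix.
Vec dot_then_scale(const Vec& v, const Vec& q) {
    assert(v.size() == q.size());
    const std::size_t n = v.size();
    double dot = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        dot += v[i] * q[i];

    Vec result(n);
    for (std::size_t i = 0; i < n; ++i)
        result[i] = v[i] * dot;
    return result;
}
```

Both functions return the same vector up to floating-point rounding; only the amount of work differs.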

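Comments:

Thanks, works like a charm! It is even about 10% faster than the manual loop I wrote to work around it. But since I also have to compute the norm of the vector first, a manual loop is 1.4 times faster than either of the two versions, because I can compute the norm of v and the dot product v.t() * q at the same time.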
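
The commenter's manual loop is not shown, so the following is only a guess at its shape. It assumes the intended update is s += u * (u.t() * q) with u = v / ||v||, which is algebraically the same as s += v * (v.q) / (v.v). The function name add_normalized_projection is hypothetical, and std::vector again stands in for arma::vec.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of Armadillo. Assumes the update is
// s += u * (u^T * q) with u = v / ||v||, i.e. s += v * (v.q) / (v.v).
void add_normalized_projection(std::vector<double>& s,
                               const std::vector<double>& v,
                               const std::vector<double>& q) {
    assert(s.size() == v.size() && v.size() == q.size());
    const std::size_t n = v.size();

    // Single pass: accumulate the squared norm of v and the dot product v.q.
    double norm_sq = 0.0;
    double dot = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        norm_sq += v[i] * v[i];
        dot += v[i] * q[i];
    }
    if (norm_sq == 0.0) return;  // v is the zero vector; nothing to add

    // Second pass: update s in place, without any temporaries.
    const double scale = dot / norm_sq;
    for (std::size_t i = 0; i < n; ++i)
        s[i] += scale * v[i];
}
```

Fusing the norm and the dot product into one loop reads v only once for both reductions, which is plausibly where the reported speedup comes from.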
