Armadillo vs for loop vector multiplication

Posted 2023-02-16

【Title】: Armadillo vs for loop vectors multiplications 【Posted】: 2018-12-06 15:50:33 【Question】:

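I want to measure Armadillo's performance when multiplying two complex vectors. I wrote a simple test that measures the processing time. The multiplication is implemented in two ways: Armadillo element-wise multiplication and a simple for loop over std::vector. Here is the test source: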

#include <iostream>
#include <armadillo>
#include <stdlib.h>
#include <cstdint>
#include <vector>
using namespace std;
using namespace arma;
#include <complex>
#include <chrono>
using namespace std::chrono;
#define VEC_SIZE 204800
int main(int argc, char** argv)
{
    const int iterations = 1000;

    cout << "Armadillo version: " << arma_version::as_string() << endl;
    //duration<double> lib_cnt, vec_cnt;
    uint32_t lib_cnt = 0, vec_cnt = 0;

    for (int it = 0; it < iterations; it++) {
        // init input vectors
        std::vector<complex<float>> vf1(VEC_SIZE);
        std::fill(vf1.begin(), vf1.end(), complex<float>(4., 6.));
        std::vector<complex<float>> vf2(VEC_SIZE);
        std::fill(vf2.begin(), vf2.end(), 5.);

        std::vector<complex<float>> vf_res(VEC_SIZE);

        // init arma vectors
        Col<complex<float>> vec1(vf1);
        Col<complex<float>> vec2(vf2);

        // time for loop duration
        auto t0 = high_resolution_clock::now();
        for (int vec_idx = 0; vec_idx < VEC_SIZE; vec_idx++) {
            vf_res[vec_idx] = vf1[vec_idx] * vf2[vec_idx];
        }
        auto t1 = high_resolution_clock::now();

        // each sample is cast to whole milliseconds before it is added
        vec_cnt += duration_cast<milliseconds>(t1 - t0).count();

        // read the results back so that they are used
        for (int vec_idx = 0; vec_idx < VEC_SIZE; vec_idx++) {
            complex<float> s = vf_res[vec_idx];
        }

        Col<complex<float>> mul_res(VEC_SIZE);

        // time arma element wise duration
        t0 = high_resolution_clock::now();
        mul_res = vec1 % vec2;
        t1 = high_resolution_clock::now();
        lib_cnt += duration_cast<milliseconds>(t1 - t0).count();
    }

    cout << "for loop time " << vec_cnt << " msec\n";
    cout << "arma time " << lib_cnt << " msec\n";

    return 0;
}

The results are as follows:

$ g++ example1.cpp -o example1 -O2 -larmadillo 
$ ./example1
Armadillo version: 9.200.5 (Carpe Noctem)
for loop time 2060 msec
arma time 3049 msec

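I expected Armadillo to perform the multiplication faster than a simple loop. Or am I wrong? Is a for loop expected to be faster at multiplying two vectors?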

【Answer 1】:

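This is not an answer to the question, more of an observation. If you restructure the code so that the two measurements run in two separate loops: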

#include <iostream>
#include <armadillo>
#include <complex>
#include <chrono>
#include <cstdint>
#include <vector>
using namespace std;
using namespace arma;
using namespace std::chrono;
#define VEC_SIZE 204800
int main(int argc, char** argv)
{
    const int iterations = 1000;
    cout << "Armadillo version: " << arma_version::as_string() << endl;
    //duration<double> lib_cnt, vec_cnt;
    uint32_t lib_cnt = 0, vec_cnt = 0;

    // init input vectors (once, outside the timed loops)
    std::vector<complex<float>> vf1(VEC_SIZE);
    std::fill(vf1.begin(), vf1.end(), complex<float>(4., 6.));
    std::vector<complex<float>> vf2(VEC_SIZE);
    std::fill(vf2.begin(), vf2.end(), 5.);
    std::vector<complex<float>> vf_res(VEC_SIZE);

    // init arma vectors
    Col<complex<float>> vec1(vf1);
    Col<complex<float>> vec2(vf2);
    Col<complex<float>> mul_res(VEC_SIZE);
    high_resolution_clock::time_point t0, t1;
    for (int it = 0; it < iterations; it++) {
        // time for loop duration
        t0 = high_resolution_clock::now();
        for (int vec_idx = 0; vec_idx < VEC_SIZE; vec_idx++) {
            vf_res[vec_idx] = vf1[vec_idx] * vf2[vec_idx];
        }
        t1 = high_resolution_clock::now();
        vec_cnt += duration_cast<milliseconds>(t1 - t0).count();
        // #if 1: close the first loop and time arma in a second loop;
        // #if 0: both measurements share a single loop
#if 1
    }
    for (int it = 0; it < iterations; it++) {
#endif
        // time arma element wise duration
        t0 = high_resolution_clock::now();
        mul_res = vec1 % vec2;
        t1 = high_resolution_clock::now();
        lib_cnt += duration_cast<milliseconds>(t1 - t0).count();
    }
    cout << "for loop time " << vec_cnt << " msec\n";
    cout << "arma time " << lib_cnt << " msec\n";

    return 0;
}

then the results go from

Armadillo version: 8.500.1 (Caffeine Raider)
for loop time 169 msec
arma time 244 msec

to

Armadillo version: 8.500.1 (Caffeine Raider)
for loop time 187 msec
arma time 22 msec

This looks more like the expected result. However, I cannot explain why...

Compiled with gcc 7.3.0 and OpenBLAS on a Core i5 M520 running Ubuntu 18.04.

【Comments】:

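It would be sensible to do something with the results (vf_res and mul_res), such as comparing them, to make sure they are not optimized away.

Claes, thanks for your reply. I tried your code and noticed that I get the same times in both cases (one loop or two loops). But the times are now: for loop time 2073 ms arma time 2060 msec. As far as I can see, the two times are now equal. I noticed that moving the vector initialization out of the loop changes the performance.

@dmuir, I added code that uses the results to make sure they are not optimized away (in fact, in the posted code you can find the for loop with 'complex<float> s = vf_res[vec_idx];'). It does not affect the performance.

@dmuir and vladimir, you are right, this is definitely about optimization. When I run your code with the -O2 flag I get ~3000/4000, with -O0 ~11000/10000, and with -Ofast ~100/1000. By the way, I do not use the wrapper library (-DARMA_DONT_USE_WRAPPER).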
