How can I parallelize the process of copying the rows of a matrix randomly to another matrix in memory? [duplicate]

Posted 2023-02-22

【Asked】: 2019-12-15 06:48:22 【Question】:

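I have a matrix, call it small_matrix, of roughly 100000 rows and 128 columns, stored as a single array (I use it for CUDA computations, so the space saving is needed). I have a larger matrix, call it large_matrix, with 10 times as many rows as small_matrix and the same row length, and I want to fill its rows with rows from small_matrix. However, the filling is not 1:1. A map array maps each row of large_matrix to a row of small_matrix. A single row of small_matrix can be mapped to by several rows of large_matrix. We can assume the map array is generated randomly. In addition, there is a small chance (say 1%) that a row of large_matrix holds random values instead of actual values.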
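
To make the mapping concrete, here is a minimal sequential sketch of the row gather described above, on a tiny matrix. The helper name `expand_rows`, its signature, and the use of a constant `fill_value` for unmapped rows (marked by -1) are assumptions made for illustration; they are not part of the original code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper: builds the large matrix (flat, row-major) from the small one.
// map[i] is the source row in `small` for destination row i, or -1 for "no source row".
std::vector<float> expand_rows(const std::vector<float>& small,
                               const std::vector<long long>& map,
                               std::size_t row_length,
                               float fill_value) {
    assert(row_length > 0 && small.size() % row_length == 0);
    const long long small_rows = static_cast<long long>(small.size() / row_length);
    std::vector<float> large(map.size() * row_length);
    for (std::size_t i = 0; i < map.size(); ++i) {
        float* dst = large.data() + i * row_length;
        if (map[i] < 0) {
            // Unmapped row: a constant stands in for the "random values" case.
            for (std::size_t j = 0; j < row_length; ++j) dst[j] = fill_value;
        } else {
            assert(map[i] < small_rows);
            const float* src = small.data() + static_cast<std::size_t>(map[i]) * row_length;
            std::memcpy(dst, src, row_length * sizeof(float));
        }
    }
    return large;
}
```

With a 2x2 small matrix {1, 2, 3, 4} and map {1, 0, 1, -1}, the result has four rows: row 1, row 0, row 1 again, and one filled row. Several destination rows may read the same source row, but every destination row is written exactly once, so the loop iterations are independent of each other.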

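I am trying to optimize this process with OpenMP in C++, but I can't seem to manage it. Everything I have tried so far only makes the run time go up, not down, as I add threads. Here is the code for the problem; the function I am trying to optimize is expand_matrix: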

#include <stdio.h>
#include <omp.h>
#include <random>
#include <stdlib.h>
#include <cstddef>
#include <ctime>
#include <cstring>
using namespace std;

// Allocate `size` bytes aligned to `align` bytes; returns 0 on failure.
inline void* aligned_malloc(size_t size, size_t align) {
    void *result;
    #ifdef _MSC_VER
    result = _aligned_malloc(size, align);
    #else
    if (posix_memalign(&result, align, size)) result = 0;
    #endif
    return result;
}

// Release memory obtained from aligned_malloc.
inline void aligned_free(void *ptr) {
    #ifdef _MSC_VER
    _aligned_free(ptr);
    #else
    free(ptr);
    #endif
}

void expand_matrix(int num_rows_in_large_matrix, int row_length, long long* map, float* small_matrix, float* large_matrix, const int num_threads);

int main() {
    int row_length = 128;
    long long small_matrix_rows = 100000;
    long long large_matrix_rows = 1000000;
    long long *map = new long long [large_matrix_rows];
    float *small_matrix = (float*)aligned_malloc(small_matrix_rows*row_length*sizeof(float), 128);
    float *large_matrix = (float*)aligned_malloc(large_matrix_rows*row_length*sizeof(float), 128);

    minstd_rand gen(std::random_device{}()); //NOTE: Valgrind will give an error saying: vex amd64->IR: unhandled instruction bytes: 0xF 0xC7 0xF0 0x89 0x6 0xF 0x42 0xC1 :: look: https://bugs.launchpad.net/ubuntu/+source/valgrind/+bug/
    uniform_real_distribution<double> values_dist(0, 1);
    // Both bounds are inclusive, so the last valid row index is small_matrix_rows - 1.
    uniform_int_distribution<long long> map_dist(0, small_matrix_rows - 1);
    for (long long i = 0; i < small_matrix_rows*row_length; i++) {
        small_matrix[i] = values_dist(gen) - 0.5;
    }

    for (long long i = 0; i < large_matrix_rows; i++) {
        if (values_dist(gen) < 0.99)
            map[i] = map_dist(gen);
        else
            map[i] = -1; // about 1% of the rows have no source row
    }

    clock_t start, end;
    int num_threads = 4;
    printf("Populated matrix and generated map\n");
    start = clock();
    expand_matrix(large_matrix_rows, row_length, map, small_matrix, large_matrix, num_threads);
    end = clock();
    printf("Time to expand using %d threads = %f\n", num_threads, double(end-start)/CLOCKS_PER_SEC);

    aligned_free(small_matrix);
    aligned_free(large_matrix);
    delete[] map;
    return 0;
}

// Fill each row of large_matrix from the row of small_matrix selected by map.
void expand_matrix(int num_rows_in_large_matrix, int row_length, long long* map, float* small_matrix, float* large_matrix, const int num_threads) {
  #pragma omp parallel num_threads(num_threads)
  {
    #pragma omp for schedule(guided, 4)
    for (int i = 0; i < num_rows_in_large_matrix; i++) {
      float* dst = large_matrix + (size_t)i * row_length;
      long long sml = map[i];
      if (sml == -1) {
        // Unmapped row: fill with a placeholder value.
        for (int j = 0; j < row_length; j++)
          dst[j] = 0.5;
      } else {
        memcpy(dst, small_matrix + sml*row_length, row_length*sizeof(float));
      }
    }
  }
}

Here are some run times:

Time to expand using 1 threads = 0.402949
Time to expand using 2 threads = 0.530361
Time to expand using 4 threads = 0.608085
Time to expand using 8 threads = 0.667806
Time to expand using 16 threads = 0.999886

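I have made sure the matrices are memory-aligned, and I have tried copying with non-temporal instructions, but I'm stumped. I don't know where else to look. Any help is much appreciated.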

Some hardware info:

CPU: Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              20480K

Using Ubuntu 16.04 and gcc version 5.5.0 20171010 (Ubuntu 5.5.0-12ubuntu1~16.04).

【Comments】:

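It's data copying. How do you think it can be parallelized? You can parallelize the computation and the bookkeeping of where the data goes, but how much do you think your CPU is actually used in pure data copying? It is mostly limited by the performance of the RAM, which is shared between the CPUs.

Try using omp_get_wtime() instead of clock() and tell us. clock() measures CPU time, which increases with the number of CPUs you add. omp_get_wtime() measures wall-clock time, which is the time you want to see decrease.

FYI: SO: Multi-threading benchmarking issues.

Sure, that's fine with me.

【Answer 1】: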

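Thanks to @Gilles and @Zulan for pointing out the mistake. I am posting this as an answer so that others can see the problem. I was using the wrong method to measure time; my method does not work for multithreaded applications. In other words, I misused the clock() function. Here is @Gilles's comment: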

clock() measures CPU time, which increases with the number of CPUs you add. omp_get_wtime() measures wall-clock time, which is the time you want to see decrease.
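
The distinction in the quoted comment can be sketched by timing the same parallel region both ways. This is a minimal illustration, not code from the question: the names `Timing` and `time_busy_loop` are hypothetical, and `std::chrono::steady_clock` is used here as the wall clock. It assumes a build with OpenMP enabled (for example `-fopenmp` with gcc); without it the pragma is ignored and the loop runs on one thread.

```cpp
#include <cassert>
#include <chrono>
#include <ctime>

struct Timing {
    double cpu_seconds;   // from clock(): processor time used by the process
    double wall_seconds;  // from steady_clock: elapsed real time
};

// Hypothetical helper: every thread runs the same busy loop; the region is timed both ways.
Timing time_busy_loop(int num_threads, long long iterations_per_thread) {
    assert(num_threads > 0 && iterations_per_thread >= 0);
    const std::clock_t cpu_start = std::clock();
    const auto wall_start = std::chrono::steady_clock::now();

    double total = 0.0;
    #pragma omp parallel num_threads(num_threads) reduction(+ : total)
    {
        double local = 0.0;
        for (long long k = 0; k < iterations_per_thread; ++k) local += 1e-9 * k;
        total += local;
    }
    volatile double sink = total;  // keeps the compiler from discarding the loop
    (void)sink;

    const std::clock_t cpu_end = std::clock();
    const auto wall_end = std::chrono::steady_clock::now();
    Timing t;
    t.cpu_seconds = double(cpu_end - cpu_start) / CLOCKS_PER_SEC;
    t.wall_seconds = std::chrono::duration<double>(wall_end - wall_start).count();
    return t;
}
```

On Linux with glibc, where clock() reports processor time summed over all threads of the process, `cpu_seconds` should grow roughly in proportion to the number of busy threads while `wall_seconds` stays about the same, provided enough cores are free.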

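The function I used to measure the execution time was clock(). It counts the processor clock ticks consumed by all the processors involved in running the code. When I run my code in parallel on several processors, the ticks returned by clock() are the total over all of them, so the number only grows as processors are added. After switching the time measurement to omp_get_wtime(), the reported times were correct, and I got the following results: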

1 thread = 0.423516
4 threads = 0.152680 
8 threads = 0.090841
16 threads = 0.064748
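
To put these numbers in perspective, the sketch below derives speedup and parallel efficiency from the timings listed above, plus a rough write rate. The helper names are hypothetical, and the write rate counts only the bytes stored into large_matrix (1000000 rows of 128 floats), ignoring the reads from small_matrix, so treat it as an estimate.

```cpp
#include <cassert>
#include <cstddef>

struct Run {
    int threads;
    double seconds;  // wall-clock time of one expand_matrix call
};

// Speedup of `run` relative to `baseline`.
double speedup(const Run& baseline, const Run& run) {
    assert(run.seconds > 0.0);
    return baseline.seconds / run.seconds;
}

// Parallel efficiency: speedup divided by the increase in thread count.
double efficiency(const Run& baseline, const Run& run) {
    assert(baseline.threads > 0 && run.threads > 0);
    return speedup(baseline, run) * baseline.threads / run.threads;
}

// Bytes written to the large matrix per second, in GB/s (1 GB = 1e9 bytes).
double written_gb_per_s(long long rows, int row_length, double seconds) {
    assert(seconds > 0.0);
    const double bytes = double(rows) * row_length * sizeof(float);
    return bytes / seconds / 1e9;
}
```

With the one-thread run as the baseline, the speedups come out at about 2.8x, 4.7x and 6.5x for 4, 8 and 16 threads, which corresponds to efficiencies of about 0.69, 0.58 and 0.41. The 16-thread run writes the 512 MB of large_matrix at roughly 7.9 GB/s.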

So, instead of measuring the run time like this:

    clock_t start, end;
    start = clock();
    expand_matrix(large_matrix_rows, row_length, map, small_matrix, large_matrix, num_threads);
    end = clock();
    printf("Total time %f\n", double(end-start)/CLOCKS_PER_SEC);

I now do it like this:

    double start, end;
    start = omp_get_wtime();
    expand_matrix(large_matrix_rows, row_length, map, small_matrix, large_matrix, num_threads);
    end = omp_get_wtime();
    printf("Total time %f\n", end-start);
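
The wall-clock measurement above can be wrapped in a small reusable helper. The following is a sketch under stated assumptions: `wall_seconds` and `best_wall_time` are hypothetical names, the fallback to `std::chrono::steady_clock` when OpenMP is not enabled is an addition for portability, and keeping the fastest of several repeats is one common way to reduce noise rather than something the answer prescribes.

```cpp
#include <cassert>
#include <chrono>
#include <limits>
#ifdef _OPENMP
#include <omp.h>
#endif

// Wall-clock seconds: omp_get_wtime() when built with OpenMP, steady_clock otherwise.
double wall_seconds() {
#ifdef _OPENMP
    return omp_get_wtime();
#else
    using clock = std::chrono::steady_clock;
    return std::chrono::duration<double>(clock::now().time_since_epoch()).count();
#endif
}

// Runs `work` `repeats` times and returns the shortest elapsed wall-clock time in seconds.
template <typename Work>
double best_wall_time(Work&& work, int repeats) {
    assert(repeats > 0);
    double best = std::numeric_limits<double>::infinity();
    for (int r = 0; r < repeats; ++r) {
        const double start = wall_seconds();
        work();
        const double elapsed = wall_seconds() - start;
        if (elapsed < best) best = elapsed;
    }
    return best;
}
```

For example, `best_wall_time([&] { expand_matrix(large_matrix_rows, row_length, map, small_matrix, large_matrix, num_threads); }, 5)` would time five calls and report the fastest one. Calling it in a loop over different values of num_threads gives a table like the one above.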

