Timing of array multiplication vs. sse intrinsics multiplication?

Posted 2023-02-16


[Title]: Timing of array multiplication vs. sse intrinsics multiplication?
[Posted]: 2014-10-21 19:31:37
[Question]:

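I wrote the following code to test my understanding of sse intrinsics. The code compiles and runs correctly, but the improvement from sse is not very significant: using sse intrinsics is approx. 20% faster. Shouldn't it be roughly 4x faster (a 400% speed up)? Did the compiler optimize the scalar loop? If so, how can that be disabled? Is there something wrong with the sse_mult() function I wrote?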

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <emmintrin.h>
// gcc options -mfpmath=sse -mmmx -msse -msse2 \ Not sure if any are needed have been using -msse2

/*--------------------------------------------------------------------------------------------------
 * SIMD intrinsics header files
 *
 * <mmintrin.h>  MMX
 * <xmmintrin.h> SSE
 * <emmintrin.h> SSE2
 * <pmmintrin.h> SSE3
 * <tmmintrin.h> SSSE3
 * <smmintrin.h> SSE4.1
 * <nmmintrin.h> SSE4.2
 * <ammintrin.h> SSE4A
 * <wmmintrin.h> AES
 * <immintrin.h> AVX
 *------------------------------------------------------------------------------------------------*/

#define n 1000000

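// Global variables
// _mm_load_ps and _mm_store_ps require 16-byte aligned addresses, so every array they touch is aligned.
// __declspec(align(16)) is the MSVC spelling; with GCC, __attribute__((aligned(16))) also forces the alignment.
__declspec(align(16)) float a[n]; // array to hold random numbers
__declspec(align(16)) float b[n]; // array to hold random numbers
float c[n]; // array to hold product a*b for scalar multiply
__declspec(align(16)) float d[n]; // array to hold product a*b for sse multiply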

// Multiply using loop
void loop_mult() {
    int i; // Loop index

    clock_t begin_loop, end_loop; // clock_t is type returned by clock()
    double time_spent_loop;

    // Time multiply operation
    begin_loop = clock();
        // Multiply two arrays of floats
        for(i = 0; i < n; i++) {
            c[i] = a[i] * b[i];
        }
    end_loop = clock();

    // Calculate time it took to run loop. CLOCKS_PER_SEC is the # of clock ticks per second.
    time_spent_loop = (double)(end_loop - begin_loop) / CLOCKS_PER_SEC;
    printf("Time for scalar loop was %f seconds\n", time_spent_loop);
}

// Multiply using sse
void sse_mult() {
    int k,i; // Index
    clock_t begin_sse, end_sse; // clock_t is type returned by clock()
    double time_spent_sse;

    // Time multiply operation
    begin_sse = clock();
        // Multiply two arrays of floats
        __m128 x,y,result; // __m128 is a data type, can hold 4 32 bit floating point values
        result = _mm_setzero_ps(); // set register to hold all zeros
        for(k = 0; k <= (n-4); k += 4) {
            x = _mm_load_ps(&a[k]); // Load chunk of 4 floats into register (address must be 16-byte aligned)
            y = _mm_load_ps(&b[k]);
            result = _mm_mul_ps(x,y); // multiply 4 floats
            _mm_store_ps(&d[k],result); // store result in array
        }
        int extra = n%4; // If array size isn't exactly a multiple of 4 use scalar ops for remainder
        if(extra!=0) {
            for(i = (n-extra); i < n; i++) {
                d[i] = a[i] * b[i];
            }
        }
    end_sse = clock();

    // Calculate time it took to run loop. CLOCKS_PER_SEC is the # of clock ticks per second.
    time_spent_sse = (double)(end_sse - begin_sse) / CLOCKS_PER_SEC;
    printf("Time for sse was %f seconds\n", time_spent_sse);
}

int main() {
    int i; // Loop index

    srand((unsigned)time(NULL)); // initial value that rand uses, called the seed
        // unsigned guarantees positive values
        // time(NULL) uses the system clock as the seed so values will be different each time

    for(i = 0; i < n; i++) {
        // Fill arrays with random numbers
        a[i] = ((float)rand()/RAND_MAX)*10; // rand() returns an integer value between 0 and RAND_MAX
        b[i] = ((float)rand()/RAND_MAX)*20;
    }

    loop_mult();
    sse_mult();
    for(i=0; i<n; i++) {
        // printf("a[%d] = %f\n", i, a[i]); // print values to check
        // printf("b[%d] = %f\n", i, b[i]);
        // printf("c[%d] = %f\n", i, c[i]);
        // printf("d[%d] = %f\n", i, d[i]);
        if(c[i]!=d[i]) {
            printf("Error with sse multiply.\n");
            break;
        }
    }

    return 0;
}

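[Comments]:

Set n to 2048 instead of 1000000. Turn on loop unrolling with -funroll-loops. Repeat your loop several times and then divide the time by the repeat count. See ***.com/questions/25774190/… for an example with z[i] = x[i] + y[i].

I don't see -O2 or any other optimization flag in the question?

[Answer 1]: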

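Your program is memory bound. SSE does not make much of a difference because most of the time is spent reading those big arrays from RAM. Reduce the size of the arrays so that they fit into cache, and increase the number of passes instead. When all the data is already in cache, the SSE version should run noticeably faster.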
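To make the memory-bound argument concrete, the working set of one pass can be estimated with a few lines of C. This is only a rough sketch: the helper names `working_set_bytes` and `print_working_sets` are hypothetical, and the 32 KiB L1 data cache is an assumed figure that is typical of many x86 cores, not something stated in the question. Check the cache sizes of your own CPU.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: bytes touched by one pass over `arrays` float arrays
 * of `len` elements each. Each multiply kernel in the question touches three
 * arrays: the two inputs a and b, plus one output (c or d). */
size_t working_set_bytes(size_t len, size_t arrays)
{
    return len * arrays * sizeof(float);
}

/* Prints the footprint for the two array sizes discussed in this thread. */
void print_working_sets(void)
{
    /* Assumed L1 data cache size; the real value depends on the CPU. */
    const size_t assumed_l1d = 32 * 1024;
    const size_t sizes[] = { 1000000, 2048 };

    for (size_t i = 0; i < sizeof sizes / sizeof sizes[0]; i++) {
        size_t bytes = working_set_bytes(sizes[i], 3);
        printf("n = %zu: %zu bytes per pass (%s the assumed 32 KiB L1d)\n",
               sizes[i], bytes,
               bytes <= assumed_l1d ? "fits in" : "exceeds");
    }
}
```

With 4-byte floats, n = 1000000 gives about 12 MB per pass, far beyond the assumed L1 size, while n = 2048 gives 24 KiB, which fits.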

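Keep in mind that other factors may be involved as well:

1. GCC can auto-vectorize loops (to some extent). I think it needs -O3.
2. The first method you test will be slower because the cache is not yet populated. You may want to run both methods several times, alternating between them.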
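The advice above (small arrays, many passes, alternating the two methods) could be sketched roughly as follows. This is an illustration under stated assumptions, not the original poster's code: `LEN`, `mul_scalar`, `mul_sse`, `seconds_per_pass`, and `run_benchmark` are hypothetical names, 2048 is the size suggested in the comments, and the kernel uses unaligned loads (`_mm_loadu_ps`) so it does not depend on array alignment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <time.h>
#include <xmmintrin.h>  /* SSE: __m128, _mm_loadu_ps, _mm_mul_ps, _mm_storeu_ps */

#define LEN 2048  /* assumed size: 3 arrays * 2048 floats = 24 KiB */

typedef void (*mul_fn)(float *, const float *, const float *, size_t);

/* Scalar reference kernel. An optimizing compiler may vectorize this too. */
void mul_scalar(float *out, const float *x, const float *y, size_t len)
{
    for (size_t i = 0; i < len; i++)
        out[i] = x[i] * y[i];
}

/* SSE kernel: 4 floats per iteration, then a scalar tail for len % 4. */
void mul_sse(float *out, const float *x, const float *y, size_t len)
{
    size_t i = 0;
    for (; i + 4 <= len; i += 4) {
        __m128 vx = _mm_loadu_ps(&x[i]);
        __m128 vy = _mm_loadu_ps(&y[i]);
        _mm_storeu_ps(&out[i], _mm_mul_ps(vx, vy));
    }
    for (; i < len; i++)
        out[i] = x[i] * y[i];
}

/* Average seconds per pass over `passes` back-to-back calls of `fn`. */
double seconds_per_pass(mul_fn fn, float *out, const float *x,
                        const float *y, size_t len, int passes)
{
    assert(passes > 0);
    clock_t begin = clock();
    for (int p = 0; p < passes; p++)
        fn(out, x, y, len);
    clock_t end = clock();
    return (double)(end - begin) / CLOCKS_PER_SEC / passes;
}

/* Warms the cache, alternates both kernels over several rounds, and prints
 * the accumulated per-pass times. Returns 1 if both outputs match, else 0. */
int run_benchmark(int passes)
{
    static float a[LEN], b[LEN], c[LEN], d[LEN];
    double t_scalar = 0.0, t_sse = 0.0;

    for (size_t i = 0; i < LEN; i++) {
        a[i] = (float)i * 0.5f;
        b[i] = (float)(LEN - i) * 0.25f;
    }

    /* Warm-up so neither kernel pays for the first cache misses. */
    mul_scalar(c, a, b, LEN);
    mul_sse(d, a, b, LEN);

    for (int round = 0; round < 4; round++) {
        t_scalar += seconds_per_pass(mul_scalar, c, a, b, LEN, passes);
        t_sse += seconds_per_pass(mul_sse, d, a, b, LEN, passes);
    }
    printf("scalar: %g s/pass, sse: %g s/pass (summed over 4 rounds)\n",
           t_scalar, t_sse);

    for (size_t i = 0; i < LEN; i++)
        if (c[i] != d[i])
            return 0;
    return 1;
}
```

Calling `run_benchmark(100000)` from `main` gives both kernels the same warm cache. If the two times come out nearly equal at -O3, the scalar loop has probably been auto-vectorized; with GCC, building with -fno-tree-vectorize is one way to keep it scalar for comparison.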

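[Comments]:

The other thing I would suggest is turning on loop unrolling with -funroll-loops.

What do you mean by increasing the number of passes? Would it be better to create two separate programs, one for the scalar loop and one for sse?

@biononic Instead of processing one large array, process a smaller array but do it many times in a loop. That is just one way to reduce the impact of cache misses. Two programs should help you isolate the two cases from each other, but that will only help with point 2.

Thanks! When I set the array size to 2048 and then loop the function calls, I get much closer to the expected result. The more passes, the more noticeable the improvement.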
