Discrepancy in result of Intrinsics vs Naive Vector reduction

Posted 2023-02-16


【Title】Discrepancy in result of Intrinsics vs Naive Vector reduction 【Posted】: 2021-12-30 14:13:40 【Question】:

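I have been comparing the run times of an intrinsics vector reduction, a naive vector reduction, and a vector reduction using an OpenMP pragma. However, I found that the results differ between these cases. The code is below (the intrinsics reduction is taken from "Fastest way to do horizontal SSE vector sum (or other reduction)"):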

#include <iostream>
#include <chrono>
#include <vector>
#include <numeric>
#include <algorithm>
#include <cstdio>      // printf
#include <immintrin.h>


inline float hsum_ps_sse3(__m128 v) {
    __m128 shuf = _mm_movehdup_ps(v);        // broadcast elements 3,1 to 2,0
    __m128 sums = _mm_add_ps(v, shuf);
    shuf        = _mm_movehl_ps(shuf, sums); // high half -> low half
    sums        = _mm_add_ss(sums, shuf);
    return        _mm_cvtss_f32(sums);
}


float hsum256_ps_avx(__m256 v) {
    __m128 vlow  = _mm256_castps256_ps128(v);
    __m128 vhigh = _mm256_extractf128_ps(v, 1); // high 128
           vlow  = _mm_add_ps(vlow, vhigh);     // add the low 128
    return hsum_ps_sse3(vlow);         // and inline the sse3 version, which is optimal for AVX
    // (no wasted instructions, and all of them are the 4B minimum)
}


void reduceVector_Naive(std::vector<float> values) {
    float result = 0;
    for(int i=0; i<int(1e8); i++) {
        result  += values.at(i);
    }
    printf("Reduction Naive = %f \n", result);
}


void reduceVector_openmp(std::vector<float> values) {
    float result = 0;
    #pragma omp simd reduction(+: result)
    for(int i=0; i<int(1e8); i++) {
        result  += values.at(i);
    }

    printf("Reduction OpenMP = %f \n", result);
}


void reduceVector_intrinsics(std::vector<float> values) {
    float result = 0;
    float* data_ptr = values.data();

    // Each iteration reduces 8 floats to a scalar, then adds that scalar to result
    for(int i=0; i<1e8; i+=8) {
        result  += hsum256_ps_avx(_mm256_loadu_ps(data_ptr + i));
    }

    printf("Reduction Intrinsics = %f \n", result);
}


int main() {

    std::vector<float> values;

    for(int i=0; i<1e8; i++) {
        values.push_back(1);
    }

    reduceVector_Naive(values);
    reduceVector_openmp(values);
    reduceVector_intrinsics(values);

    // The result should be 1e8 in each case
    return 0;
}

However, my output is as follows:

Reduction Naive = 16777216.000000 
Reduction OpenMP = 16777216.000000 
Reduction Intrinsics = 100000000.000000

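As can be seen, only the intrinsics version computes the sum correctly, while the other two run into precision problems. I am fully aware of the precision problems that floats can have due to rounding, so my question is: why does the intrinsics version get the correct answer even though it is also doing floating-point arithmetic?

I compiled it with g++ -mavx2 -march=native -O3 -fopenmp main.cpp and tried GCC versions 7.5.0 and 10.3.0.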

TIA

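【Comments】:

SIMD reduction effectively uses multiple accumulators (especially if you use multiple vectors, which you should do to hide FP add latency), so it is a step in the direction of pairwise summation. Or at least it is if you don't defeat the purpose by reducing to a scalar inside the loop! As I mentioned in the hsum Q&A you linked (in the bullet point about dot products of arrays), do the horizontal work once at the end, and use _mm256_add_ps inside the loop.

If OpenMP had auto-vectorized properly, you would see 100000000.0 from it. Apparently .at(i), with its bounds check, instead of [i] is defeating OpenMP auto-vectorization (and auto-vectorization of the naive version too, if you use -ffast-math - godbolt.org/z/EeaWYPsYo). Even so, OpenMP produces crazy code here, using vgatherqps instead of contiguous loads.

godbolt.org/z/Y8cG6dozo shows reasonable OpenMP vectorization: it indexes a const float * inside the loop instead of calling std::vector's overloaded operator[], where the compiler fails to see that the resulting accesses are contiguous. Unfortunately, even with OpenMP, GCC still does not use multiple accumulators to hide FP latency, which limits it to 32 bytes per 4 cycles. clang does better, although unrolling more would seem reasonable if it only uses 4 vectors. godbolt.org/z/ssoqas575

【Answer 1】: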

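The naive loop adds 1.0 at a time and stops increasing at 16777216.000000, because a binary32 float does not have enough significand bits. 16777216 is 2^24, and a binary32 float carries 24 bits of precision, so 16777217 is not representable and 16777216 + 1.0 rounds back to 16777216.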

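See this Q&A: Why does a float variable stop incrementing at 16777216 in C#?

When you add the precomputed horizontal sums, each addition adds 8.0, so the value at which the accumulator stops increasing is about 16777216*8 = 134217728. Your experiment simply did not reach it.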
