SSE2 shift by vector

Posted 2023-02-16


[Title]: SSE2 shift by vector [Posted]: 2016-07-27 06:38:22 [Question]:

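I have been trying to implement a shift by vector with SSE2 intrinsics, but from experimentation and the Intel intrinsic guide, it appears that only the least significant part of the count vector is used.

To rephrase my question: given a vector v1, v2, ..., vn and a set of shift counts s1, s2, ..., sn, how do I compute a result r1, r2, ..., rn such that: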

r1 = v1 << s1
r2 = v2 << s2
...
rn = vn << sn

Because it seems that _mm_sll_epi* does this instead:

r1 = v1 << s1
r2 = v2 << s1
...
rn = vn << s1
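The behaviour described above can be reproduced with a minimal sketch. The helper name sll_epi64_demo is hypothetical and exists only for illustration; it relies on the documented fact that _mm_sll_epi64 takes its count from the low 64 bits of the count vector and applies it to both lanes.

```cpp
#include <emmintrin.h>
#include <cstdint>

// Hypothetical helper (not from the original post): shifts the lanes {v0, v1}
// left with _mm_sll_epi64, passing {s0, s1} as the count vector, and writes
// both 64-bit result lanes to out.
inline void sll_epi64_demo(uint64_t v0, uint64_t v1,
                           uint64_t s0, uint64_t s1, uint64_t out[2])
{
    // _mm_set_epi64x takes the high lane first, then the low lane.
    __m128i v     = _mm_set_epi64x((long long)v1, (long long)v0);
    __m128i count = _mm_set_epi64x((long long)s1, (long long)s0);

    // Only count[63:0] (s0) is read; s1 in the high lane is ignored.
    __m128i r = _mm_sll_epi64(v, count);

    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), r);
}
```

Calling sll_epi64_demo(7, 8, 4, 5, out) shifts both lanes by 4, so the second lane holds 8 << 4 rather than 8 << 5.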

Thanks in advance.

Edit:

Here is my code:

#include <iostream>
#include <cstdint>

#include <mmintrin.h>
#include <emmintrin.h>

namespace SIMD {

    using namespace std;

    class SSE2 {
    public:
        // flipped operands due to function arguments
        SSE2(uint64_t a, uint64_t b, uint64_t c, uint64_t d) { low = _mm_set_epi64x(b, a); high = _mm_set_epi64x(d, c); }

        uint64_t& operator[](int idx)
        {
            switch (idx) {
            case 0:
                _mm_storel_epi64((__m128i*)result, low);
                return result[0];
            case 1:
                _mm_store_si128((__m128i*)result, low);
                return result[1];
            case 2:
                _mm_storel_epi64((__m128i*)result, high);
                return result[0];
            case 3:
                _mm_store_si128((__m128i*)result, high);
                return result[1];
            }

            /* Out-of-range index: a reference cannot bind to the literal 0,
               so return a valid element instead */
            return result[0];
        }

        SSE2& operator<<=(const SSE2& rhs)
        {
            // _mm_sll_epi64 uses only the low 64 bits of its count operand
            low  = _mm_sll_epi64(low,  rhs.getlow());
            high = _mm_sll_epi64(high, rhs.gethigh());

            return *this;
        }

        void print()
        {
            uint64_t a[2];
            // unaligned store: a local array is not guaranteed to be 16-byte aligned
            _mm_storeu_si128((__m128i*)a, low);

            cout << hex;
            cout << a[0] << ' ' << a[1] << ' ';

            _mm_storeu_si128((__m128i*)a, high);

            cout << a[0] << ' ' << a[1] << ' ';
            cout << dec;
        }

        __m128i getlow() const
        {
            return low;
        }

        __m128i gethigh() const
        {
            return high;
        }
    private:
        __m128i low, high;
        uint64_t result[2];
    };
}

int main()
{
    // std:: is needed here: the using-directive above is local to namespace SIMD
    std::cout << "operator<<= test: vector << vector: ";
    {
        auto x = SIMD::SSE2(7, 8, 15, 10);
        auto y = SIMD::SSE2(4, 5,  6,  7);

        x.print();
        y.print();

        x <<= y;

        if (x[0] != 112 || x[1] != 256 || x[2] != 960 || x[3] != 1280) {
            std::cout << "FAILED: ";
            x.print();
            std::cout << std::endl;
        } else {
            std::cout << "PASSED" << std::endl;
        }
    }

    return 0;
}

What should happen is 7

Hope this helps, Jens.

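[Comments]:

Please, please, if you don't give us your variable declarations, how is anyone supposed to know what you are doing? Please first read the help text in the top menu to learn how to ask a question, then read the info text of the "c" tag (just click on it). Then see ***.com/help/mcve.

@JensGustedt I made a typo when tagging it; it should have been c++. But anyway, there you go.

Variable per-element shifts are part of AVX2, not SSE2. Also, they exist only for 32/64-bit element sizes. You can implement a per-element left shift by multiplication instead.

@EOF My elements are 64-bit, though. And a per-element left shift doesn't help with right shifts, which I need as well.

It looks like you're out of luck. You may have to move the vector elements to integer registers, shift them there and move them back, or split them into separate vector registers and merge them back after shifting.

[Answer 1]: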

If AVX2 is available, and your elements are 32 or 64 bits wide, your operation takes a single variable-shift instruction: vpsrlvq, (__m128i _mm_srlv_epi64 (__m128i a, __m128i count))
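As a rough sketch of the AVX2 route, the following wraps _mm_srlv_epi64 behind a runtime check. The function names srlv_epi64 and srlv_epi64_avx2 are hypothetical, and the target attribute and __builtin_cpu_supports are GCC/Clang extensions, so this assumes one of those compilers on x86-64. The scalar fallback mirrors the instruction's behaviour of producing 0 for counts above 63.

```cpp
#include <immintrin.h>
#include <cstdint>

// Hypothetical helper: compiled for AVX2 via a GCC/Clang target attribute,
// so the rest of the file can be built without -mavx2.
__attribute__((target("avx2")))
static void srlv_epi64_avx2(const uint64_t v[2], const uint64_t s[2], uint64_t out[2])
{
    __m128i a     = _mm_loadu_si128(reinterpret_cast<const __m128i*>(v));
    __m128i count = _mm_loadu_si128(reinterpret_cast<const __m128i*>(s));
    // vpsrlvq: each 64-bit lane is shifted right by its own count
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), _mm_srlv_epi64(a, count));
}

// Hypothetical dispatcher: uses AVX2 when the CPU reports it, scalar code otherwise.
void srlv_epi64(const uint64_t v[2], const uint64_t s[2], uint64_t out[2])
{
    if (__builtin_cpu_supports("avx2")) {
        srlv_epi64_avx2(v, s, out);
    } else {
        for (int i = 0; i < 2; ++i)
            out[i] = s[i] < 64 ? v[i] >> s[i] : 0;  // counts above 63 give 0
    }
}
```

In a real build it is usually simpler to compile the whole translation unit with -mavx2 and call _mm_srlv_epi64 (or _mm_sllv_epi64 for left shifts) directly.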

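For 32-bit elements with SSE4.1, see Shifting 4 integers right by different values SIMD. Depending on latency vs. throughput requirements, you can do separate shifts and then blend, or use a multiply (by a specially constructed vector of powers of 2) to get a variable-count left shift and then do a right shift with the same count for all elements.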
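The multiply idea can be sketched using SSE2 alone. This is an illustrative variant, not the code from the linked answer: the name sllv_epi32_sse2 is hypothetical, the powers of 2 are built by writing the count into a float's exponent field, and the 32-bit low multiply is assembled from two _mm_mul_epu32 calls because _mm_mullo_epi32 requires SSE4.1. It assumes every count is in the range 0..30.

```cpp
#include <emmintrin.h>
#include <cstdint>

// Hypothetical helper: per-lane left shift of four 32-bit elements.
// Assumes each count is in 0..30.
inline __m128i sllv_epi32_sse2(__m128i v, __m128i count)
{
    // Build 2^count per lane: put (count + 127) into the float exponent
    // field, then convert the resulting float back to an integer.
    __m128i exp  = _mm_add_epi32(_mm_slli_epi32(count, 23),
                                 _mm_set1_epi32(0x3f800000));
    __m128i pow2 = _mm_cvttps_epi32(_mm_castsi128_ps(exp));

    // pmuludq multiplies lanes 0 and 2; shift lanes 1 and 3 down to reuse it.
    __m128i even = _mm_mul_epu32(v, pow2);
    __m128i odd  = _mm_mul_epu32(_mm_srli_epi64(v, 32),
                                 _mm_srli_epi64(pow2, 32));

    // Keep the low 32 bits of each 64-bit product and re-interleave them.
    even = _mm_shuffle_epi32(even, _MM_SHUFFLE(0, 0, 2, 0));
    odd  = _mm_shuffle_epi32(odd,  _MM_SHUFFLE(0, 0, 2, 0));
    return _mm_unpacklo_epi32(even, odd);
}
```

Multiplying by 2^s gives the same low 32 bits as shifting left by s, which is why the truncated product can stand in for the shift.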

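For your case, 64-bit elements with runtime-variable shift counts:

There are only two elements per SSE vector, so we just need two shifts and then have to combine the results. We can do that with pblendw, or with a floating-point movsd (which may cause extra bypass-delay latency on some CPUs), or with two shuffles, or with two ANDs and an OR.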

#include <emmintrin.h>

__m128i SSE2_emulated_srlv_epi64(__m128i a, __m128i count)
{
    __m128i shift_low = _mm_srl_epi64(a, count);          // high 64 is garbage
    __m128i count_high = _mm_unpackhi_epi64(count,count); // broadcast the high element
    __m128i shift_high = _mm_srl_epi64(a, count_high);    // low 64 is garbage
    // SSE4.1:
    // return _mm_blend_epi16(shift_low, shift_high, 0xF0);  // set mask bits pick from shift_high: the high four words

#if 1   // use movsd to blend
    __m128d blended = _mm_move_sd( _mm_castsi128_pd(shift_high), _mm_castsi128_pd(shift_low) );  // use movsd as a blend.  Faster than multiple instructions on most CPUs, but probably bad on Nehalem.
    return _mm_castpd_si128(blended);
#else  // SSE2 without using FP instructions:
    // if we're going to do it this way, we could have shuffled the input before shifting.  Probably not helpful though.
    shift_high = _mm_unpackhi_epi64(shift_high, shift_high);       // broadcast the high64
    return       _mm_unpacklo_epi64(shift_low, shift_high);        // combine: low64 from shift_low, high64 from shift_high
#endif
}

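Other shuffles like pshufd or psrldq would also work, but punpckhqdq gets the job done without needing an immediate byte, so it is one byte shorter. SSSE3 palignr could put the high element of one register and the low element of another register into one vector, but they would be reversed (so we would need a pshufd to swap the high and low halves). shufpd would work as a blend, but has no advantage over movsd.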
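For the left-shift direction the question asks about, the "two ANDs and an OR" way of combining the results might look like the following sketch. The name sse2_sllv_epi64_andor is hypothetical. It stays entirely in the integer domain, at the cost of more instructions than the movsd blend.

```cpp
#include <emmintrin.h>
#include <cstdint>

// Hypothetical helper: per-lane left shift of two 64-bit elements with SSE2,
// merging the two partial results with AND / ANDNOT / OR.
inline __m128i sse2_sllv_epi64_andor(__m128i a, __m128i count)
{
    __m128i shift_low  = _mm_sll_epi64(a, count);            // low lane correct
    __m128i count_high = _mm_unpackhi_epi64(count, count);   // broadcast high count
    __m128i shift_high = _mm_sll_epi64(a, count_high);       // high lane correct

    // All-ones in the low 64 bits, zero in the high 64 bits.
    const __m128i low_mask = _mm_set_epi64x(0, -1);

    __m128i lo = _mm_and_si128(low_mask, shift_low);      // keep low lane
    __m128i hi = _mm_andnot_si128(low_mask, shift_high);  // keep high lane
    return _mm_or_si128(lo, hi);
}
```

With the question's first pair of values, shifting {7, 8} by {4, 5} yields {112, 256}, which is what the test in the question expects.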

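See Agner Fog's microarch guide for details on the bypass-delay latency that can result from using an FP instruction between two integer instructions. It is probably fine on Intel SnB-family CPUs, because other FP shuffles are. (And yes, movsd xmm1, xmm0 runs on the shuffle unit on port 5. If you don't need the merging behaviour, use movaps or movapd for reg-reg moves, even for scalars.)

This compiles (on Godbolt with gcc5.3 -O3) to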

    movdqa  xmm2, xmm0  # tmp97, a
    psrlq   xmm2, xmm1    # tmp97, count
    punpckhqdq      xmm1, xmm1  # tmp99, count
    psrlq   xmm0, xmm1    # tmp100, tmp99
    movsd   xmm0, xmm2    # tmp102, tmp97
    ret
