Is there a way to shuffle a 8bitX32 ymm register right/left by N positions (c++)

Posted 2023-02-16


Asked: 2021-02-12 22:17:52

Question:

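As the title says, I need a way to shift/shuffle the positions of all elements in a 256-bit AVX register by N places. Everything I have found on this uses 32- or 64-bit values (__builtin_ia32_permvarsf256) and so on. Any help would be greatly appreciated.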

Example: 2,4,4,2,4,5,0,0,0,0,... shift right by 4 -> 0,0,0,0,2,4,4,2,4,5,...

Comments:

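You need the VPSHUFB instruction (or the corresponding C++ intrinsic, _mm256_shuffle_epi8). If N is a constant, you can also look at _mm256_bslli_epi128 / _mm256_bsrli_epi128, plus _mm256_permute2x128_si256 + _mm256_or_si256 to combine the shifted lanes.

Not easily (there is no single-shuffle solution until AVX-512, unless N is a multiple of 4). It is often easier to reload from memory and let the load execution unit handle the misalignment, if the data you want is still in memory.

Thanks for all the answers, I'll look into _mm256_shuffle_epi8. N cannot be a constant. That is one of the reasons I'm having so much trouble with this: every other solution I have found so far requires N to be a constant. @Peter Cordes: do you mean I should write the register's elements out through a pointer, shuffle them, and load them back in? That would be a possible solution, but I would rather avoid it.

@lunave: (make sure you @ the person you're replying to so they get notified). No, I mean that if your original vector was just loaded from memory, consider doing another, overlapping load from a different offset. If it wasn't, store-then-reload is not nearly as good: it costs more front-end uops for throughput, and a store-forwarding stall for latency. For non-constant byte-shift counts it is still worth considering, though; it is a latency vs. throughput trade-off, and the best choice depends on the surrounding code.

Answer 1: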

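If the shift distance is known at compile time, this is relatively easy and quite fast. The only caveat is that the 32-byte byte-shift instructions operate independently on the two 16-byte lanes, so for shifts of less than 16 bytes you need to propagate those few bytes across lanes yourself. Here is the left shift: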

#include <immintrin.h>

// Move 16-byte vector to higher half of the output, and zero out the lower half
inline __m256i setHigh( __m128i v16 )
{
    const __m256i v = _mm256_castsi128_si256( v16 );
    // Immediate 8: bit 3 zeroes the low lane, bits [5:4] == 0 copy the low lane of v into the high lane
    return _mm256_permute2x128_si256( v, v, 8 );
}

template<int i>
inline __m256i shiftLeftBytes( __m256i src )
{
    static_assert( i >= 0 && i < 32 );
    if constexpr( i == 0 )
        return src;
    if constexpr( i == 16 )
        return setHigh( _mm256_castsi256_si128( src ) );
    if constexpr( 0 == ( i % 8 ) )
    {
        // Shifting by multiples of 8 bytes is faster with shuffle + blend
        constexpr int lanes64 = i / 8;
        constexpr int shuffleIndices = ( _MM_SHUFFLE( 3, 2, 1, 0 ) << ( lanes64 * 2 ) ) & 0xFF;
        src = _mm256_permute4x64_epi64( src, shuffleIndices );
        constexpr int blendMask = ( 0xFF << ( lanes64 * 2 ) ) & 0xFF;
        return _mm256_blend_epi32( _mm256_setzero_si256(), src, blendMask );
    }
    if constexpr( i > 16 )
    {
        // Shifting by more than half of the register
        // Shift low half by ( i - 16 ) bytes to the left, and place into the higher half of the result.
        __m128i low = _mm256_castsi256_si128( src );
        low = _mm_slli_si128( low, i - 16 );
        return setHigh( low );
    }
    else
    {
        // Shifting by less than half of the register, using vpalignr to shift.
        // `low` holds zeros in its low lane and the low half of src in its high lane,
        // which is exactly what each lane needs to have shifted in.
        __m256i low = setHigh( _mm256_castsi256_si128( src ) );
        return _mm256_alignr_epi8( src, low, 16 - i );
    }
}
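However, if the shift distance is not known at compile time, this gets rather tricky. Here is one approach. It uses quite a few shuffles, but I hope it is still somewhat faster than the obvious way of two 32-byte stores (one of them writing zeros) followed by a 32-byte load.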

#include <immintrin.h>
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// 16 bytes of 0xFF (which makes `vpshufb` output zeros), followed by 16 bytes of identity shuffle [ 0 .. 15 ], followed by another 16 bytes of 0xFF
// That data allows to shift 16-byte vectors by runtime-variable count of bytes in [ -16 .. +16 ] range
inline std::array<uint8_t, 48> makeShuffleConstants()
{
    std::array<uint8_t, 48> res;
    std::fill_n( res.begin(), 16, 0xFF );
    for( uint8_t i = 0; i < 16; i++ )
        res[ (size_t)16 + i ] = i;
    std::fill_n( res.begin() + 32, 16, 0xFF );
    return res;
}
// Align by 64 bytes so the complete array stays within cache line
alignas( 64 ) static const std::array<uint8_t, 48> shuffleConstants = makeShuffleConstants();

// Load shuffle constant with offset in bytes. Counterintuitively, positive offset shifts output to the right.
inline __m128i loadShuffleConstant( int offset )
{
    assert( offset >= -16 && offset <= 16 );
    return _mm_loadu_si128( ( const __m128i * )( shuffleConstants.data() + 16 + offset ) );
}

// Move 16-byte vector to higher half of the output, and zero out the lower half
inline __m256i setHigh( __m128i v16 )
{
    const __m256i v = _mm256_castsi128_si256( v16 );
    return _mm256_permute2x128_si256( v, v, 8 );
}

inline __m256i shiftLeftBytes( __m256i src, int i )
{
    assert( i >= 0 && i < 32 );
    if( i >= 16 )
    {
        // Shifting by more than half of the register
        // Shift low half by ( i - 16 ) bytes to the left, and place into the higher half of the result.
        __m128i low = _mm256_castsi256_si128( src );
        low = _mm_shuffle_epi8( low, loadShuffleConstant( 16 - i ) );
        return setHigh( low );
    }
    else
    {
        // Shifting by less than half of the register
        // Just like _mm256_slli_si256, _mm_shuffle_epi8 can't move data across 16-byte lanes, need to propagate shifted bytes manually.
        // Right-shift the low half by ( 16 - i ) bytes: these are the bytes that cross into the high lane.
        __m128i low = _mm256_castsi256_si128( src );
        low = _mm_shuffle_epi8( low, loadShuffleConstant( 16 - i ) );
        // Left-shift both lanes of src by i bytes with a broadcast control vector.
        const __m256i cv = _mm256_broadcastsi128_si256( loadShuffleConstant( -i ) );
        const __m256i high = setHigh( low );
        src = _mm256_shuffle_epi8( src, cv );
        return _mm256_or_si256( high, src );
    }
}
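
To see why a single 48-byte table covers every shift count, the sketch below models the table and the per-byte rule of vpshufb in scalar code: a control byte with its high bit set produces zero, otherwise its low 4 bits select a source byte. Bytes16, makeTable, controlAt and pshufbRef are hypothetical names used only for illustration; this is a model for reasoning about the code above, not a replacement for it.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using Bytes16 = std::array<std::uint8_t, 16>;

// Same layout as shuffleConstants: 16 x 0xFF, then 0..15, then 16 x 0xFF.
inline std::array<std::uint8_t, 48> makeTable()
{
    std::array<std::uint8_t, 48> t;
    t.fill( 0xFF );
    for( std::size_t i = 0; i < 16; i++ )
        t[ 16 + i ] = static_cast<std::uint8_t>( i );
    return t;
}

// Scalar model of loadShuffleConstant: 16 control bytes starting at table[ 16 + offset ].
inline Bytes16 controlAt( int offset )
{
    assert( offset >= -16 && offset <= 16 );
    static const std::array<std::uint8_t, 48> table = makeTable();
    Bytes16 c;
    for( int j = 0; j < 16; j++ )
        c[ static_cast<std::size_t>( j ) ] = table[ static_cast<std::size_t>( 16 + offset + j ) ];
    return c;
}

// Scalar model of _mm_shuffle_epi8 on one 16-byte lane.
inline Bytes16 pshufbRef( const Bytes16& src, const Bytes16& control )
{
    Bytes16 dst{}; // zero by default
    for( std::size_t j = 0; j < 16; j++ )
        if( ( control[ j ] & 0x80 ) == 0 )
            dst[ j ] = src[ control[ j ] & 0x0F ];
    return dst;
}
```

In this model controlAt( -3 ) begins with three 0xFF bytes followed by 0, 1, 2, ..., so pshufbRef moves the data three bytes toward higher indices, while controlAt( 3 ) begins at index 3 and moves the data three bytes toward lower indices.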

Comments:

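The second approach is what I needed! Thanks.. it looks pretty complicated though.

"Somewhat faster than [two stores + reload]": yes for latency, not necessarily for throughput. A store-forwarding stall costs about 10 cycles of latency, but if out-of-order exec can hide it, it may not cost much throughput. (It isn't really a "stall" in the sense of blocking other work from running in parallel.)

Note that vperm2i128 (_mm256_permute2x128_si256) can zero either lane using bit 3 or bit 7 of the immediate, so you could use it for the constant i >= 16 case, together with a 256-bit @987654326@. On all CPUs except Zen1 that should be better than the vpslldq xmm / vpxor zero / vinsertf128 you would otherwise expect. For the non-constant case, I think you could broadcast-load the vpshufb control vector so that the shuffle happens in the upper lane. (Hopefully _mm256_set1_si128( loadShuffleConstant( 16 - i ) ) compiles to VBROADCASTI128.)

@PeterCordes Good idea, updated. My Visual C++ has no _mm256_set1_si128, but I have confirmed that _mm256_broadcastsi128_si256 does the right thing in release builds, i.e. vbroadcasti128 ymm1,oword ptr [r8]

Another idea on this: for the constant-i case, instead of << / >> / OR, use vperm2i128 to set up for vpalignr. The latter has the very inconvenient behavior of shifting in bytes from the corresponding lane, but a vperm2i128 that swaps lanes and zeroes one creates a vector holding exactly the right data to shift into each lane (zeros into the low lane, the low lane into the high lane).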
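
The store-and-reload alternative discussed in the answer and the comments can be sketched portably with memset/memcpy. With AVX2 the two writes and the read would presumably become 32-byte vector stores and one unaligned 32-byte load on the same 64-byte buffer. Bytes32 and shiftLeftViaMemory are hypothetical names, and this is only an illustration of the buffer layout, not a tuned implementation.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

using Bytes32 = std::array<std::uint8_t, 32>;

// Shift left by n bytes through memory: zeros in the lower half of the buffer,
// the data in the upper half, then one 32-byte read that starts n bytes early.
inline Bytes32 shiftLeftViaMemory( const Bytes32& src, std::size_t n )
{
    assert( n < 32 );
    alignas( 64 ) std::uint8_t buffer[ 64 ];
    std::memset( buffer, 0, 32 );                    // first store: 32 zero bytes
    std::memcpy( buffer + 32, src.data(), 32 );      // second store: the data itself
    Bytes32 dst;
    std::memcpy( dst.data(), buffer + 32 - n, 32 );  // reload at a runtime-variable offset
    return dst;
}
```

For any n between 1 and 31 the reload spans bytes written by both stores, which is the kind of access where the store-forwarding stall mentioned in the comments would be expected.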
