How to convert two _pd into one _ps?

Posted 2023-02-16

Asked: 2019-02-04 13:54:38

Question:

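I'm looping over some data and computing with doubles. After every 2 __m128d operations, I want to store the results in one __m128 of floats.

So 64+64 + 64+64 bits (2 __m128d) are stored into 1 32+32+32+32 __m128.

I do something like this: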

__m128d v_result;
__m128 v_result_float;

...

// some operations on v_result

// store the first two "slot" on float
v_result_float = _mm_cvtpd_ps(v_result);

// some operations on v_result
// I need to store the last two "slot" on float
v_result_float = _mm_cvtpd_ps(v_result); ?!?

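But (obviously) it overwrites the first 2 float "slots" every time.

How can I "offset" _mm_cvtpd_ps so that the second call writes its values into the 3rd and 4th "slots"?

Here is the full code: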

__m128d v_pA;
__m128d v_pB;
__m128d v_result;
__m128 v_result_float;

float *pCEnd = pTest + roundintup8(blockSize);
for (; pTest < pCEnd; pA += 8, pB += 8, pTest += 8) {
    v_pA = _mm_load_pd(pA);
    v_pB = _mm_load_pd(pB);
    v_result = _mm_add_pd(v_pA, v_pB);
    v_result = _mm_max_pd(v_boundLower, v_result);
    v_result = _mm_min_pd(v_boundUpper, v_result);
    v_result = _mm_mul_pd(v_rangeLn2per12, v_result);
    v_result = _mm_add_pd(v_minLn2per12, v_result);

    // two double processed: store in 1° and 2° float slot
    v_result_float = _mm_cvtpd_ps(v_result);

    v_pA = _mm_load_pd(pA + 2);
    v_pB = _mm_load_pd(pB + 2);
    v_result = _mm_add_pd(v_pA, v_pB);
    v_result = _mm_max_pd(v_boundLower, v_result);
    v_result = _mm_min_pd(v_boundUpper, v_result);
    v_result = _mm_mul_pd(v_rangeLn2per12, v_result);
    v_result = _mm_add_pd(v_minLn2per12, v_result);

    // another two double processed: store in 3° and 4° float slot
    v_result_float = _mm_cvtpd_ps(v_result); // fail
    v_result_float = someFunction(v_result_float);
    _mm_store_ps(pTest, v_result_float);

    v_pA = _mm_load_pd(pA + 4);
    v_pB = _mm_load_pd(pB + 4);
    v_result = _mm_add_pd(v_pA, v_pB);
    v_result = _mm_max_pd(v_boundLower, v_result);
    v_result = _mm_min_pd(v_boundUpper, v_result);
    v_result = _mm_mul_pd(v_rangeLn2per12, v_result);
    v_result = _mm_add_pd(v_minLn2per12, v_result);

    // two double processed: store in 1° and 2° float slot
    v_result_float = _mm_cvtpd_ps(v_result);

    v_pA = _mm_load_pd(pA + 6);
    v_pB = _mm_load_pd(pB + 6);
    v_result = _mm_add_pd(v_pA, v_pB);
    v_result = _mm_max_pd(v_boundLower, v_result);
    v_result = _mm_min_pd(v_boundUpper, v_result);
    v_result = _mm_mul_pd(v_rangeLn2per12, v_result);
    v_result = _mm_add_pd(v_minLn2per12, v_result);

    // another two double processed: store in 3° and 4° float slot
    v_result_float = _mm_cvtpd_ps(v_result); // fail
    v_result_float = someFunction(v_result_float);
    _mm_store_ps(pTest + 4, v_result_float);
}

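Comments:

v_result_float = _mm_movelh_ps(_mm_cvtpd_ps(v_result1), _mm_cvtpd_ps(v_result2));

@chtz That would introduce a "new" register. What if I only have v_result?

Are you sure you are short of registers at the point where you want to join them? _mm_cvtpd_ps can happen in place if the double values are not needed afterwards (and the compiler will do that if it makes sense).

True ;) Post it as an answer if you want it accepted ;)

Answer 1: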

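You need to use movlhps (_mm_movelh_ps) to move the low 64 bits (two floats) of the second conversion into the high 64 bits of the first conversion's result. Simplified example: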

#include <immintrin.h>

__m128d some_double_operation(__m128d);
__m128 some_float_operation(__m128);

void foo(double const* input, float* output, int size)
{
    // assuming everything is already nicely aligned ...
    for(int i=0; i<size; i+=4, input+=4, output+=4)
    {
        __m128d res_lo = some_double_operation(_mm_load_pd(input));
        __m128d res_hi = some_double_operation(_mm_load_pd(input+2));
        __m128 res_float = _mm_movelh_ps(_mm_cvtpd_ps(res_lo), _mm_cvtpd_ps(res_hi));
        __m128 res_final = some_float_operation(res_float);
        _mm_store_ps(output, res_final);
    }
}

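Godbolt demo: https://godbolt.org/z/wgKjxN.

If some_double_operation can be inlined, the compiler can keep the result of the first double operation in a register that the second call does not use, so nothing needs to be stored to memory.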
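To tie this back to the question's loop, here is a minimal sketch under a few assumptions. The helper clamp_scale and the function add_clamp_to_float are hypothetical names, not from the original post. The bounds and scale factors are passed as plain doubles, unaligned loads and stores are used so the sketch does not depend on alignment, n is assumed to be a multiple of 4, and the question's someFunction step on the packed floats is left out. Note that a single v_result variable is reused for both halves; only the converted float halves are kept until the shuffle.

```c
#include <immintrin.h>
#include <stddef.h>

/* Hypothetical helper mirroring the question's double math:
   clamp (a + b) to [lo, hi], then multiply by scale and add offset. */
static inline __m128d clamp_scale(__m128d a, __m128d b,
                                  __m128d lo, __m128d hi,
                                  __m128d scale, __m128d offset)
{
    __m128d r = _mm_add_pd(a, b);
    r = _mm_max_pd(lo, r);
    r = _mm_min_pd(hi, r);
    r = _mm_mul_pd(scale, r);
    return _mm_add_pd(offset, r);
}

/* Processes n doubles (n assumed to be a multiple of 4) from pA and pB
   and writes n floats to out. */
void add_clamp_to_float(const double *pA, const double *pB, float *out,
                        size_t n, double lo, double hi,
                        double scale, double offset)
{
    const __m128d v_lo = _mm_set1_pd(lo);
    const __m128d v_hi = _mm_set1_pd(hi);
    const __m128d v_scale = _mm_set1_pd(scale);
    const __m128d v_offset = _mm_set1_pd(offset);

    for (size_t i = 0; i < n; i += 4) {
        /* First two doubles: converted floats land in lanes 0 and 1,
           lanes 2 and 3 of the conversion result are zeroed. */
        __m128d v_result = clamp_scale(_mm_loadu_pd(pA + i),
                                       _mm_loadu_pd(pB + i),
                                       v_lo, v_hi, v_scale, v_offset);
        __m128 lo_f = _mm_cvtpd_ps(v_result);

        /* Next two doubles: v_result is reused, only lo_f is kept. */
        v_result = clamp_scale(_mm_loadu_pd(pA + i + 2),
                               _mm_loadu_pd(pB + i + 2),
                               v_lo, v_hi, v_scale, v_offset);
        __m128 hi_f = _mm_cvtpd_ps(v_result);

        /* movlhps: low half of hi_f becomes the high half of the result. */
        _mm_storeu_ps(out + i, _mm_movelh_ps(lo_f, hi_f));
    }
}
```

If the data is known to be 16-byte aligned, as in the question, _mm_load_pd and _mm_store_ps could be substituted for the unaligned variants.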

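Comments:

I'd probably write a __m128 packpd_ps(__m128d low, __m128d high) function that just wraps the 2x convert + shuffle, which would be exactly like integer pack operations such as _mm_packs_epi16 (including FP saturation to +-Infinity). And maybe also a __m128 load_convertpd_ps(const double *) wrapper. But yes, movlhps is the right instruction; it does the same shuffle as unpcklpd with shorter machine code.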
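The two wrappers suggested in that comment could look roughly like the sketch below. The names and signatures come from the comment, but the bodies are an assumption, including the choice of an unaligned load in load_convertpd_ps.

```c
#include <immintrin.h>

/* Convert two double vectors and pack them into one float vector:
   { (float)low[0], (float)low[1], (float)high[0], (float)high[1] } */
static inline __m128 packpd_ps(__m128d low, __m128d high)
{
    return _mm_movelh_ps(_mm_cvtpd_ps(low), _mm_cvtpd_ps(high));
}

/* Load four consecutive doubles (no alignment assumed here) and
   convert them to four floats. */
static inline __m128 load_convertpd_ps(const double *p)
{
    return packpd_ps(_mm_loadu_pd(p), _mm_loadu_pd(p + 2));
}
```

With such a wrapper, the conversion step in the answer's loop would reduce to a single packpd_ps(res_lo, res_hi) call.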
