Floating point to uchar conversion issue in SSE

Posted: 2014-10-16 14:03:46

Question:

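Hi, following up on my previous post, I got the comparison operations working in SSE. But after getting the output, I noticed that it is in floating point, whereas I expected uchar. For example, where I expect an output of 8, I get it in floating-point format as 8.0 (32-bit float). Once that value is reinterpreted as 1-byte unsigned values, the result is very different from 8. Please find below my original C code and the corresponding SSE code: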

C code:

unsigned char *destination_buff = (unsigned char *)malloc(sizeof(unsigned char)*height*width);
float *d1 = inputbuffer;
float *d2 = d1 + width;
float *d3 = d2 + width;

for(int i=1;i<height;i++)
{
    for(int j=1;j<width;j++)
    {
        int val = d2[j];
        int temp1 = 0x00FF;
        int temp2 = 0;
        if(val <= d1[j-1]) temp2 += 0x80;
        if(val <= d1[j])   temp2 += 0x40;
        if(val <= d1[j+1]) temp2 += 0x20;
        if(val <= d2[j-1]) temp2 += 0x10;
        if(val <= d2[j+1]) temp2 += 0x08;
        if(val <= d3[j-1]) temp2 += 0x04;
        if(val <= d3[j])   temp2 += 0x02;
        if(val <= d3[j+1]) temp2 ++;
        temp1 &= (~temp2);
        destination_buff[j-1] = temp1;
    }
    d1 += width;
    d2 += width;
    d3 += width;

    destination_buff += (width);
}

Here is my SSE code:

float *destination_buff = (float *)malloc(sizeof(float)*height*width);

uchar *dst_d = outputbuffer; //Pointer to the destination buffer which is already present and need to fill the output data in this
float *CT_image_0 = m_dat;
float *CT_image_1 = CT_image_0 + width;
float *CT_image_2 = CT_image_1 + width;

for(int i=1;i<height;++i)
{
    for(int j=1;j<width;j+=4)
    {
        __m128 CT_current_00 = _mm_loadu_ps((CT_image_0+j-1));
        __m128 CT_current_10 = _mm_loadu_ps((CT_image_1+j-1));
        __m128 CT_current_20 = _mm_loadu_ps((CT_image_2+j-1));

        __m128 CT_current_01 = _mm_loadu_ps(((CT_image_0+1)+j-1));
        __m128 CT_current_11 = _mm_loadu_ps(((CT_image_1+1)+j-1));
        __m128 CT_current_21 = _mm_loadu_ps(((CT_image_2+1)+j-1));

        __m128 CT_current_02 = _mm_loadu_ps(((CT_image_0+2)+j-1));
        __m128 CT_current_12 = _mm_loadu_ps(((CT_image_1+2)+j-1));
        __m128 CT_current_22 = _mm_loadu_ps(((CT_image_2+2)+j-1));

        __m128 val = CT_current_11;

        __m128 t1 = _mm_set1_ps(0x80);
        __m128 t2 = _mm_set1_ps(0x40);
        __m128 t3 = _mm_set1_ps(0x20);
        __m128 t4 = _mm_set1_ps(0x10);
        __m128 t5 = _mm_set1_ps(0x08);
        __m128 t6 = _mm_set1_ps(0x04);
        __m128 t7 = _mm_set1_ps(0x02);
        __m128 t8 = _mm_set1_ps(0x01);

        __m128 out = _mm_setzero_ps();                 // init output flags to all zeroes

        __m128 sample = _mm_cmple_ps(val,CT_current_00);
        sample = _mm_and_ps(sample,t1);
        out    = _mm_or_ps(out,sample);
        sample = _mm_cmple_ps(val,CT_current_01);
        sample = _mm_and_ps(sample,t2);
        out    = _mm_or_ps(out,sample);
        sample = _mm_cmple_ps(val,CT_current_02);
        sample = _mm_and_ps(sample,t3);
        out    = _mm_or_ps(out,sample);

        sample = _mm_cmple_ps(val,CT_current_10);
        sample = _mm_and_ps(sample,t4);
        out    = _mm_or_ps(out,sample);
        sample = _mm_cmple_ps(val,CT_current_12);
        sample = _mm_and_ps(sample,t5);
        out    = _mm_or_ps(out,sample);

        sample = _mm_cmple_ps(val,CT_current_20);
        sample = _mm_and_ps(sample,t6);
        out    = _mm_or_ps(out,sample);
        sample = _mm_cmple_ps(val,CT_current_21);
        sample = _mm_and_ps(sample,t7);
        out    = _mm_or_ps(out,sample);
        sample = _mm_cmple_ps(val,CT_current_22);
        sample = _mm_and_ps(sample,t8);
        out    = _mm_or_ps(out,sample);

        _mm_storeu_ps((destination_buff+(j-1)),out);
        dst_d =  (uchar *)destination_buff;
    }

    CT_image_0  += width;
    CT_image_1  += width;
    CT_image_2  += width;

    dst_d += (width);
}

All of the store operations work on float or __m128i. How can I store the result as uchar?
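To illustrate why the bytes of 8.0 look nothing like 8, here is a minimal sketch contrasting reinterpretation with numeric conversion. It assumes float is a 32-bit IEEE-754 type; the helper names float_bits and float_to_u8 are hypothetical and not part of the original post.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret: return the raw bits of a float, with no numeric conversion.
   Assumes float is 32-bit IEEE-754. */
static uint32_t float_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

/* Convert: numeric conversion of a float value to an unsigned byte
   (only meaningful for values in the range 0..255). */
static uint8_t float_to_u8(float f)
{
    return (uint8_t)f;
}
```

Under that assumption, float_bits(8.0f) is 0x41000000, so casting a float pointer to uchar* exposes bytes such as 0x00 and 0x41 rather than 8, whereas float_to_u8(8.0f) yields 8.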

Comments:

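Each iteration produces 4 results, so what you need to do is, every 4 iterations, pack the 4 x 4 results into a single 16 x 8-bit vector and store it with _mm_storeu_si128. Alternatively, just extract 4 bytes from each float vector on every iteration and store those bytes with scalar code.

Is it possible to shift the "out" variable directly to extract 4 bytes from it? The four bytes of "out" are not exactly the same as the expected output, since it is a floating-point variable.

Either of the two methods I mentioned in my comment above should be fine. Unfortunately I don't have time to write a full answer right now, but I'll take another look tomorrow if you're still stuck.

Please fix the code. You have two lines, (1) if(val <= d1[j+1]) temp2 += 0x20; and (2) if(val <= d2[j-1]) temp2 += 0x10;, that close the for loops, leaving the later closing braces to close some other construct that is not shown.

You don't need to cast the result of malloc in C.

Answer 1: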

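You can do a packed compare to get a mask, and then use that mask with integer operations. _mm_set1_ps(0x80) suggests you are doing something odd. You probably shouldn't convert the power-of-two bit masks to floating point, because combining them with _mm_add_ps is much slower than combining them with _mm_or_si128.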
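A minimal sketch of this idea, combined with the packing step suggested in the comments, might look as follows. It is not the answerer's code: the helper names census4_epi32 and census16_u8 are hypothetical, the comparison is done on float values as in the question's SSE code (the scalar version truncates the centre value to int first), and it assumes SSE2 and that 18 floats are readable from each row pointer.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Hypothetical helper: 8-neighbour flags for 4 adjacent centre pixels,
   returned as four 32-bit integers (one per pixel, each in 0..255).
   r0, r1, r2 point at the column left of the first centre pixel in three
   consecutive rows; 6 floats must be readable from each pointer. */
static __m128i census4_epi32(const float *r0, const float *r1, const float *r2)
{
    const float *nb[8] = { r0, r0 + 1, r0 + 2, r1, r1 + 2, r2, r2 + 1, r2 + 2 };
    const __m128 val = _mm_loadu_ps(r1 + 1);
    __m128i flags = _mm_setzero_si128();

    for (int k = 0; k < 8; ++k) {
        /* Lanes become all-ones where val <= neighbour. The cast only
           reinterprets the bits; it does not convert float to integer. */
        __m128i mask = _mm_castps_si128(_mm_cmple_ps(val, _mm_loadu_ps(nb[k])));
        /* Keep one bit of the mask: 0x80, 0x40, ..., 0x01. */
        flags = _mm_or_si128(flags, _mm_and_si128(mask, _mm_set1_epi32(0x80 >> k)));
    }
    /* Same as the scalar temp1 = 0xFF & ~temp2. */
    return _mm_xor_si128(flags, _mm_set1_epi32(0xFF));
}

/* Hypothetical helper: 16 centre pixels in, 16 bytes out, one store.
   18 floats must be readable from each row pointer. */
static void census16_u8(const float *r0, const float *r1, const float *r2,
                        uint8_t *dst)
{
    __m128i c0 = census4_epi32(r0,      r1,      r2);
    __m128i c1 = census4_epi32(r0 + 4,  r1 + 4,  r2 + 4);
    __m128i c2 = census4_epi32(r0 + 8,  r1 + 8,  r2 + 8);
    __m128i c3 = census4_epi32(r0 + 12, r1 + 12, r2 + 12);

    /* 4 x (4 x int32) -> 2 x (8 x int16) -> 16 x uint8, order preserved. */
    __m128i w0 = _mm_packs_epi32(c0, c1);
    __m128i w1 = _mm_packs_epi32(c2, c3);
    _mm_storeu_si128((__m128i *)dst, _mm_packus_epi16(w0, w1));
}
```

In the question's loop this would presumably be called as census16_u8(CT_image_0 + j - 1, CT_image_1 + j - 1, CT_image_2 + j - 1, dst_d + j - 1) with j advancing by 16, leaving any remaining columns to scalar code. Note that OR-ing float bit patterns, as the question does, does not combine values: the bits of 128.0f OR-ed with the bits of 64.0f are the bits of 256.0f, not 192.0f.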

For some of the offset loads, you may also be better off using palignr, to balance the code between the load ports and the ALU ports.
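As a rough sketch of what that could look like, the block below derives the +1 and +2 offset vectors from two loads using _mm_alignr_epi8, the intrinsic for palignr. The helper name offset_loads is hypothetical. It assumes GCC or Clang (the target attribute enables SSSE3 for this one function; alternatively, compile with -mssse3 and drop the attribute) and that 8 floats are readable from p. Whether it is actually faster than three unaligned loads depends on the CPU and should be measured.

```c
#include <tmmintrin.h>  /* SSSE3: _mm_alignr_epi8 (palignr) */

/* Hypothetical helper: produce p[0..3], p[1..4] and p[2..5] from two loads
   instead of three. 8 floats must be readable from p.
   The target attribute is GCC/Clang specific. */
__attribute__((target("ssse3")))
static void offset_loads(const float *p, __m128 *off0, __m128 *off1, __m128 *off2)
{
    __m128i lo = _mm_castps_si128(_mm_loadu_ps(p));      /* p[0..3] */
    __m128i hi = _mm_castps_si128(_mm_loadu_ps(p + 4));  /* p[4..7] */

    *off0 = _mm_castsi128_ps(lo);
    /* Concatenate hi:lo and shift right by 4 bytes (one float). */
    *off1 = _mm_castsi128_ps(_mm_alignr_epi8(hi, lo, 4));
    /* Same, shifted right by 8 bytes (two floats). */
    *off2 = _mm_castsi128_ps(_mm_alignr_epi8(hi, lo, 8));
}
```

With aligned row pointers, the two loads could become _mm_load_ps, and the hi vector of one iteration can be reused as the lo vector of the next.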
