How do I reduce the precision of a double in C?

Posted 2021-04-20


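I am trying to reduce the precision of a double variable in C to test the effect on my results. I tried a bitwise &, but it gives an error.

How can I do this on float and double variables?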

Answer

If you want to apply a bitwise AND (&), you need to apply it to the integer representation of the float value:

#include <stdio.h>
#include <string.h>

int main(void) {
  float f = 0.1f;
  printf("Before: %a %.16e\n", f, f);
  unsigned int i;
  _Static_assert(sizeof f == sizeof i, "pick integer type of the correct size");
  memcpy(&i, &f, sizeof i);  // copy the bits of f into an integer
  i &= ~0x3U; // or any other mask.
              // This one assumes the endianness of floats is identical to integers'
  memcpy(&f, &i, sizeof f);  // copy the masked bits back into f
  printf("After: %a %.16e\n", f, f);
  return 0;
}

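Note that this does not give you the numbers of a narrower IEEE 754-like format (30 bits wide for the mask above). The value in f was first rounded to a 32-bit single-precision number and then brutally truncated.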
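The question also asks about double. The same masking idea can be sketched for a 64-bit double as follows. The helper name truncate_low_bits and its bits parameter are made up for this illustration, and the sketch assumes that double is a 64-bit IEEE 754 binary64 value with the same endianness as uint64_t.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not from the original answer): clear the `bits`
   least significant bits of the significand of d, truncating toward zero.
   Assumes 0 <= bits <= 52 and a 64-bit IEEE 754 double.
   Caveat: clearing all 52 fraction bits of a NaN turns it into an infinity. */
double truncate_low_bits(double d, unsigned bits)
{
    uint64_t u;
    _Static_assert(sizeof d == sizeof u, "double must be 64 bits wide");
    memcpy(&u, &d, sizeof u);              /* bits of d as an integer */
    u &= ~((UINT64_C(1) << bits) - 1);     /* zero the low `bits` bits */
    memcpy(&d, &u, sizeof d);              /* back to a double */
    return d;
}
```

For example, truncate_low_bits(0.1, 29) leaves 24 significant bits, the width of a float significand, but truncated rather than rounded.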

A more elegant method relies on a floating-point constant with two bits set:

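#include <stdio.h>

int main(void) {
  float f = 0.1f;
  float factor = 5.0f; // 2^2 + 1; or 3, or 9, or 17
  float c = factor * f;
  f = c - (c - f);     // the low bits of f are rounded away here
  printf("After: %a %.16e\n", f, f);
  return 0;
}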

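The advantage of this method is that it rounds f to the nearest value with an N-bit significand, instead of truncating it toward zero as the first method does. A factor of 2^k + 1 leaves N = 24 - k bits, so the factor 5 above leaves 22. However, the program is still computing with 32-bit IEEE 754 floating-point and then rounding to fewer bits, so the results are still not equivalent to those a narrower floating-point type would produce.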

The second method relies on an idea of Dekker's, described online in this article.
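Since the question is about double, here is a minimal sketch of the same multiply-and-subtract trick for a 53-bit significand. The function name round_significand and the parameter s are invented for this example. It assumes round-to-nearest mode, strict IEEE 754 double arithmetic (no -ffast-math), and a value small enough that the multiplication does not overflow.

```c
#include <math.h>

/* Hypothetical helper (not from the original answer): round x to the
   nearest double whose significand fits in (53 - s) bits.
   Assumes 0 < s < 52, round-to-nearest, and that (2^s + 1) * x does not
   overflow. The volatile temporaries force each intermediate to be stored
   as a double, which guards against excess precision and reassociation. */
double round_significand(double x, int s)
{
    double factor = ldexp(1.0, s) + 1.0;   /* 2^s + 1: two bits set */
    volatile double c = factor * x;
    volatile double t = c - x;
    return c - t;
}
```

With s = 29 the result keeps 24 significant bits, so round_significand(0.1, 29) should match 0.1 rounded to float precision, although it is still stored as a double with a double's exponent range.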

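Another answer

How do I reduce the precision of a double in C?

To reduce the relative precision of a floating-point number so that some of the least significant bits of its significand (mantissa) are zero, the code needs access to the significand.

Use frexp() to extract the significand and exponent of the floating-point number.

Use ldexp() to scale the significand, then round, truncate, or floor it, depending on the coding goal, to remove precision. Truncation is shown, but I recommend rounding via rint().

Scale back and restore the exponent.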

#include <math.h>
#include <stdio.h>

double reduce(double x, int precision_power_2) {
  if (isfinite(x)) {
    int power_2;

    // The frexp functions break a floating-point number into a 
    // normalized fraction and an integral power of 2.
    double normalized_fraction = frexp(x, &power_2);  // 0.5 <= result < 1.0 or 0

    // The ldexp functions multiply a floating-point number by an integral power of 2
    double less_precise = trunc(ldexp(normalized_fraction, precision_power_2));
    x = ldexp(less_precise, power_2 - precision_power_2);
  }
  return x;
}

void testr(double x, int pow2) {
  printf("reduce(%a, %d) --> %a\n", x, pow2, reduce(x, pow2));
}

int main(void) {
  testr(0.1, 5);
  return 0;
}

Output

//       v-53 bin.digs-v              v-v 5 significant binary digits  
reduce(0x1.999999999999ap-4, 5) --> 0x1.9p-4

For float, use frexpf(), ldexpf(), rintf(), truncf(), floorf(), etc.
