为 ARM Thumb2 寻找有效的整数平方根算法

Posted 2023-02-14

技术标签:

【中文标题】为 ARM Thumb2 寻找有效的整数平方根算法【英文标题】：Looking for an efficient integer square root algorithm for ARM Thumb2 【发布时间】：2010-11-09 03:54:27 【问题描述】：

我正在寻找一种快速、仅整数的算法来查找无符号整数的平方根（其整数部分）。代码必须在 ARM Thumb 2 处理器上具有出色的性能。它可以是汇编语言或 C 代码。

欢迎任何提示。

【问题讨论】：

【参考方案1】：

Jack W. Crenshaw 的Integer Square Roots 可以作为另一个参考。

C Snippets Archive 也有一个integer square root implementation。这不仅仅是整数结果，还计算答案的额外小数（定点）位。（更新：不幸的是，C sn-ps 存档现已失效。链接指向页面的 Web 存档。）这是来自 C Snippets 存档的代码：

#define BITSPERLONG 32
#define TOP2BITS(x) ((x & (3L << (BITSPERLONG-2))) >> (BITSPERLONG-2))

struct int_sqrt 
    unsigned sqrt, frac;
;

/* usqrt:
    ENTRY x: unsigned long
    EXIT  returns floor(sqrt(x) * pow(2, BITSPERLONG/2))

    Since the square root never uses more than half the bits
    of the input, we use the other half of the bits to contain
    extra bits of precision after the binary point.

    EXAMPLE
        suppose BITSPERLONG = 32
        then    usqrt(144) = 786432 = 12 * 65536
                usqrt(32) = 370727 = 5.66 * 65536

    NOTES
        (1) change BITSPERLONG to BITSPERLONG/2 if you do not want
            the answer scaled.  Indeed, if you want n bits of
            precision after the binary point, use BITSPERLONG/2+n.
            The code assumes that BITSPERLONG is even.
        (2) This is really better off being written in assembly.
            The line marked below is really a "arithmetic shift left"
            on the double-long value with r in the upper half
            and x in the lower half.  This operation is typically
            expressible in only one or two assembly instructions.
        (3) Unrolling this loop is probably not a bad idea.

    ALGORITHM
        The calculations are the base-two analogue of the square
        root algorithm we all learned in grammar school.  Since we're
        in base 2, there is only one nontrivial trial multiplier.

        Notice that absolutely no multiplications or divisions are performed.
        This means it'll be fast on a wide range of processors.
*/

void usqrt(unsigned long x, struct int_sqrt *q)

      unsigned long a = 0L;                   /* accumulator      */
      unsigned long r = 0L;                   /* remainder        */
      unsigned long e = 0L;                   /* trial product    */

      int i;

      for (i = 0; i < BITSPERLONG; i++)   /* NOTE 1 */
      
            r = (r << 2) + TOP2BITS(x); x <<= 2; /* NOTE 2 */
            a <<= 1;
            e = (a << 1) + 1;
            if (r >= e)
            
                  r -= e;
                  a++;
            
      
      memcpy(q, &a, sizeof(long));

我选择了以下代码。它本质上来自Wikipedia article on square-root computing methods。但是已经改成使用stdint.h类型uint32_t等。严格来说，返回类型可以改成uint16_t。

/**
 * \brief    Fast Square root algorithm
 *
 * Fractional parts of the answer are discarded. That is:
 *      - SquareRoot(3) --> 1
 *      - SquareRoot(4) --> 2
 *      - SquareRoot(5) --> 2
 *      - SquareRoot(8) --> 2
 *      - SquareRoot(9) --> 3
 *
 * \param[in] a_nInput - unsigned integer for which to find the square root
 *
 * \return Integer square root of the input value.
 */
uint32_t SquareRoot(uint32_t a_nInput)

    uint32_t op  = a_nInput;
    uint32_t res = 0;
    uint32_t one = 1uL << 30; // The second-to-top bit is set: use 1u << 14 for uint16_t type; use 1uL<<30 for uint32_t type


    // "one" starts at the highest power of four <= than the argument.
    while (one > op)
    
        one >>= 2;
    

    while (one != 0)
    
        if (op >= res + one)
        
            op = op - (res + one);
            res = res +  2 * one;
        
        res >>= 1;
        one >>= 2;
    
    return res;

我发现，好的一点是，一个相当简单的修改可以返回“四舍五入”的答案。我发现这在某些应用程序中很有用，可以提高准确性。请注意，在这种情况下，返回类型必须是 uint32_t，因为 2³² - 1 的四舍五入平方根是 2¹⁶。

/**
 * \brief    Fast Square root algorithm, with rounding
 *
 * This does arithmetic rounding of the result. That is, if the real answer
 * would have a fractional part of 0.5 or greater, the result is rounded up to
 * the next integer.
 *      - SquareRootRounded(2) --> 1
 *      - SquareRootRounded(3) --> 2
 *      - SquareRootRounded(4) --> 2
 *      - SquareRootRounded(6) --> 2
 *      - SquareRootRounded(7) --> 3
 *      - SquareRootRounded(8) --> 3
 *      - SquareRootRounded(9) --> 3
 *
 * \param[in] a_nInput - unsigned integer for which to find the square root
 *
 * \return Integer square root of the input value.
 */
uint32_t SquareRootRounded(uint32_t a_nInput)

    uint32_t op  = a_nInput;
    uint32_t res = 0;
    uint32_t one = 1uL << 30; // The second-to-top bit is set: use 1u << 14 for uint16_t type; use 1uL<<30 for uint32_t type


    // "one" starts at the highest power of four <= than the argument.
    while (one > op)
    
        one >>= 2;
    

    while (one != 0)
    
        if (op >= res + one)
        
            op = op - (res + one);
            res = res +  2 * one;
        
        res >>= 1;
        one >>= 2;
    

    /* Do arithmetic rounding to nearest integer */
    if (op > res)
    
        res++;
    

    return res;

【讨论】：

出于好奇，我将它的 64 位转换与 C 库 sqrt 函数的 static_casting 进行了基准测试，以获得整数结果，我发现它慢了 8.2 倍。 YMMV。更多数据onemanmmo.com/?sqrt @RobertBasler：很高兴你已经测量过了。这种事情是非常特定于硬件的。在您的情况下（在具有浮点硬件的处理器上），当然值得进行比较。我希望这些整数平方根算法对于没有浮点硬件的嵌入式系统会更有用。 IEEE 双精度浮点可以精确地表示高达 ~53 位的整数（尾数的大小），但除此之外，结果不准确。整数 sqrt 的一个优点是它总是给出准确的答案。对于 Cortex M3 和兄弟，第一个循环可以用前导零计数和掩码操作代替：一个 >>= clz(op) & ~0x3;结束大约 30 个循环。【参考方案2】：

如果不需要精确的准确性，我有一个快速的近似值，它使用 260 字节的 RAM（你可以减半，但不要）。

int ftbl[33]=0,1,1,2,2,4,5,8,11,16,22,32,45,64,90,128,181,256,362,512,724,1024,1448,2048,2896,4096,5792,8192,11585,16384,23170,32768,46340;
int ftbl2[32]= 32768,33276,33776,34269,34755,35235,35708,36174,36635,37090,37540,37984,38423,38858,39287,39712,40132,40548,40960,41367,41771,42170,42566,42959,43347,43733,44115,44493,44869,45241,45611,45977;

int fisqrt(int val)

    int cnt=0;
    int t=val;
    while (t) cnt++;t>>=1;
    if (6>=cnt)    t=(val<<(6-cnt));
    else           t=(val>>(cnt-6));

    return (ftbl[cnt]*ftbl2[t&31])>>15;

这是生成表格的代码：

ftbl[0]=0;
for (int i=0;i<32;i++) ftbl[i+1]=sqrt(pow(2.0,i));
printf("int ftbl[33]=0");
for (int i=0;i<32;i++) printf(",%d",ftbl[i+1]);
printf(";\n");

for (int i=0;i<32;i++) ftbl2[i]=sqrt(1.0+i/32.0)*32768;
printf("int ftbl2[32]=");
for (int i=0;i<32;i++) printf("%c%d",(i)?',':' ',ftbl2[i]);
printf(";\n");

在 1 → 2²⁰ 范围内，最大误差为 11，在 1 → 2³⁰ 范围内，最大误差约为 256。您可以使用更大的表和尽量减少这一点。值得一提的是，错误总是负数 - 即当错误时，该值将小于正确值。

您最好在精炼阶段遵循这一点。

这个想法很简单：(ab)^0.5 = a^0.b × b^0.5。

所以，我们取输入 X = A×B 其中 A = 2^N 和 1 ≤ B

然后我们有一个 sqrt(2^N) 的查找表和一个 sqrt(1 ≤ B N) 的查找表存储为整数，这可能是一个错误（测试显示没有不良影响），我们将 sqrt(1 ≤ B

我们知道 1 ≤ sqrt(2^N)

在实现方面，while(t) cnt++;t>>=1; 实际上是一条计数前导位指令 (CLB)，因此，如果您的芯片组版本具有该指令，那么您就赢了！另外，如果你有一个双向移位器，移位指令会很容易实现吗？

有一个 Lg[N] 算法用于计算最高设置位 here.

就幻数而言，对于更改表大小，ftbl2 的幻数是 32，但请注意 6 (Lg[32]+1) 用于移位。

【讨论】：

FWIW，虽然我真的不推荐这个，但你可以将你的整体错误四分之一，有一些偏差，即：int v1=fisqrt(val); v1+=fisqrt(val-v1*v1)/16； 16 是 2 的幂，在 1->2^24 范围内效果最好。【参考方案3】：

一种常见的方法是二分法。

hi = number
lo = 0
mid = ( hi + lo ) / 2
mid2 = mid*mid
while( lo < hi-1 and mid2 != number ) 
    if( mid2 < number ) 
        lo = mid
    else
        hi = mid
    mid = ( hi + lo ) / 2
    mid2 = mid*mid

类似的东西应该可以很好地工作。它进行 log2(number) 测试，做 log2(number) 乘除。由于除以 2，您可以将其替换为 >>。

终止条件可能不准确，所以一定要测试各种整数，以确保除以 2 不会在两个偶数之间错误地振荡；它们会相差超过 1。

【讨论】：

【参考方案4】：

速度不快，但又小又简单：

int isqrt(int n)

  int b = 0;

  while(n >= 0)
  
    n = n - b;
    b = b + 1;
    n = n - b;
  

  return b - 1;

【讨论】：

这是否使用整数溢出？【参考方案5】：

我发现大多数算法都基于简单的想法，但以一种比必要的更复杂的方式实现。我从这里得到了这个想法：http://ww1.microchip.com/downloads/en/AppNotes/91040a.pdf（Ross M. Fosler）并将它变成了一个非常短的 C 函数：

uint16_t int_sqrt32(uint32_t x)

    uint16_t res=0;
    uint16_t add= 0x8000;   
    int i;
    for(i=0;i<16;i++)
    
        uint16_t temp=res | add;
        uint32_t g2=temp*temp;      
        if (x>=g2)
        
            res=temp;           
        
        add>>=1;
    
    return res;

这在我的 blackfin 上编译为 5 个周期/位。我相信如果您使用 for 循环而不是 while 循环，您的编译代码通常会更快，并且您可以获得确定性时间的额外好处（尽管这在某种程度上取决于您的编译器如何优化 if 语句。）

【讨论】：

对不起，这应该是输出的 5 个周期/位，是输入的一半。所以 2.5 个周期/位的输入。这里有一个小错误。在表达式“temp*temp”中，您需要将任一操作数转换为 uint32_t 以确保乘法是在 32 位算术而不是 16 位中完成的。因此，代码原样在 AVR 上不起作用（但它似乎在 int 为 32 位的平台上，由于默认提升，但它仍然可能导致整数溢出）。另一件事：“uint16_t add= 0x8000;”应改为“uint16_t add= UINT16_C(0x8000);”。我没有对它进行基准测试，但根据@AmbrozBizjak 的建议得出了正确的结果，谢谢你们！嗯……进一步优化：使用do … while(add)而不是for循环，因为右移已经设置了条件寄存器，这应该保存三个指令（其中两个在循环中）。理论上。实际上，这只适用于clang -Os，其他优化模式设法使代码悲观。 GCC10 优化错误更糟糕，我已经提交了一个错误。【参考方案6】：

这取决于 sqrt 函数的用法。我经常使用一些近似来制作快速版本。例如，当我需要计算向量的模块时：

Module = SQRT( x^2 + y^2)

我用：

Module = MAX( x,y) + Min(x,y)/2

可以用 3 或 4 条指令编码为：

If (x > y )
  Module  = x + y >> 1;
Else
   Module  = y + x >> 1;

【讨论】：

需要注意的是，这是alpha max plus beta min算法，使用alpha = 1和beta = 1/2。 en.wikipedia.org/wiki/Alpha_max_plus_beta_min_algorithm【参考方案7】：

我已经习惯了类似于this Wikipedia article 中描述的二进制逐位算法。

【讨论】：

【参考方案8】：

我最近在 ARM Cortex-M3 (STM32F103CBT6) 上遇到了同样的任务，在网上搜索后想出了以下解决方案。与这里提供的解决方案相比，它不是最快的，但它具有良好的精度（最大误差为 1，即整个 UI32 输入范围上的 LSB）和相对较好的速度（在 72-MHz ARM Cortex-上约为每秒 1.3M 平方根- M3 或每个单根大约 55 个周期，包括函数调用）。

// FastIntSqrt is based on Wikipedia article:
// https://en.wikipedia.org/wiki/Methods_of_computing_square_roots
// Which involves Newton's method which gives the following iterative formula:
//
// X(n+1) = (X(n) + S/X(n))/2
//
// Thanks to ARM CLZ instruction (which counts how many bits in a number are
// zeros starting from the most significant one) we can very successfully
// choose the starting value, so just three iterations are enough to achieve
// maximum possible error of 1. The algorithm uses division, but fortunately
// it is fast enough here, so square root computation takes only about 50-55
// cycles with maximum compiler optimization.
uint32_t FastIntSqrt (uint32_t value)

    if (!value)
        return 0;

    uint32_t xn = 1 << ((32 - __CLZ (value))/2);
    xn = (xn + value/xn)/2;
    xn = (xn + value/xn)/2;
    xn = (xn + value/xn)/2;
    return xn;

我正在使用 IAR，它会生成以下汇编代码：

        SECTION `.text`:CODE:NOROOT(1)
        THUMB
_Z11FastIntSqrtj:
        MOVS     R1,R0
        BNE.N    ??FastIntSqrt_0
        MOVS     R0,#+0
        BX       LR
??FastIntSqrt_0:
        CLZ      R0,R1
        RSB      R0,R0,#+32
        MOVS     R2,#+1
        LSRS     R0,R0,#+1
        LSL      R0,R2,R0
        UDIV     R3,R1,R0
        ADDS     R0,R3,R0
        LSRS     R0,R0,#+1
        UDIV     R2,R1,R0
        ADDS     R0,R2,R0
        LSRS     R0,R0,#+1
        UDIV     R1,R1,R0
        ADDS     R0,R1,R0
        LSRS     R0,R0,#+1
        BX       LR               ;; return

【讨论】：

【参考方案9】：

这是一个结合整数 log_2 和牛顿方法来创建无循环算法的 Java 解决方案。不利的一面是，它需要划分。需要注释的行才能上转换为 64 位算法。

private static final int debruijn= 0x07C4ACDD;
//private static final long debruijn= ( ~0x0218A392CD3D5DBFL)>>>6;

static

  for(int x= 0; x<32; ++x)
  
    final long v= ~( -2L<<(x));
    DeBruijnArray[(int)((v*debruijn)>>>27)]= x; //>>>58
  
  for(int x= 0; x<32; ++x)
    SQRT[x]= (int) (Math.sqrt((1L<<DeBruijnArray[x])*Math.sqrt(2)));


public static int sqrt(final int num)

  int y;
  if(num==0)
    return num;
  
    int v= num;
    v|= v>>>1; // first round up to one less than a power of 2 
    v|= v>>>2;
    v|= v>>>4;
    v|= v>>>8;
    v|= v>>>16;
    //v|= v>>>32;
    y= SQRT[(v*debruijn)>>>27]; //>>>58
  
  //y= (y+num/y)>>>1;
  y= (y+num/y)>>>1;
  y= (y+num/y)>>>1;
  y= (y+num/y)>>>1;
  return y*y>num?y-1:y;

这是如何工作的：第一部分产生一个精确到大约 3 位的平方根。 y= (y+num/y)>>1; 行将位精度加倍。最后一行消除了可以生成的屋顶根部。

【讨论】：

我在这个页面上尝试了 3 个其他实现，这个是我用 C# 实现时最快的。 Dave Gamble 的实现排在第二位，比这个慢了大约 25%。我相信大多数基于循环的循环都很慢...... 是的，这可能是您在具有除法但没有 FPU 或扩展位操作指令的 CPU 上可以做到的最快速度。值得注意的是，在某些计算机上，该算法的 64 位版本可能比 IEEE 754 double 获得更好的精度。我无法完成这项工作（假设SQRT 和DeBruijnArray 都是int[32]，并为int 添加必要的转换以使其编译）。在第一个初始化循环期间，它似乎越界了。代码经过测试。问题是我是否正确复制了它。其中之一是 64 位版本中的 int[64]。【参考方案10】：

ARM 编码最巧妙的按位整数平方根实现实现了每个结果位 3 个周期，得出 32 位无符号整数平方根的下限为 50 个周期。 Andrew N. Sloss、Dominic Symes、Chris Wright、“ARM 系统开发人员指南”、Morgan Kaufman 2004 中显示了一个示例。

由于大多数 ARM 处理器还具有非常快的整数乘法器，并且大多数甚至提供了宽乘指令 UMULL 的非常快速的实现，因此可以实现大约 35 到 45 个周期的执行时间的另一种方法是通过计算倒数平方根 1/√x 使用定点计算。为此，有必要借助 count-leading-zeros 指令对输入进行规范化，该指令在大多数 ARM 处理器上可用作指令 CLZ。

计算从查找表中的初始低精度倒数平方根近似值开始，该查找表由标准化参数的一些最高有效位索引。用二次收敛来细化数字a 的倒数平方根r 的 Newton-Raphson 迭代是 r_n+1 = r_n + r_n* (1 - a * r_n²) / 2。这可以重新安排成代数等价形式，方便。在下面的示例 C99 代码中，从 96 项查找表中读取 8 位近似值 r₀。这个近似值精确到大约 7 位。第一次 Newton-Raphson 迭代计算 r₁ = (3 * r₀ - a * r₀³) /2 以潜在地利用小型操作数乘法指令。第二次 Newton-Raphson 迭代计算 r₂ = (r₁ * (3 - r₁ * (r₁ * a))) / 2.

然后通过反向乘法 s₂ = a * r₂ 计算归一化的平方根，并通过基于前导零计数的非归一化来实现最终近似原始论点a。重要的是，期望的结果⌊√a⌋被低估了。这通过保证余数 ⌊√a⌋ - s₂ * s₂ 为正数来简化检查是否已达到所需结果。如果发现最终的近似值太小，则将结果增加一。通过针对“黄金”参考对所有可能的 2³² 输入进行详尽的测试，可以很容易地证明该算法的正确操作，这只需要几分钟。

可以通过预先计算 3 * r₀ 和 r₀^{3 以简化第一次 Newton-Raphson 迭代。前者需要 10 位存储，后者需要 24 位。为了将每一对组合成一个 32 位数据项，立方体被四舍五入到 22 位，这在计算中引入了可忽略的误差。这会产生一个 96 * 4 = 384 字节的查找表。}

另一种方法是观察到所有起始近似值都具有相同的最高有效位集，因此可以隐式假设并且不必存储。这允许将 9 位近似值 r₀ 压缩到 8 位数据项中，并在查找表后恢复前导位。使用包含 384 个条目的查找表，都低估了，可以达到大约 7.5 位的精度。将反向乘法与倒数平方根的 Newton-Raphson 迭代相结合，计算出 s₀ = a * r₀, s₁ = s₀ + r₀ * (a - s₀ * s₀) / 2. 因为精度起始近似值对于非常精确的最终平方根近似值来说不够高，它最多可以偏离 3，并且必须根据余数下限 (sqrt (a)) - s₁ * s₁.

替代方法的一个优点是所需的乘法次数减半，特别是只需要一次宽乘法UMULL。尤其是宽乘相当慢的处理器，这是一个值得尝试的替代方案。

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <math.h>
#if defined(_MSC_VER) && defined(_WIN64)
#include <intrin.h>
#endif // defined(_MSC_VER) && defined(_WIN64)

#define CLZ_BUILTIN (1)      // use compiler's built-in count-leading-zeros
#define CLZ_FPU     (2)      // emulate count-leading-zeros via FPU
#define CLZ_CPU     (3)      // emulate count-leading-zeros via CPU

#define ALT_IMPL    (0)      // alternative implementation with fewer multiplies
#define LARGE_TABLE (0)      // ALT_IMPL=0: incorporate 1st NR-iter into table
#define CLZ_IMPL    (CLZ_CPU)// choose count-leading-zeros implementation
#define GEN_TAB     (0)      // generate tables

uint32_t umul32_hi (uint32_t a, uint32_t b);
uint32_t float_as_uint32 (float a);
int clz32 (uint32_t a);

#if ALT_IMPL
uint8_t rsqrt_tab [384] = 

    0xfe, 0xfc, 0xfa, 0xf8, 0xf6, 0xf4, 0xf2, 0xf0, 0xee, 0xed, 0xeb, 0xe9,
    0xe7, 0xe6, 0xe4, 0xe2, 0xe1, 0xdf, 0xdd, 0xdc, 0xda, 0xd8, 0xd7, 0xd5,
    0xd4, 0xd2, 0xd1, 0xcf, 0xce, 0xcc, 0xcb, 0xc9, 0xc8, 0xc7, 0xc5, 0xc4,
    0xc2, 0xc1, 0xc0, 0xbe, 0xbd, 0xbc, 0xba, 0xb9, 0xb8, 0xb7, 0xb5, 0xb4,
    0xb3, 0xb2, 0xb0, 0xaf, 0xae, 0xad, 0xac, 0xab, 0xa9, 0xa8, 0xa7, 0xa6,
    0xa5, 0xa4, 0xa3, 0xa2, 0xa0, 0x9f, 0x9e, 0x9d, 0x9c, 0x9b, 0x9a, 0x99,
    0x98, 0x97, 0x96, 0x95, 0x94, 0x93, 0x92, 0x91, 0x90, 0x8f, 0x8e, 0x8d,
    0x8c, 0x8b, 0x8b, 0x8a, 0x89, 0x88, 0x87, 0x86, 0x85, 0x84, 0x83, 0x83,
    0x82, 0x81, 0x80, 0x7f, 0x7e, 0x7d, 0x7d, 0x7c, 0x7b, 0x7a, 0x79, 0x79,
    0x78, 0x77, 0x76, 0x75, 0x75, 0x74, 0x73, 0x72, 0x72, 0x71, 0x70, 0x6f,
    0x6f, 0x6e, 0x6d, 0x6c, 0x6c, 0x6b, 0x6a, 0x6a, 0x69, 0x68, 0x67, 0x67,
    0x66, 0x65, 0x65, 0x64, 0x63, 0x63, 0x62, 0x61, 0x61, 0x60, 0x5f, 0x5f,
    0x5e, 0x5d, 0x5d, 0x5c, 0x5c, 0x5b, 0x5a, 0x5a, 0x59, 0x58, 0x58, 0x57,
    0x57, 0x56, 0x55, 0x55, 0x54, 0x54, 0x53, 0x52, 0x52, 0x51, 0x51, 0x50,
    0x50, 0x4f, 0x4e, 0x4e, 0x4d, 0x4d, 0x4c, 0x4c, 0x4b, 0x4b, 0x4a, 0x4a,
    0x49, 0x48, 0x48, 0x47, 0x47, 0x46, 0x46, 0x45, 0x45, 0x44, 0x44, 0x43,
    0x43, 0x42, 0x42, 0x41, 0x41, 0x40, 0x40, 0x3f, 0x3f, 0x3e, 0x3e, 0x3d,
    0x3d, 0x3c, 0x3c, 0x3c, 0x3b, 0x3b, 0x3a, 0x3a, 0x39, 0x39, 0x38, 0x38,
    0x37, 0x37, 0x36, 0x36, 0x36, 0x35, 0x35, 0x34, 0x34, 0x33, 0x33, 0x33,
    0x32, 0x32, 0x31, 0x31, 0x30, 0x30, 0x30, 0x2f, 0x2f, 0x2e, 0x2e, 0x2d,
    0x2d, 0x2d, 0x2c, 0x2c, 0x2b, 0x2b, 0x2b, 0x2a, 0x2a, 0x29, 0x29, 0x29,
    0x28, 0x28, 0x27, 0x27, 0x27, 0x26, 0x26, 0x26, 0x25, 0x25, 0x24, 0x24,
    0x24, 0x23, 0x23, 0x23, 0x22, 0x22, 0x21, 0x21, 0x21, 0x20, 0x20, 0x20,
    0x1f, 0x1f, 0x1f, 0x1e, 0x1e, 0x1e, 0x1d, 0x1d, 0x1d, 0x1c, 0x1c, 0x1c,
    0x1b, 0x1b, 0x1a, 0x1a, 0x1a, 0x19, 0x19, 0x19, 0x18, 0x18, 0x18, 0x17,
    0x17, 0x17, 0x17, 0x16, 0x16, 0x16, 0x15, 0x15, 0x15, 0x14, 0x14, 0x14,
    0x13, 0x13, 0x13, 0x12, 0x12, 0x12, 0x11, 0x11, 0x11, 0x11, 0x10, 0x10,
    0x10, 0x0f, 0x0f, 0x0f, 0x0e, 0x0e, 0x0e, 0x0e, 0x0d, 0x0d, 0x0d, 0x0c,
    0x0c, 0x0c, 0x0c, 0x0b, 0x0b, 0x0b, 0x0a, 0x0a, 0x0a, 0x0a, 0x09, 0x09,
    0x09, 0x08, 0x08, 0x08, 0x08, 0x07, 0x07, 0x07, 0x07, 0x06, 0x06, 0x06,
    0x05, 0x05, 0x05, 0x05, 0x04, 0x04, 0x04, 0x04, 0x03, 0x03, 0x03, 0x03,
    0x02, 0x02, 0x02, 0x02, 0x01, 0x01, 0x01, 0x01, 0x00, 0x00, 0x00, 0x00,
;

/* compute floor (sqrt (a)) */
uint32_t my_isqrt32 (uint32_t a)

    uint32_t b, r, s, scal, rem;

    if (a == 0) return a;
    /* Normalize argument */
    scal = clz32 (a) & ~1;
    b = a << scal;
    /* Compute initial approximation to 1/sqrt(a) */
    r = rsqrt_tab [(b >> 23) - 128] | 0x100;
    /* Compute initial approximation to sqrt(a) */
    s = umul32_hi (b, r << 8); 
    /* Refine sqrt approximation */
    b = b - s * s;
    s = s + ((r * (b >> 2)) >> 23);
    /* Denormalize result*/
    s = s >> (scal >> 1);
    /* Ensure we got the floor correct */
    rem = a - s * s;
    if      (rem < (2 * s + 1)) s = s + 0;
    else if (rem < (4 * s + 4)) s = s + 1;
    else if (rem < (6 * s + 9)) s = s + 2;
    else                        s = s + 3;
    return s;


#else // ALT_IMPL

#if LARGE_TABLE
uint32_t rsqrt_tab [96] = 

    0xfa0bfafa, 0xee6b2aee, 0xe5f02ae5, 0xdaf26ed9, 0xd2f002d0, 0xc890c2c4,
    0xc1037abb, 0xb9a75ab2, 0xb4da42ac, 0xadcea2a3, 0xa6f27a9a, 0xa279c294,
    0x9beb4a8b, 0x97a5ca85, 0x9163427c, 0x8d4fca76, 0x89500270, 0x8563ba6a,
    0x818ac264, 0x7dc4ea5e, 0x7a120258, 0x7671da52, 0x72e4424c, 0x6f690a46,
    0x6db24243, 0x6a52423d, 0x67042637, 0x6563c234, 0x62302a2e, 0x609cea2b,
    0x5d836a25, 0x5bfd1a22, 0x58fd421c, 0x5783ae19, 0x560e4a16, 0x53300210,
    0x51c7120d, 0x50623a0a, 0x4da4c204, 0x4c4c1601, 0x4af769fe, 0x49a6b9fb,
    0x485a01f8, 0x471139f5, 0x45cc59f2, 0x448b5def, 0x4214fde9, 0x40df89e6,
    0x3fade1e3, 0x3e8001e0, 0x3d55e1dd, 0x3c2f79da, 0x3c2f79da, 0x3b0cc5d7,
    0x39edc1d4, 0x38d265d1, 0x37baa9ce, 0x36a689cb, 0x359601c8, 0x348909c5,
    0x348909c5, 0x337f99c2, 0x3279adbf, 0x317741bc, 0x30784db9, 0x30784db9,
    0x2f7cc9b6, 0x2e84b1b3, 0x2d9001b0, 0x2d9001b0, 0x2c9eb1ad, 0x2bb0b9aa,
    0x2bb0b9aa, 0x2ac615a7, 0x29dec1a4, 0x29dec1a4, 0x28fab5a1, 0x2819e99e,
    0x2819e99e, 0x273c599b, 0x273c599b, 0x26620198, 0x258ad995, 0x258ad995,
    0x24b6d992, 0x24b6d992, 0x23e5fd8f, 0x2318418c, 0x2318418c, 0x224d9d89,
    0x224d9d89, 0x21860986, 0x21860986, 0x20c18183, 0x20c18183, 0x20000180,
;
#else // LARGE_TABLE
uint8_t rsqrt_tab [96] = 

    0xfe, 0xfa, 0xf7, 0xf3, 0xf0, 0xec, 0xe9, 0xe6, 0xe4, 0xe1, 0xde, 0xdc,
    0xd9, 0xd7, 0xd4, 0xd2, 0xd0, 0xce, 0xcc, 0xca, 0xc8, 0xc6, 0xc4, 0xc2,
    0xc1, 0xbf, 0xbd, 0xbc, 0xba, 0xb9, 0xb7, 0xb6, 0xb4, 0xb3, 0xb2, 0xb0,
    0xaf, 0xae, 0xac, 0xab, 0xaa, 0xa9, 0xa8, 0xa7, 0xa6, 0xa5, 0xa3, 0xa2,
    0xa1, 0xa0, 0x9f, 0x9e, 0x9e, 0x9d, 0x9c, 0x9b, 0x9a, 0x99, 0x98, 0x97,
    0x97, 0x96, 0x95, 0x94, 0x93, 0x93, 0x92, 0x91, 0x90, 0x90, 0x8f, 0x8e,
    0x8e, 0x8d, 0x8c, 0x8c, 0x8b, 0x8a, 0x8a, 0x89, 0x89, 0x88, 0x87, 0x87,
    0x86, 0x86, 0x85, 0x84, 0x84, 0x83, 0x83, 0x82, 0x82, 0x81, 0x81, 0x80,
;
#endif //LARGE_TABLE 

/* compute floor (sqrt (a)) */
uint32_t my_isqrt32 (uint32_t a)

    uint32_t b, r, s, t, scal, rem;

    if (a == 0) return a;
    /* Normalize argument */
    scal = clz32 (a) & ~1;
    b = a << scal;
    /* Initial approximation to 1/sqrt(a)*/
    t = rsqrt_tab [(b >> 25) - 32];
    /* First NR iteration */
#if LARGE_TABLE
    r = (t << 22) - umul32_hi (b, t);
#else // LARGE_TABLE
    r = ((3 * t) << 22) - umul32_hi (b, (t * t * t) << 8);
#endif // LARGE_TABLE
    /* Second NR iteration */
    s = umul32_hi (r, b);
    s = 0x30000000 - umul32_hi (r, s);
    r = umul32_hi (r, s);
    /* Compute sqrt(a) = a * 1/sqrt(a). Adjust to ensure it's an underestimate*/
    r = umul32_hi (r, b) - 1;
    /* Denormalize result */
    r = r >> ((scal >> 1) + 11);
    /* Make sure we got the floor correct */
    rem = a - r * r;
    if (rem >= (2 * r + 1)) r++;
    return r;

#endif // ALT_IMPL

uint32_t umul32_hi (uint32_t a, uint32_t b)

    return (uint32_t)(((uint64_t)a * b) >> 32);


uint32_t float_as_uint32 (float a)

    uint32_t r;
    memcpy (&r, &a, sizeof r);
    return r;


int clz32 (uint32_t a)

#if (CLZ_IMPL == CLZ_FPU)
    // Henry S. Warren, Jr, " Hacker's Delight 2nd ed", p. 105
    int n = 158 - (float_as_uint32 ((float)(int32_t)(a & ~(a >> 1))+.5f) >> 23);
    return (n < 0) ? 0 : n;
#elif (CLZ_IMPL == CLZ_CPU)
    static const uint8_t clz_tab[32] = 
        31, 22, 30, 21, 18, 10, 29,  2, 20, 17, 15, 13, 9,  6, 28, 1,
        23, 19, 11,  3, 16, 14,  7, 24, 12,  4,  8, 25, 5, 26, 27, 0
    ;
    a |= a >> 16;
    a |= a >> 8;
    a |= a >> 4;
    a |= a >> 2;
    a |= a >> 1;
    return clz_tab [0x07c4acddu * a >> 27] + (!a);
#else // CLZ_IMPL == CLZ_BUILTIN
#if defined(_MSC_VER) && defined(_WIN64)
    return __lzcnt (a);
#else // defined(_MSC_VER) && defined(_WIN64)
    return __builtin_clz (a);
#endif // defined(_MSC_VER) && defined(_WIN64)
#endif // CLZ_IMPL


/* Henry S. Warren, Jr.,  "Hacker's Delight, 2nd e.d", p. 286 */
uint32_t ref_isqrt32 (uint32_t x)

    uint32_t m, y, b;
    m = 0x40000000U;
    y = 0U;
    while (m != 0) 
        b = y | m;
        y = y >> 1;
        if (x >= b) 
            x = x - b;
            y = y | m;
        
        m = m >> 2;
    
    return y;


#if defined(_WIN32)
#if !defined(WIN32_LEAN_AND_MEAN)
#define WIN32_LEAN_AND_MEAN
#endif
#include <windows.h>
double second (void)

    LARGE_INTEGER t;
    static double oofreq;
    static int checkedForHighResTimer;
    static BOOL hasHighResTimer;

    if (!checkedForHighResTimer) 
        hasHighResTimer = QueryPerformanceFrequency (&t);
        oofreq = 1.0 / (double)t.QuadPart;
        checkedForHighResTimer = 1;
    
    if (hasHighResTimer) 
        QueryPerformanceCounter (&t);
        return (double)t.QuadPart * oofreq;
     else 
        return (double)GetTickCount() * 1.0e-3;
    

#elif defined(__linux__) || defined(__APPLE__)
#include <stddef.h>
#include <sys/time.h>
double second (void)

    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec * 1.0e-6;

#else
#error unsupported platform
#endif

int main (void)

#if ALT_IMPL
    printf ("Alternate integer square root implementation\n");
#else // ALT_IMPL
#if LARGE_TABLE
    printf ("Integer square root implementation w/ large table\n");
#else // LARGE_TAB
    printf ("Integer square root implementation w/ small table\n");
#endif
#endif // ALT_IMPL

#if GEN_TAB
    printf ("Generating lookup table ...\n");
#if ALT_IMPL
      for (int i = 0; i < 384; i++) 
        double x = 1.0 + (i + 1) * 1.0 / 128;
        double y = 1.0 / sqrt (x);
        uint8_t val = (uint8_t)((y * 512) - 256);
        rsqrt_tab[i] = val;
        printf ("0x%02x, ", rsqrt_tab[i]);
        if (i % 12 == 11) printf("\n");
    
#else // ALT_IMPL
    for (int i = 0; i < 96; i++ ) 
        double x1 = 1.0 + i * 1.0 / 32;
        double x2 = 1.0 + (i + 1) * 1.0 / 32;
        double y = (1.0 / sqrt(x1) + 1.0 / sqrt(x2)) * 0.5;
        uint32_t val = (uint32_t)(y * 256 + 0.5);
#if LARGE_TABLE
        uint32_t cube = val * val * val;
        rsqrt_tab[i] = (((cube + 1) / 4) << 10) + (3 * val);
        printf ("0x%08x, ", rsqrt_tab[i]);
        if (i % 6 == 5) printf ("\n");
#else // LARGE_TABLE
        rsqrt_tab[i] = val;
        printf ("0x%02x, ", rsqrt_tab[i]);
        if (i % 12 == 11) printf ("\n");
#endif // LARGE_TABLE
    
#endif // ALT_IMPL
#endif // GEN_TAB
    printf ("Running exhaustive test ... ");
    uint32_t i = 0;
    do 
        uint32_t ref = ref_isqrt32 (i);
        uint32_t res = my_isqrt32 (i);
        if (res != ref) 
            printf ("error: arg=%08x  res=%08x  ref=%08x\n", i, res, ref);
            return EXIT_FAILURE;
        
        i++;
     while (i);

    printf ("PASSED\n");
    printf ("Running benchmark ...\n");
    i = 0;
    uint32_t sum[8] = 0, 0, 0, 0, 0, 0, 0, 0;
    double start = second();
    do 
        sum [0] += my_isqrt32 (i + 0);
        sum [1] += my_isqrt32 (i + 1);
        sum [2] += my_isqrt32 (i + 2);
        sum [3] += my_isqrt32 (i + 3);
        sum [4] += my_isqrt32 (i + 4);
        sum [5] += my_isqrt32 (i + 5);
        sum [6] += my_isqrt32 (i + 6);
        sum [7] += my_isqrt32 (i + 7);
        i += 8;
     while (i);
    double stop = second();
    printf ("%08x \relapsed=%.5f sec\n", 
            sum[0]+sum[1]+sum[2]+sum[3]+sum[4]+sum[5]+sum[6]+sum[7],
            stop - start);
    return EXIT_SUCCESS;

【讨论】：

@dancxviii 对于 64 位，请参阅 this answer @dancxviii 对于 32 位版本，我根据最新的 gcc 和 Clang 生成的汇编代码和已发布的指令周期计数估计个周期（某些实现具有快速 UMULL其他人没有）。我现在没有可以运行的 ARM 平台。我首先有一个更简单更快的 64 位版本，但它在 2**63 左右的罕见测试用例中失败了。 @dancxviii 如果您有更好的解决方案，我建议您以答案的形式发布。我尝试对我的 64 位实现进行一些简单的调整，以通过修复功能错误来恢复我失去的性能，但无济于事。我的原始代码中的测试失败是由于在一万亿个测试用例中几十个参数缺少一个很小的 epsilon 量的舍入边界。真的很让人抓狂。 @dancxviii 使用严格的数学方法绝对可以证明这样的算法是正确的。据我所知，这正是在其他情况下所做的。但我不具备必要的专业知识。我是一名软件工程师，而不是数学家。【参考方案11】：

此方法类似于长除法：您对根的下一位数字进行猜测，进行减法，如果差值符合特定标准，则输入该数字。对于二进制版本，下一个数字的唯一选择是 0 或 1，因此您总是猜 1，做减法，然后输入 1，除非差为负数。

http://www.realitypixels.com/turk/opensource/index.html#FractSqrt

【讨论】：

【参考方案12】：

如果您只需要 ARM Thumb 2 处理器，ARM 的 CMSIS DSP 库是您的最佳选择。它是由设计 Thumb 2 处理器的人制作的。还有谁能打败它？

实际上，您甚至不需要算法，而是需要专门的平方根硬件指令，例如 VSQRT。这家 ARM 公司通过尝试使用 VSQRT 等硬件来维护针对 Thumb 2 支持的处理器高度优化的数学和 DSP 算法实现。可以获取源代码：

arm_sqrt_f32() arm_sqrt_q15.c / arm_sqrt_q31.c（q15 和 q31 是专门用于 ARM DSP 内核的定点数据类型，通常与 Thum 2 兼容处理器一起提供。）

请注意，ARM 还维护 CMSIS DSP 的编译二进制文件，以保证 ARM Thumb 架构特定指令的最佳性能。所以你应该在使用库时考虑静态链接它们。 You can get the binaries here.

【讨论】：

Kim，感谢您的回答，这对许多人来说肯定有用。但是，我正在寻找整数平方根。【参考方案13】：

我在 C# 中为 64 位整数实现了 Warren 的建议和 Newton 方法。 Isqrt 使用 Newton 方法，而 Isqrt 使用 Warren 方法。以下是源代码：

using System;

namespace Cluster

    public static class IntegerMath
    


        /// <summary>
        /// Compute the integer square root, the largest whole number less than or equal to the true square root of N.
        /// 
        /// This uses the integer version of Newton's method.
        /// </summary>
        public static long Isqrt(this long n)
        
            if (n < 0) throw new ArgumentOutOfRangeException("n", "Square root of negative numbers is not defined.");
            if (n <= 1) return n;

            var xPrev2 = -2L;
            var xPrev1 = -1L;
            var x = 2L;
            // From Wikipedia: if N + 1 is a perfect square, then the algorithm enters a two-value cycle, so we have to compare 
            // to two previous values to test for convergence.
            while (x != xPrev2 && x != xPrev1)
            
                xPrev2 = xPrev1;
                xPrev1 = x;
                x = (x + n/x)/2;
            
            // The two values x and xPrev1 will be above and below the true square root. Choose the lower one.
            return x < xPrev1 ? x : xPrev1;
        

        #region Sqrt using Bit-shifting and magic numbers.

        // From http://***.com/questions/1100090/looking-for-an-efficient-integer-square-root-algorithm-for-arm-thumb2
        // Converted to C#.
        private static readonly ulong debruijn= ( ~0x0218A392CD3D5DBFUL)>>6;
        private static readonly ulong[] SQRT = new ulong[64];
        private static readonly int[] DeBruijnArray = new int[64];

        static IntegerMath()
        
          for(int x= 0; x<64; ++x)
          
            ulong v= (ulong) ~( -2L<<(x));
            DeBruijnArray[(v*debruijn)>>58]= x;
          
          for(int x= 0; x<64; ++x)
            SQRT[x]= (ulong) (Math.Sqrt((1L<<DeBruijnArray[x])*Math.Sqrt(2)));
        

        public static long Isqrt2(this long n)
        
          ulong num = (ulong) n; 
          ulong y;
          if(num==0)
            return (long)num;
          
            ulong v= num;
            v|= v>>1; // first round up to one less than a power of 2 
            v|= v>>2;
            v|= v>>4;
            v|= v>>8;
            v|= v>>16;
            v|= v>>32;
            y= SQRT[(v*debruijn)>>58];
          
          y= (y+num/y)>>1;
          y= (y+num/y)>>1;
          y= (y+num/y)>>1;
          y= (y+num/y)>>1;
          // Make sure that our answer is rounded down, not up.
          return (long) (y*y>num?y-1:y);
        

        #endregion

我使用以下代码对代码进行基准测试：

using System;
using System.Diagnostics;
using Cluster;
using Microsoft.VisualStudio.TestTools.UnitTesting;

namespace ClusterTests

    [TestClass]
    public class IntegerMathTests
    
        [TestMethod]
        public void Isqrt_Accuracy()
        
            for (var n = 0L; n <= 100000L; n++)
            
                var expectedRoot = (long) Math.Sqrt(n);
                var actualRoot = n.Isqrt();
                Assert.AreEqual(expectedRoot, actualRoot, String.Format("Square root is wrong for N = 0.", n));
            
        

        [TestMethod]
        public void Isqrt2_Accuracy()
        
            for (var n = 0L; n <= 100000L; n++)
            
                var expectedRoot = (long)Math.Sqrt(n);
                var actualRoot = n.Isqrt2();
                Assert.AreEqual(expectedRoot, actualRoot, String.Format("Square root is wrong for N = 0.", n));
            
        

        [TestMethod]
        public void Isqrt_Speed()
        
            var integerTimer = new Stopwatch();
            var libraryTimer = new Stopwatch();

            integerTimer.Start();
            var total = 0L;
            for (var n = 0L; n <= 300000L; n++)
            
                var root = n.Isqrt();
                total += root;
            
            integerTimer.Stop();

            libraryTimer.Start();
            total = 0L;
            for (var n = 0L; n <= 300000L; n++)
            
                var root = (long)Math.Sqrt(n);
                total += root;
            
            libraryTimer.Stop();

            var isqrtMilliseconds = integerTimer.ElapsedMilliseconds;
            var libraryMilliseconds = libraryTimer.ElapsedMilliseconds;
            var msg = String.Format("Isqrt: 0 ms versus library: 1 ms", isqrtMilliseconds, libraryMilliseconds);
            Debug.WriteLine(msg);
            Assert.IsTrue(libraryMilliseconds > isqrtMilliseconds, "Isqrt2 should be faster than Math.Sqrt! " + msg);
        

        [TestMethod]
        public void Isqrt2_Speed()
        
            var integerTimer = new Stopwatch();
            var libraryTimer = new Stopwatch();

            var warmup = (10L).Isqrt2();

            integerTimer.Start();
            var total = 0L;
            for (var n = 0L; n <= 300000L; n++)
            
                var root = n.Isqrt2();
                total += root;
            
            integerTimer.Stop();

            libraryTimer.Start();
            total = 0L;
            for (var n = 0L; n <= 300000L; n++)
            
                var root = (long)Math.Sqrt(n);
                total += root;
            
            libraryTimer.Stop();

            var isqrtMilliseconds = integerTimer.ElapsedMilliseconds;
            var libraryMilliseconds = libraryTimer.ElapsedMilliseconds;
            var msg = String.Format("isqrt2: 0 ms versus library: 1 ms", isqrtMilliseconds, libraryMilliseconds);
            Debug.WriteLine(msg);
            Assert.IsTrue(libraryMilliseconds > isqrtMilliseconds, "Isqrt2 should be faster than Math.Sqrt! " + msg);

我在发布模式下的 Dell Latitude E6540、Visual Studio 2012 上的结果是图书馆调用 Math.Sqrt 更快。

59 毫秒 - 牛顿 (Isqrt) 12 毫秒 - 位移 (Iqrt2) 5 毫秒 - Math.Sqrt

我对编译器指令并不聪明，因此可以调整编译器以更快地获得整数数学。显然，移位方法非常接近库。在没有数学协处理器的系统上，它会非常快。

【讨论】：

【参考方案14】：

我为 RGB 伽玛压缩设计了一个 16 位 sqrt。它根据高 8 位分派到 3 个不同的表中。缺点：它使用大约一千字节的表格，如果不可能精确的 sqrt，则无法进行四舍五入，并且在最坏的情况下，使用单次乘法（但仅适用于极少数输入值）。

static uint8_t sqrt_50_256[] = 
  114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,
  133,134,135,136,137,138,139,140,141,142,143,143,144,145,146,147,148,149,150,
  150,151,152,153,154,155,155,156,157,158,159,159,160,161,162,163,163,164,165,
  166,167,167,168,169,170,170,171,172,173,173,174,175,175,176,177,178,178,179,
  180,181,181,182,183,183,184,185,185,186,187,187,188,189,189,190,191,191,192,
  193,193,194,195,195,196,197,197,198,199,199,200,201,201,202,203,203,204,204,
  205,206,206,207,207,208,209,209,210,211,211,212,212,213,214,214,215,215,216,
  217,217,218,218,219,219,220,221,221,222,222,223,223,224,225,225,226,226,227,
  227,228,229,229,230,230,231,231,232,232,233,234,234,235,235,236,236,237,237,
  238,238,239,239,240,241,241,242,242,243,243,244,244,245,245,246,246,247,247,
  248,248,249,249,250,250,251,251,252,252,253,253,254,254,255,255
;

static uint8_t sqrt_0_10[] = 
  1,2,3,3,4,4,5,5,5,6,6,6,7,7,7,7,8,8,8,8,9,9,9,9,9,10,10,10,10,10,11,11,11,
  11,11,11,12,12,12,12,12,12,13,13,13,13,13,13,13,14,14,14,14,14,14,14,15,15,
  15,15,15,15,15,15,16,16,16,16,16,16,16,16,17,17,17,17,17,17,17,17,17,18,18,
  18,18,18,18,18,18,18,19,19,19,19,19,19,19,19,19,19,20,20,20,20,20,20,20,20,
  20,20,21,21,21,21,21,21,21,21,21,21,21,22,22,22,22,22,22,22,22,22,22,22,23,
  23,23,23,23,23,23,23,23,23,23,23,24,24,24,24,24,24,24,24,24,24,24,24,25,25,
  25,25,25,25,25,25,25,25,25,25,25,26,26,26,26,26,26,26,26,26,26,26,26,26,27,
  27,27,27,27,27,27,27,27,27,27,27,27,27,28,28,28,28,28,28,28,28,28,28,28,28,
  28,28,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,30,30,30,30,30,30,30,30,
  30,30,30,30,30,30,30,31,31,31,31,31,31,31,31,31,31,31,31,31,31,31,31,32,32,
  32,32,32,32,32,32,32,32,32,32,32,32,32,32,33,33,33,33,33,33,33,33,33,33,33,
  33,33,33,33,33,33,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,35,35,
  35,35,35,35,35,35,35,35,35,35,35,35,35,35,35,35,36,36,36,36,36,36,36,36,36,
  36,36,36,36,36,36,36,36,36,37,37,37,37,37,37,37,37,37,37,37,37,37,37,37,37,
  37,37,37,38,38,38,38,38,38,38,38,38,38,38,38,38,38,38,38,38,38,38,39,39,39,
  39,39,39,39,39,39,39,39,39,39,39,39,39,39,39,39,39,40,40,40,40,40,40,40,40,
  40,40,40,40,40,40,40,40,40,40,40,40,41,41,41,41,41,41,41,41,41,41,41,41,41,
  41,41,41,41,41,41,41,41,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,
  42,42,42,42,43,43,43,43,43,43,43,43,43,43,43,43,43,43,43,43,43,43,43,43,43,
  43,44,44,44,44,44,44,44,44,44,44,44,44,44,44,44,44,44,44,44,44,44,44,45,45,
  45,45,45,45,45,45,45,45,45,45,45,45,45,45,45,45,45,45,45,45,45,46,46,46,46,
  46,46,46,46,46,46,46,46,46,46,46,46,46,46,46,46,46,46,46,47,47,47,47,47,47,
  47,47,47,47,47,47,47,47,47,47,47,47,47,47,47,47,47,47,48,48,48,48,48,48,48,
  48,48,48,48,48,48,48,48,48,48,48,48,48,48,48,48,48,49,49,49,49,49,49,49,49,
  49,49,49,49,49,49,49,49,49,49,49,49,49,49,49,49,49,50,50,50,50,50,50,50,50,
  50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,51,51,51,51,51,51,51,51,
  51,51,51,51,51,51,51,51,51,51,51,51,51,51,51,51,51,51,52,52,52,52,52,52,52,
  52,52,52,52,52,52,52,52,52,52,52,52,52,52,52,52,52,52,52,53,53
;

static uint8_t sqrt_11_49[] = 
  54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,0,76,77,78,
  0,79,80,81,82,83,0,84,85,86,0,87,88,89,0,90,0,91,92,93,0,94,0,95,96,97,0,98,0,
  99,100,101,0,102,0,103,0,104,105,106,0,107,0,108,0,109,0,110,0,111,112,113
;

uint16_t isqrt16(uint16_t v) 
  uint16_t a, b;
  uint16_t h = v>>8;
  if (h <= 10) return v ? sqrt_0_10[v>>2] : 0;
  if (h >= 50) return sqrt_50_256[h-50];
  h = (h-11)<<1;
  a = sqrt_11_49[h];
  b = sqrt_11_49[h+1];
  if (!a) return b;
  return b*b > v ? a : b;

我已经将它与基于 log2 的 sqrt 进行了比较，使用 clang 的 __builtin_clz（应该扩展为单个程序集操作码）和库的 sqrtf，使用 (int)sqrtf((float)i) 调用。并得到了相当奇怪的结果：

$ gcc -O3 test.c -o test && ./test 
isqrt16: 6.498955
sqrtf: 6.981861
log2_sqrt: 61.755873

Clang 将对 sqrtf 的调用编译为 sqrtss 指令，该指令几乎与表 sqrt 一样快。经验教训：在 x86 上，编译器可以提供足够快的sqrt，这比你自己想出的速度慢不到 10%，浪费了很多时间，或者如果你使用一些丑陋的按位，可以快 10 倍黑客。而且sqrtss还是比自定义函数慢一点，所以如果你真的需要这5%，你可以得到它们，例如ARM没有sqrtss，所以log2_sqrt应该不会落后那么糟糕。

在提供 FPU 的 x86 上，old Quake hack 似乎是计算整数 sqrt 的最快方法。它比这张表或 FPU 的 sqrtss 快 2 倍。

【讨论】：

以上是关于为 ARM Thumb2 寻找有效的整数平方根算法的主要内容，如果未能解决你的问题，请参考以下文章

优雅的点（）

计算乘法为完美平方的所有可能对的有效算法[关闭]