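C++ performance challenge: integer to std::string conversion

Posted: 2011-05-20 01:53:20

[Question]:

Can anyone beat the performance of my integer to std::string code, linked below?

There are already several questions that explain how to convert an integer into a std::string in C++, such as this one, but none of the solutions provided are efficient.

Here is compilable code for some common methods to compete against:

- The "C++ way", using stringstream: http://ideone.com/jh3Sa
- sprintf, which SO-ers usually recommend to the performance-conscious: http://ideone.com/82kwR

Contrary to popular belief, boost::lexical_cast has its own implementation (white paper) and does not use stringstream and the numeric insertion operators. I'd really like to see its performance compared, because this other question suggests that it's miserable.

And my own contribution, which is competitive on desktop computers and demonstrates an approach that also runs at full speed on embedded systems, unlike algorithms that depend on integer modulo:

- Ben's algorithm: http://ideone.com/SsEUW

If you want to use that code, I'll make it available under a simplified BSD license (commercial use allowed, attribution required). Just ask.

Finally, the function ltoa is non-standard but widely available.

I'll post my performance measurements as an answer shortly.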
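Rules for algorithms

- Provide code for conversion of at least 32-bit signed and unsigned integers into decimal.
- Produce output as a std::string.
- No tricks that are incompatible with threading and signals (for example, static buffers).
- You may assume an ASCII character set.
- Make sure to test your code on INT_MIN on a two's complement machine where the absolute value is not representable.
- Ideally, the output should be character-for-character identical with the canonical C++ version using stringstream, http://ideone.com/jh3Sa, but anything that is clearly understandable as the correct number is ok, too.
- NEW: Although you can use whatever compiler and optimizer options (except completely disabled) you want for the comparison, the code also needs to compile and give correct results under at least VC++ 2010 and g++.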
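
The correctness rules above can be checked mechanically. The following is a minimal sketch and not code from the thread: `reference_itostr` is assumed to behave like the canonical stringstream version, while `candidate_itostr` and `agrees_with_reference` are hypothetical names standing in for the conversion under test and its checker. It assumes a 32-bit int.

```cpp
#include <climits>
#include <cstddef>
#include <sstream>
#include <string>

// Reference: the "C++ way" using a string stream.
std::string reference_itostr(int n) {
    std::ostringstream out;
    out << n;
    return out.str();
}

// Hypothetical candidate under test (illustration only).
// The magnitude is computed in unsigned arithmetic so that INT_MIN,
// whose absolute value does not fit in int, is handled without overflow.
std::string candidate_itostr(int n) {
    unsigned magnitude = n < 0 ? 0u - static_cast<unsigned>(n)
                               : static_cast<unsigned>(n);
    char buf[12];                      // sign plus 10 digits needs 11 chars
    char* end = buf + sizeof buf;
    char* p = end;
    do {
        *--p = static_cast<char>('0' + magnitude % 10);
        magnitude /= 10;
    } while (magnitude != 0);
    if (n < 0) *--p = '-';
    return std::string(p, end);
}

// Returns true when the candidate matches the reference on the edge cases.
bool agrees_with_reference() {
    const int cases[] = { 0, 1, -1, 9, 10, -10, 99, 100, 12345, -12345,
                          INT_MAX, INT_MIN, INT_MIN + 1 };
    const std::size_t count = sizeof cases / sizeof cases[0];
    for (std::size_t i = 0; i < count; ++i) {
        if (candidate_itostr(cases[i]) != reference_itostr(cases[i])) return false;
    }
    return true;
}
```

Swapping any submitted algorithm in for `candidate_itostr` gives a quick check before benchmarking it; a fuller test would also sweep a large range of values.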
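Hoped-for discussion

Besides better algorithms, I'd also like to get some benchmarks on several different platforms and compilers (let's use MB/s throughput as our standard unit of measure). I believe that the code for my algorithm is well-defined behavior by the standard, at least under the ASCII assumption (I know the sprintf benchmark took some shortcuts, now fixed), but if you see any undefined behavior or inputs for which the output is invalid, please point it out.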
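
To make the MB/s metric concrete, here is a rough sketch of how such a number could be computed. It is not the timing framework used in the thread: `measure_throughput` is a hypothetical helper, `std::to_string` is only a stand-in for the conversion being measured, and 1 MB is assumed to mean 10^6 characters of output.

```cpp
#include <chrono>
#include <cstddef>
#include <string>

// Converts every integer in [first, last) and reports throughput in MB/s,
// counting the characters produced. Also returns the character count.
double measure_throughput(int first, int last, std::size_t& total_bytes) {
    total_bytes = 0;
    const std::chrono::steady_clock::time_point start =
        std::chrono::steady_clock::now();
    for (int i = first; i < last; ++i) {
        total_bytes += std::to_string(i).size();   // stand-in conversion
    }
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    if (elapsed.count() <= 0.0) return 0.0;        // clock too coarse for this run
    return static_cast<double>(total_bytes) / 1e6 / elapsed.count();
}
```

Because the metric counts output characters, a run dominated by long numbers reports a higher MB/s than one dominated by short numbers at the same conversions per second, which is worth keeping in mind when comparing results across input ranges.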
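Conclusions:

Different algorithms perform best on g++ and VC2010, likely due to the different implementations of std::string on each. VC2010 clearly does a better job with NRVO; getting rid of return-by-value helped only on gcc.

Code was found that outperforms sprintf by an order of magnitude. ostringstream falls behind by a factor of 50 and more.

The winner of the challenge is user434507, who produced code that runs at 350% of the speed of my own on gcc. Further entries are closed due to the whims of the SO community.

The current (final?) speed champions are:

- For gcc: user434507, at 8 times faster than sprintf: http://ideone.com/0uhhX
- For Visual C++: Timo, at 15 times faster than sprintf: http://ideone.com/VpKO3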
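[Comments]:

I think this "question" fits better here: programmers.stackexchange.com

Your problem is underspecified, as it doesn't explain what the result string should look like. Most likely, always returning the empty string would not be considered acceptable, but it conforms to the specification.

I voted to reopen this question; there's no reason for it to be closed.

The ideone links in this question are mostly dead. Could you include the code somewhere more reliable?

@BenVoigt I would ask the same. The links are all dead. I'd love to take a closer look at these.

[Answer 1]:

#include <string>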
const char digit_pairs[201] =
"00010203040506070809"
"10111213141516171819"
"20212223242526272829"
"30313233343536373839"
"40414243444546474849"
"50515253545556575859"
"60616263646566676869"
"70717273747576777879"
"80818283848586878889"
"90919293949596979899"
;
std::string& itostr(int n, std::string& s)
{
    if(n==0) {
        s="0";
        return s;
    }

    int sign = -(n<0);
    // Negate in unsigned arithmetic so that INT_MIN does not overflow.
    unsigned int val = ((unsigned int)n^sign)-sign;

    int size;
    if(val>=10000) {
        if(val>=10000000) {
            if(val>=1000000000)
                size=10;
            else if(val>=100000000)
                size=9;
            else
                size=8;
        } else {
            if(val>=1000000)
                size=7;
            else if(val>=100000)
                size=6;
            else
                size=5;
        }
    } else {
        if(val>=100) {
            if(val>=1000)
                size=4;
            else
                size=3;
        } else {
            if(val>=10)
                size=2;
            else
                size=1;
        }
    }
    size -= sign;
    s.resize(size);
    char* c = &s[0];
    if(sign)
        *c='-';

    c += size-1;
    while(val>=100) {
        int pos = val % 100;
        val /= 100;
        *(short*)(c-1)=*(short*)(digit_pairs+2*pos);
        c-=2;
    }
    while(val>0) {
        *c--='0' + (val % 10);
        val /= 10;
    }
    return s;
}

std::string& itostr(unsigned val, std::string& s)
{
    if(val==0) {
        s="0";
        return s;
    }

    int size;
    if(val>=10000) {
        if(val>=10000000) {
            if(val>=1000000000)
                size=10;
            else if(val>=100000000)
                size=9;
            else
                size=8;
        } else {
            if(val>=1000000)
                size=7;
            else if(val>=100000)
                size=6;
            else
                size=5;
        }
    } else {
        if(val>=100) {
            if(val>=1000)
                size=4;
            else
                size=3;
        } else {
            if(val>=10)
                size=2;
            else
                size=1;
        }
    }

    s.resize(size);
    char* c = &s[size-1];
    while(val>=100) {
        int pos = val % 100;
        val /= 100;
        *(short*)(c-1)=*(short*)(digit_pairs+2*pos);
        c-=2;
    }
    while(val>0) {
        *c--='0' + (val % 10);
        val /= 10;
    }
    return s;
}
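
This will blow up on systems that disallow unaligned memory accesses (in which case the first unaligned assignment via *(short*) would cause a segfault), but should work very nicely otherwise.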
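
One way to sidestep the alignment concern is to copy each digit pair with memcpy instead of a cast to short, as Timo's answer in this thread does. The sketch below is an illustration under that assumption, not the answer's code: `kDigitPairs`, `put_pair` and `utostr_memcpy` are hypothetical names, and whether the compiler turns the two-byte memcpy into a single store depends on the compiler and target.

```cpp
#include <cstring>
#include <string>

// Table of the 100 two-digit pairs "00" through "99", as in the answer.
static const char kDigitPairs[201] =
    "00010203040506070809" "10111213141516171819"
    "20212223242526272829" "30313233343536373839"
    "40414243444546474849" "50515253545556575859"
    "60616263646566676869" "70717273747576777879"
    "80818283848586878889" "90919293949596979899";

// Copies the two characters for a value in [0, 99] to dst.
// memcpy is valid for any alignment of dst, unlike a store through short*.
inline void put_pair(char* dst, unsigned value) {
    std::memcpy(dst, kDigitPairs + 2 * value, 2);
}

// Unsigned conversion that fills a local buffer from the back, two digits at a time.
std::string utostr_memcpy(unsigned val) {
    char buf[10];                          // 32-bit unsigned has at most 10 digits
    char* end = buf + sizeof buf;
    char* p = end;
    while (val >= 100) {
        p -= 2;
        put_pair(p, val % 100);
        val /= 100;
    }
    if (val >= 10) {
        p -= 2;
        put_pair(p, val);
    } else {
        *--p = static_cast<char>('0' + val);
    }
    return std::string(p, end);
}
```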
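
One important thing to do is to minimize the use of std::string. (Ironic, I know.) In Visual Studio, for example, most calls to methods of std::string are not inlined, even if you specify /Ob2 in compiler options. So even something as trivial as a call to std::string::clear(), which you might expect to be very fast, can take 100 clock ticks when linking the CRT as a static library, and as much as 300 clock ticks when linking as a DLL.

For the same reason, returning by reference is better because it avoids an assignment, a constructor and a destructor.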
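
As a small illustration of the return-by-reference point, the sketch below writes into a caller-owned string so that one object, and its capacity, is reused across many conversions. The names `utostr_into` and `total_length` are hypothetical and the digit loop is deliberately simple; how much this saves in practice depends on the standard library implementation.

```cpp
#include <cstddef>
#include <string>

// Writes the decimal form of val into s and returns s, so the caller can
// reuse one string object across many conversions.
std::string& utostr_into(unsigned val, std::string& s) {
    char buf[10];
    char* end = buf + sizeof buf;
    char* p = end;
    do {
        *--p = static_cast<char>('0' + val % 10);
        val /= 10;
    } while (val != 0);
    s.assign(p, static_cast<std::size_t>(end - p));
    return s;
}

// Example caller: sums the lengths of all conversions in [first, last)
// while constructing and destroying only a single std::string.
std::size_t total_length(unsigned first, unsigned last) {
    std::string s;
    std::size_t total = 0;
    for (unsigned i = first; i < last; ++i) {
        total += utostr_into(i, s).size();
    }
    return total;
}
```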
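[Comments]:

Thanks for the attempt. On ideone (ideone.com/BCp5r), it runs at 18.5 MB/s, about half the speed of sprintf. With VC++ 2010, it gets about 50 MB/s, about twice the speed of sprintf.

MB/s is a strange metric, especially seeing how you don't remove trailing whitespace from the string in your implementations. My updated code runs faster than your implementation with x64 VC++ 2005 on a Core i7 920 (16.2M ops/s vs. 14.8M ops/s); _ltoa does 8.5M ops/s and sprintf() does 3.85M ops/s.

Your code doesn't properly resize the string; mine does (see lines 81, 198, and 290). I took some shortcuts in the sprintf implementation, which I already mentioned in my question, but I believe the code-to-beat gives exactly the same result as stringstream.

I've also fixed the sprintf wrapper, to avoid confusion.

BTW, your improved version (ideone.com/GLAbS) gets 41.7 MB/s on ideone, and about 120 MB/s on VC++ 2010 32-bit.

[Answer 2]: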
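Ah, awesome challenge by the way... I've had a lot of fun with this.

I have two algorithms to submit (code is at the bottom if you feel like skipping to it). In my comparisons I require that the function return a string and that it can handle int and unsigned int. Comparing things that don't construct a string to those that do doesn't really make sense.

The first one is a fun implementation that doesn't use any precomputed lookup tables or explicit division/modulo. This one is competitive with the others on gcc and with all but Timo's on msvc (for a good reason that I explain below). The second algorithm is my actual submission for highest performance. In my tests it beats all the others on both gcc and msvc.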
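
The "no explicit division" idea rests on replacing a division by a constant with a multiplication and a shift. The sketch below illustrates that building block using two constants that appear in the hopman_fun code further down (410 with a shift of 12, and 3518437209 with a shift of 45); the function names and the self-check are hypothetical and are not part of the answer.

```cpp
#include <cstdint>

// v / 10 for small v: 410 / 4096 is slightly above 1/10, which is close
// enough to give the exact quotient for every v below 1000.
inline std::uint32_t div10_small(std::uint32_t v) {
    return (v * 410u) >> 12;
}

// v / 10000 for any 32-bit v: 3518437209 is 2^45 / 10000 rounded up,
// and the product is formed in 64 bits so it cannot overflow.
inline std::uint32_t div10000(std::uint32_t v) {
    return static_cast<std::uint32_t>(
        (static_cast<std::uint64_t>(v) * 3518437209ull) >> 45);
}

// Compares both shortcuts against real division; returns true if they agree.
bool multiply_shift_agrees() {
    for (std::uint32_t v = 0; v < 1000; ++v) {
        if (div10_small(v) != v / 10) return false;
    }
    // Stride through the whole 32-bit range, then check the top value.
    for (std::uint64_t v = 0; v <= 0xFFFFFFFFull; v += 65521) {
        const std::uint32_t x = static_cast<std::uint32_t>(v);
        if (div10000(x) != x / 10000) return false;
    }
    return div10000(0xFFFFFFFFu) == 0xFFFFFFFFu / 10000;
}
```

The small-range version only works because its inputs are known to be bounded; the answer applies it to several packed fields at once, which this sketch does not attempt to show.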
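
I think I know why some of the results on MSVC are very good. std::string has two relevant constructors, std::string(char* str, size_t n) and std::string(ForwardIterator b, ForwardIterator e). gcc does the same thing for both of them... that is, it uses the second to implement the first. The first constructor can be implemented significantly more efficiently than that, and MSVC does so. The side benefit of this is that in some cases (like my fast code and Timo's code) the string constructor can be inlined. In fact, just switching between these constructors in MSVC is almost a 2x difference for my code.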

My performance testing results:

Code sources: Voigt, Timo, ergosys, user434507, user-voigt-timo, hopman-fun, hopman-fast

gcc 4.4.5 -O2 on Ubuntu 10.10 64-bit, Core i5:

hopman_fun: 124.688 MB/sec --- 8.020 s
hopman_fast: 137.552 MB/sec --- 7.270 s
voigt: 120.192 MB/sec --- 8.320 s
user_voigt_timo: 97.9432 MB/sec --- 10.210 s
timo: 120.482 MB/sec --- 8.300 s
user: 97.7517 MB/sec --- 10.230 s
ergosys: 101.42 MB/sec --- 9.860 s

MSVC 2010 64-bit /Ox on Windows 7 64-bit, Core i5:

hopman_fun: 127 MB/sec --- 7.874 s
hopman_fast: 259 MB/sec --- 3.861 s
voigt: 221.435 MB/sec --- 4.516 s
user_voigt_timo: 195.695 MB/sec --- 5.110 s
timo: 253.165 MB/sec --- 3.950 s
user: 212.63 MB/sec --- 4.703 s
ergosys: 78.0518 MB/sec --- 12.812 s

Here are some results and a testing/timing framework on ideone: http://ideone.com/XZRqp

Note that ideone is a 32-bit environment. Both of my algorithms suffer from that, but hopman_fast is at least still competitive.

Note that for the two or so that don't construct a string, I added the following function template:

template <typename T>
std::string itostr(T t) {
    std::string ret;
    itostr(t, ret);
    return ret;
}

Now for my code... first the fun one:

// hopman_fun

template <typename T>
T reduce2(T v) {
    T k = ((v * 410) >> 12) & 0x000F000F000F000Full;
    return (((v - k * 10) << 8) + k);
}

template <typename T>
T reduce4(T v) {
    T k = ((v * 10486) >> 20) & 0xFF000000FFull;
    return reduce2(((v - k * 100) << 16) + (k));
}

typedef unsigned long long ull;
inline ull reduce8(ull v) {
    ull k = ((v * 3518437209u) >> 45);
    return reduce4(((v - k * 10000) << 32) + (k));
}

template <typename T>
std::string itostr(T o) {
    union {
        char str[16];
        unsigned short u2[8];
        unsigned u4[4];
        unsigned long long u8[2];
    };

    unsigned v = o < 0 ? ~o + 1 : o;

    u8[0] = (ull(v) * 3518437209u) >> 45;
    u8[0] = (u8[0] * 28147497672ull);
    u8[1] = v - u2[3] * 100000000;

    u8[1] = reduce8(u8[1]);
    char* f;
    if (u2[3]) {
        u2[3] = reduce2(u2[3]);
        f = str + 6;
    } else {
        unsigned short* k = u4[2] ? u2 + 4 : u2 + 6;
        f = *k ? (char*)k : (char*)(k + 1);
    }
    if (!*f) f++;

    u4[1] |= 0x30303030;
    u4[2] |= 0x30303030;
    u4[3] |= 0x30303030;
    if (o < 0) *--f = '-';
    return std::string(f, (str + 16) - f);
}

And then the fast one:

// hopman_fast

struct itostr_helper {
    static unsigned out[10000];

    itostr_helper() {
        for (int i = 0; i < 10000; i++) {
            unsigned v = i;
            char * o = (char*)(out + i);
            o[3] = v % 10 + '0';
            o[2] = (v % 100) / 10 + '0';
            o[1] = (v % 1000) / 100 + '0';
            o[0] = (v % 10000) / 1000;
            if (o[0]) o[0] |= 0x30;
            else if (o[1] != '0') o[0] |= 0x20;
            else if (o[2] != '0') o[0] |= 0x10;
            else o[0] |= 0x00;
        }
    }
};
unsigned itostr_helper::out[10000];

itostr_helper hlp_init;

template <typename T>
std::string itostr(T o) {
    typedef itostr_helper hlp;

    unsigned blocks[3], *b = blocks + 2;
    blocks[0] = o < 0 ? ~o + 1 : o;
    blocks[2] = blocks[0] % 10000; blocks[0] /= 10000;
    blocks[2] = hlp::out[blocks[2]];

    if (blocks[0]) {
        blocks[1] = blocks[0] % 10000; blocks[0] /= 10000;
        blocks[1] = hlp::out[blocks[1]];
        blocks[2] |= 0x30303030;
        b--;
    }

    if (blocks[0]) {
        blocks[0] = hlp::out[blocks[0] % 10000];
        blocks[1] |= 0x30303030;
        b--;
    }

    char* f = ((char*)b);
    f += 3 - (*f >> 4);

    char* str = (char*)blocks;
    if (o < 0) *--f = '-';
    return std::string(f, (str + 12) - f);
}
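
[Comments]:

For those interested in how hopman-fun works but who don't feel like puzzling it out, I created a commented version at ideone.com/rnDxk

I don't understand how the first one works, even with the comments. :D The fast one is really nice, though it has its price in memory usage. But I guess 40kB is still acceptable. I actually modified my own code to also use 4-character groups, and got similar speed. ideone.com/KbTFe

Would it be difficult to modify it to work with uint64_t? I moved this code to C and replaced 'T' with an int type and it works, but it doesn't work for uint64_t and I have no idea how to customize it.

[Answer 3]:

Benchmark data for the code provided in the question:

On ideone (gcc 4.3.4):

- stringstreams: 4.4 MB/s
- sprintf: 25.0 MB/s
- mine (Ben Voigt): 55.8 MB/s
- Timo: 58.5 MB/s
- user434507: 199 MB/s
- user434507's Ben-Timo-507 hybrid: 263 MB/s

Core i7, Windows 7 64-bit, 8 GB RAM, Visual C++ 2010 32-bit:

cl /Ox /EHsc

Core i7, Windows 7 64-bit, 8 GB RAM, Visual C++ 2010 64-bit:

cl /Ox /EHsc

Core i7, Windows 7 64-bit, 8 GB RAM, cygwin gcc 4.3.4:

g++ -O3

Edit: I was going to add my own answer, but the question was closed, so I'm adding it here. :) I wrote my own algorithm and managed to get a decent improvement over Ben's code, though I only tested it in MSVC 2010. I also made a benchmark of all the implementations presented so far, using the same testing setup as in Ben's original code. -- Timo

Intel Q9450, Win XP 32-bit, MSVC 2010

cl /O2 /EHsc
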
const char digit_pairs[201] =
"00010203040506070809"
"10111213141516171819"
"20212223242526272829"
"30313233343536373839"
"40414243444546474849"
"50515253545556575859"
"60616263646566676869"
"70717273747576777879"
"80818283848586878889"
"90919293949596979899"
;

// Needs <cstring> (memcpy) and <string>.
static const int BUFFER_SIZE = 11;

std::string itostr(int val)
{
  char buf[BUFFER_SIZE];
  char *it = &buf[BUFFER_SIZE-2];

  if(val>=0) {
    int div = val/100;
    while(div) {
      memcpy(it,&digit_pairs[2*(val-div*100)],2);
      val = div;
      it-=2;
      div = val/100;
    }
    memcpy(it,&digit_pairs[2*val],2);
    if(val<10)
      it++;
  } else {
    int div = val/100;
    while(div) {
      memcpy(it,&digit_pairs[-2*(val-div*100)],2);
      val = div;
      it-=2;
      div = val/100;
    }
    memcpy(it,&digit_pairs[-2*val],2);
    if(val<=-10)
      it--;
    *it = '-';
  }

  return std::string(it,&buf[BUFFER_SIZE]-it);
}

std::string itostr(unsigned int val)
{
  char buf[BUFFER_SIZE];
  char *it = (char*)&buf[BUFFER_SIZE-2];

  int div = val/100;
  while(div) {
    memcpy(it,&digit_pairs[2*(val-div*100)],2);
    val = div;
    it-=2;
    div = val/100;
  }
  memcpy(it,&digit_pairs[2*val],2);
  if(val<10)
    it++;

  return std::string((char*)it,(char*)&buf[BUFFER_SIZE]-(char*)it);
}
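
[Comments]:

Thanks for this info, please explain the gcc speed! It's very low :(

@Behrouz: Indeed. I'm not exactly sure why gcc is so slow, whether it's gcc's version of std::string or poor optimization of the arithmetic code. I'll make another version that doesn't convert to std::string at the end and see if gcc fares any better.

@Timo: That's very cool. I didn't really expect the change to an unsigned buffer to help with VC++, which was already quite fast, so it was only applicable to gcc, and now user434507 has provided a far better version there.

I think you should add a version that doesn't convert to std::string. By changing just one line of code, the function runs in half the time on my machine, using GCC. And by removing the std::string, people would be able to use this function inside C programs.

[Answer 4]: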
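While the info we get here for the algorithms is pretty nice, I think the question is "broken", and I'll explain why I think this:

The question asks to take the performance of int -> std::string conversion, and this may be of interest when comparing commonly available methods, such as different stringstream implementations or boost::lexical_cast. It does not, however, make sense when asking for new code, a specialized algorithm, to do this. The reason is that int2string will always involve heap allocation from std::string, and if we are trying to squeeze the last bit out of our conversion algorithm, I do not think it makes sense to mix these measurements up with the heap allocations done by std::string. If I want performant conversion, I will always use a fixed-size buffer and certainly never allocate anything on the heap!

To sum up, I think the timings should be split:

- First, the fastest (int -> fixed buffer) conversion.
- Second, the timing of the (fixed buffer -> std::string) copy.
- Third, checking how the std::string allocation can directly be used as the buffer, to save the copying.

These aspects should not be mixed up in one timing, IMHO.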
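
The split proposed above could look roughly like the following sketch, where the conversion into a fixed buffer and the copy into a std::string are separate functions that can be timed independently. Both function names are hypothetical, the digit loop is a plain one rather than any of the tuned algorithms from this thread, and a 32-bit int is assumed.

```cpp
#include <cstddef>
#include <string>

// Stage 1: int -> caller-supplied buffer of at least 11 chars.
// Returns the number of characters written; no terminator, no heap use.
std::size_t int_to_buffer(int value, char* out) {
    unsigned magnitude = value < 0 ? 0u - static_cast<unsigned>(value)
                                   : static_cast<unsigned>(value);
    char reversed[10];
    std::size_t n = 0;
    do {
        reversed[n++] = static_cast<char>('0' + magnitude % 10);
        magnitude /= 10;
    } while (magnitude != 0);

    std::size_t len = 0;
    if (value < 0) out[len++] = '-';
    while (n != 0) out[len++] = reversed[--n];
    return len;
}

// Stage 2: fixed buffer -> std::string, a separate step whose cost
// (including any allocation) can be measured on its own.
std::string buffer_to_string(const char* buf, std::size_t len) {
    return std::string(buf, len);
}
```

Timing a loop over `int_to_buffer` alone, and then a loop that also calls `buffer_to_string`, would separate the first two measurements in the list; the third would need an algorithm that writes straight into the string's own storage.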
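
[Comments]:

"int2string will always involve heap allocation from std::string": not with the small string optimization, which is present in most current implementations of the Standard library.

In the end, the "output as std::string" requirement was placed there just to make things fair and consistent for all submissions. Algorithms that are faster to make std::string results will also be faster to fill a preallocated buffer.

@Ben - good comments. Especially the small string optimization is something I'll have to keep in mind in the future when judging std::string performance.

[Answer 5]: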
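I can't test under VS, but this seems to be about 10% faster than your code for g++. It could probably be tuned; the decision values chosen are guesses. int only, sorry.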

typedef unsigned buf_t;

static buf_t * reduce(unsigned val, buf_t * stp) {
   unsigned above = val / 10000;
   if (above != 0) {
      stp = reduce(above, stp);
      val -= above * 10000;
   }

   buf_t digit  = val / 1000;
   *stp++ = digit + '0';
   val -= digit * 1000;

   digit  = val / 100;
   *stp++ = digit + '0';
   val -= digit * 100;

   digit  = val / 10;
   *stp++ = digit + '0';
   val -= digit * 10;
   *stp++ = val + '0';
   return stp;
}

std::string itostr(int input) {

   buf_t buf[16];

   if(input == INT_MIN) {
      char buf2[16];
      std::sprintf(buf2, "%d", input);
      return std::string(buf2);
   }

   // handle negative
   unsigned val = input;
   if(input < 0)
      val = -input;

   buf[0] = '0';
   buf_t* endp = reduce(val, buf+1);
   *endp = 127;

   buf_t * stp = buf+1;
   while (*stp == '0')
      stp++;
   if (stp == endp)
      stp--;

   if (input < 0) {
      stp--;
      *stp = '-';
   }
   return std::string(stp, endp);
}
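
[Comments]:

With an unsigned variant: ideone.com/pswq9. It seems that changing the buffer type from char to unsigned produces a similar speed improvement in my code, at least on gcc/ideone: ideone.com/uthKK. I'll test on VS tomorrow.

[Answer 6]:

Updated user2985907's answer... modp_ufast...
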
Integer To String Test (Type 1)
[modp_ufast]Numbers: 240000000 Total: 657777786 Time: 1.1633sec Rate:206308473.0686nums/sec
[sprintf] Numbers: 240000000 Total: 657777786 Time: 24.3629sec Rate: 9851045.8556nums/sec
[karma] Numbers: 240000000 Total: 657777786 Time: 5.2389sec Rate: 45810870.7171nums/sec
[strtk] Numbers: 240000000 Total: 657777786 Time: 3.3126sec Rate: 72450283.7492nums/sec
[so ] Numbers: 240000000 Total: 657777786 Time: 3.0828sec Rate: 77852152.8820nums/sec
[timo ] Numbers: 240000000 Total: 657777786 Time: 4.7349sec Rate: 50687912.9889nums/sec
[voigt] Numbers: 240000000 Total: 657777786 Time: 5.1689sec Rate: 46431985.1142nums/sec
[hopman] Numbers: 240000000 Total: 657777786 Time: 4.6169sec Rate: 51982554.6497nums/sec
Press any key to continue . . .
Integer To String Test(Type 2)
[modp_ufast]Numbers: 240000000 Total: 660000000 Time: 0.5072sec Rate:473162716.4618nums/sec
[sprintf] Numbers: 240000000 Total: 660000000 Time: 22.3483sec Rate: 10739062.9383nums/sec
[karma] Numbers: 240000000 Total: 660000000 Time: 4.2471sec Rate: 56509024.3035nums/sec
[strtk] Numbers: 240000000 Total: 660000000 Time: 2.1683sec Rate:110683636.7123nums/sec
[so ] Numbers: 240000000 Total: 660000000 Time: 2.7133sec Rate: 88454602.1423nums/sec
[timo ] Numbers: 240000000 Total: 660000000 Time: 2.8030sec Rate: 85623453.3872nums/sec
[voigt] Numbers: 240000000 Total: 660000000 Time: 3.4019sec Rate: 70549286.7776nums/sec
[hopman] Numbers: 240000000 Total: 660000000 Time: 2.7849sec Rate: 86178023.8743nums/sec
Press any key to continue . . .
Integer To String Test (type 3)
[modp_ufast]Numbers: 240000000 Total: 505625000 Time: 1.6482sec Rate:145610315.7819nums/sec
[sprintf] Numbers: 240000000 Total: 505625000 Time: 20.7064sec Rate: 11590618.6109nums/sec
[karma] Numbers: 240000000 Total: 505625000 Time: 4.3036sec Rate: 55767734.3570nums/sec
[strtk] Numbers: 240000000 Total: 505625000 Time: 2.9297sec Rate: 81919227.9275nums/sec
[so ] Numbers: 240000000 Total: 505625000 Time: 3.0278sec Rate: 79266003.8158nums/sec
[timo ] Numbers: 240000000 Total: 505625000 Time: 4.0631sec Rate: 59068204.3266nums/sec
[voigt] Numbers: 240000000 Total: 505625000 Time: 4.5616sec Rate: 52613393.0285nums/sec
[hopman] Numbers: 240000000 Total: 505625000 Time: 4.1248sec Rate: 58184194.4569nums/sec
Press any key to continue . . .
int ufast_utoa10(unsigned int value, char* str)
#define JOIN(N) N "0", N "1", N "2", N "3", N "4", N "5", N "6", N "7", N "8", N "9"
#define JOIN2(N) JOIN(N "0"), JOIN(N "1"), JOIN(N "2"), JOIN(N "3"), JOIN(N "4"), \
JOIN(N "5"), JOIN(N "6"), JOIN(N "7"), JOIN(N "8"), JOIN(N "9")
#define JOIN3(N) JOIN2(N "0"), JOIN2(N "1"), JOIN2(N "2"), JOIN2(N "3"), JOIN2(N "4"), \
JOIN2(N "5"), JOIN2(N "6"), JOIN2(N "7"), JOIN2(N "8"), JOIN2(N "9")
#define JOIN4 JOIN3("0"), JOIN3("1"), JOIN3("2"), JOIN3("3"), JOIN3("4"), \
JOIN3("5"), JOIN3("6"), JOIN3("7"), JOIN3("8"), JOIN3("9")
#define JOIN5(N) JOIN(N), JOIN(N "1"), JOIN(N "2"), JOIN(N "3"), JOIN(N "4"), \
JOIN(N "5"), JOIN(N "6"), JOIN(N "7"), JOIN(N "8"), JOIN(N "9")
#define JOIN6 JOIN5(), JOIN5("1"), JOIN5("2"), JOIN5("3"), JOIN5("4"), \
JOIN5("5"), JOIN5("6"), JOIN5("7"), JOIN5("8"), JOIN5("9")
#define F(N) ((N) >= 100 ? 3 : (N) >= 10 ? 2 : 1)
#define F10(N) F(N),F(N+1),F(N+2),F(N+3),F(N+4),F(N+5),F(N+6),F(N+7),F(N+8),F(N+9)
#define F100(N) F10(N),F10(N+10),F10(N+20),F10(N+30),F10(N+40),\
F10(N+50),F10(N+60),F10(N+70),F10(N+80),F10(N+90)
static const short offsets[] = F100(0), F100(100), F100(200), F100(300), F100(400),
F100(500), F100(600), F100(700), F100(800), F100(900);
static const char table1[][4] = JOIN("") ;
static const char table2[][4] = JOIN2("") ;
static const char table3[][4] = JOIN3("") ;
static const char table4[][5] = JOIN4 ;
static const char table5[][4] = JOIN6 ;
#undef JOIN
#undef JOIN2
#undef JOIN3
#undef JOIN4
char *wstr;
int remains[2];
unsigned int v2;
if (value >= 100000000)
v2 = value / 10000;
remains[0] = value - v2 * 10000;
value = v2;
v2 = value / 10000;
remains[1] = value - v2 * 10000;
value = v2;
wstr = str;
if (value >= 1000)
*(__int32 *) wstr = *(__int32 *) table4[value];
wstr += 4;
else
*(__int32 *) wstr = *(__int32 *) table5[value];
wstr += offsets[value];
*(__int32 *) wstr = *(__int32 *) table4[remains[1]];
wstr += 4;
*(__int32 *) wstr = *(__int32 *) table4[remains[0]];
wstr += 4;
*wstr = 0;
return (wstr - str);
else if (value >= 10000)
v2 = value / 10000;
remains[0] = value - v2 * 10000;
value = v2;
wstr = str;
if (value >= 1000)
*(__int32 *) wstr = *(__int32 *) table4[value];
wstr += 4;
*(__int32 *) wstr = *(__int32 *) table4[remains[0]];
wstr += 4;
*wstr = 0;
return 8;
else
*(__int32 *) wstr = *(__int32 *) table5[value];
wstr += offsets[value];
*(__int32 *) wstr = *(__int32 *) table4[remains[0]];
wstr += 4;
*wstr = 0;
return (wstr - str);
else
if (value >= 1000)
*(__int32 *) str = *(__int32 *) table4[value];
str += 4;
*str = 0;
return 4;
else if (value >= 100)
*(__int32 *) str = *(__int32 *) table3[value];
return 3;
else if (value >= 10)
*(__int16 *) str = *(__int16 *) table2[value];
str += 2;
*str = 0;
return 2;
else
*(__int16 *) str = *(__int16 *) table1[value];
return 1;
int ufast_itoa10(int value, char* str)
{
   if (value < 0) {
      *(str++) = '-';
      // Negate as unsigned so that INT_MIN is handled without signed overflow.
      return ufast_utoa10(0u - (unsigned int) value, str) + 1;
   }
   else return ufast_utoa10(value, str);
}

void ufast_test()
{
   print_mode("[modp_ufast]");

   std::string s;
   s.reserve(32);
   std::size_t total_length = 0;
   strtk::util::timer t;
   t.start();

   char buf[128];
   int len;
   for (int i = (-max_i2s / 2); i < (max_i2s / 2); ++i)
   {
      #ifdef enable_test_type01
      s.resize(ufast_itoa10(((i & 1) ? i : -i), const_cast<char*>(s.c_str())));
      total_length += s.size();
      #endif

      #ifdef enable_test_type02
      s.resize(ufast_itoa10(max_i2s + i, const_cast<char*>(s.c_str())));
      total_length += s.size();
      #endif

      #ifdef enable_test_type03
      s.resize(ufast_itoa10(randval[(max_i2s + i) & 1023], const_cast<char*>(s.c_str())));
      total_length += s.size();
      #endif
   }
   t.stop();
   printf("Numbers:%10lu\tTotal:%12lu\tTime:%8.4fsec\tRate:%14.4fnums/sec\n",
          static_cast<unsigned long>(3 * max_i2s),
          static_cast<unsigned long>(total_length),
          t.time(),
          (3.0 * max_i2s) / t.time());
}
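
[Comments]:

You never put it into the string. Also, I don't know why your results for everyone else's code are so low; your CPU isn't slow.

modp_ufast has an error: it returns 10 instead of 1000000, 19 instead of 1090000, and so on, up to 11000000.

The modified ufast returns invalid values (stopped after a few errors):

Mismatch found: Generated: -99 Reference: -9099999
Mismatch found: Generated: -99 Reference: -9099998
Mismatch found: Generated: -99 Reference: -9099997

Here is a more portable version with benchmarks: github.com/fmtlib/format-benchmark/blob/master/src/u2985907.h

[Answer 7]:

I've been sitting on this for a while and finally found the time to post it.

Here are a few more methods, compared to the double-word-at-a-time hopman_fast. The results are for GCC's short-string-optimized std::string, as otherwise performance differences get obscured by the overhead of the copy-on-write string management code. Throughput is measured in the same way as elsewhere in this topic; cycle counts are for the raw serialization parts of the code, before copying the output buffer into a string.
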
HOPMAN_FAST - performance reference
TM_CPP, TM_VEC - scalar and vector versions of Terje Mathisen algorithm
WM_VEC - intrinsics implementation of Wojciech Mula's vector algorithm
AK_BW - word-at-a-time routine with a jump table that fills a buffer in reverse
AK_FW - forward-stepping word-at-a-time routine with a jump table in assembly
AK_UNROLLED - generic word-at-a-time routine that uses an unrolled loop

Compile-time switches:

-DVSTRING - enables SSO strings for older GCC setups
-DBSR1 - enables fast log10
-DRDTSC - enables cycle counters
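
The "fast log10" mentioned in the switches is about finding the number of decimal digits from the position of the highest set bit. The sketch below shows the portable form of that idea, in the spirit of the non-GCC fallback in the listing that follows; `bit_length` and `decimal_digits` are hypothetical names, and the bit loop is a simple stand-in for a bsr/clz instruction.

```cpp
#include <cstddef>
#include <cstdint>

// Number of significant bits in x (0 for x == 0).
inline unsigned bit_length(std::uint32_t x) {
    unsigned bits = 0;
    while (x != 0) {
        ++bits;
        x >>= 1;
    }
    return bits;
}

// Decimal digit count of x. Multiplying the bit length by 1233 / 4096
// (about log10(2)) gives a guess that is either exact or one too low;
// a single comparison against a power of ten corrects it.
inline std::size_t decimal_digits(std::uint32_t x) {
    static const std::uint32_t powers[] = {
        0, 10, 100, 1000, 10000, 100000, 1000000,
        10000000, 100000000, 1000000000 };
    const std::size_t guess = (bit_length(x) * 1233u) >> 12;
    return guess + (x >= powers[guess] ? 1 : 0);
}
```

Knowing the length up front is what lets the forward-stepping routines write digits from the first position instead of filling a buffer backwards.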
#include <cstdio>
#include <iostream>
#include <climits>
#include <sstream>
#include <algorithm>
#include <cstring>
#include <limits>
#include <ctime>
#include <stdint.h>
#include <x86intrin.h>
/* Uncomment to run */
// #define HOPMAN_FAST
// #define TM_CPP
// #define TM_VEC
// #define WM_VEC
// #define AK_UNROLLED
// #define AK_BW
// #define AK_FW
using namespace std;
#ifdef VSTRING
#include <ext/vstring.h>
typedef __gnu_cxx::__vstring string_type;
#else
typedef string string_type;
#endif
namespace detail
#ifdef __GNUC__
#define ALIGN(N) __attribute__ ((aligned(N)))
#define PACK __attribute__ ((packed))
inline size_t num_digits(unsigned u)
struct
uint32_t count;
uint32_t max;
static digits[32] ALIGN(64) =
1, 9 , 1, 9 , 1, 9 , 1, 9 ,
2, 99 , 2, 99 , 2, 99 ,
3, 999 , 3, 999 , 3, 999 ,
4, 9999 , 4, 9999 , 4, 9999 , 4, 9999 ,
5, 99999 , 5, 99999 , 5, 99999 ,
6, 999999 , 6, 999999 , 6, 999999 ,
7, 9999999 , 7, 9999999 , 7, 9999999 , 7, 9999999 ,
8, 99999999 , 8, 99999999 , 8, 99999999 ,
9, 999999999 , 9, 999999999 , 9, 999999999 ,
10, UINT_MAX , 10, UINT_MAX
;
#if (defined(i386) || defined(__x86_64__)) && (defined(BSR1) || defined(BSR2))
size_t l = u;
#if defined(BSR1)
__asm__ __volatile__ (
"bsrl %k0, %k0 \n\t"
"shlq $32, %q1 \n\t"
"movq %c2(,%0,8), %0\n\t"
"cmpq %0, %q1 \n\t"
"seta %b1 \n\t"
"addl %1, %k0 \n\t"
: "+r" (l), "+r"(u)
: "i"(digits)
: "cc"
);
return l;
#else
__asm__ __volatile__ ( "bsr %0, %0;" : "+r" (l) );
return digits[l].count + ( u > digits[l].max );
#endif
#else
size_t l = (u != 0) ? 31 - __builtin_clz(u) : 0;
return digits[l].count + ( u > digits[l].max );
#endif
#else
inline unsigned msb_u32(unsigned x) {
    static const unsigned bval[] = { 0,1,2,2,3,3,3,3,4,4,4,4,4,4,4,4 };
    unsigned base = 0;
    if (x & (unsigned) 0xFFFF0000) { base += 32/2; x >>= 32/2; }
    if (x & (unsigned) 0x0000FF00) { base += 32/4; x >>= 32/4; }
    if (x & (unsigned) 0x000000F0) { base += 32/8; x >>= 32/8; }
    return base + bval[x];
}

inline size_t num_digits(unsigned x) {
    static const unsigned powertable[] = {
        0,10,100,1000,10000,100000,1000000,10000000,100000000, 1000000000 };
    size_t lg_ten = msb_u32(x) * 1233 >> 12;
    size_t adjust = (x >= powertable[lg_ten]);
    return lg_ten + adjust;
}
#endif /* __GNUC__ */
struct CharBuffer
class reverse_iterator : public iterator<random_access_iterator_tag, char>
char* m_p;
public:
reverse_iterator(char* p) : m_p(p - 1)
reverse_iterator operator++() return --m_p;
reverse_iterator operator++(int) return m_p--;
char operator*() const return *m_p;
bool operator==( reverse_iterator it) const return m_p == it.m_p;
bool operator!=( reverse_iterator it) const return m_p != it.m_p;
difference_type operator-( reverse_iterator it) const return it.m_p - m_p;
;
;
union PairTable
char c[2];
unsigned short u;
PACK table[100] ALIGN(1024) =
'0','0','0','1','0','2','0','3','0','4','0','5','0','6','0','7','0','8','0','9',
'1','0','1','1','1','2','1','3','1','4','1','5','1','6','1','7','1','8','1','9',
'2','0','2','1','2','2','2','3','2','4','2','5','2','6','2','7','2','8','2','9',
'3','0','3','1','3','2','3','3','3','4','3','5','3','6','3','7','3','8','3','9',
'4','0','4','1','4','2','4','3','4','4','4','5','4','6','4','7','4','8','4','9',
'5','0','5','1','5','2','5','3','5','4','5','5','5','6','5','7','5','8','5','9',
'6','0','6','1','6','2','6','3','6','4','6','5','6','6','6','7','6','8','6','9',
'7','0','7','1','7','2','7','3','7','4','7','5','7','6','7','7','7','8','7','9',
'8','0','8','1','8','2','8','3','8','4','8','5','8','6','8','7','8','8','8','9',
'9','0','9','1','9','2','9','3','9','4','9','5','9','6','9','7','9','8','9','9'
;
// namespace detail
struct progress_timer {
    clock_t c;
    progress_timer() : c(clock()) {}
    int elapsed() { return clock() - c; }
    ~progress_timer() {
        clock_t d = clock() - c;
        cout << d / CLOCKS_PER_SEC << "."
            << (((d * 1000) / CLOCKS_PER_SEC) % 1000 / 100)
            << (((d * 1000) / CLOCKS_PER_SEC) % 100 / 10)
            << (((d * 1000) / CLOCKS_PER_SEC) % 10)
            << " s" << endl;
    }
};
#ifdef HOPMAN_FAST
namespace hopman_fast
static unsigned long cpu_cycles = 0;
struct itostr_helper
static ALIGN(1024) unsigned out[10000];
itostr_helper()
for (int i = 0; i < 10000; i++)
unsigned v = i;
char * o = (char*)(out + i);
o[3] = v % 10 + '0';
o[2] = (v % 100) / 10 + '0';
o[1] = (v % 1000) / 100 + '0';
o[0] = (v % 10000) / 1000;
if (o[0]) o[0] |= 0x30;
else if (o[1] != '0') o[0] |= 0x20;
else if (o[2] != '0') o[0] |= 0x10;
else o[0] |= 0x00;
;
unsigned itostr_helper::out[10000];
itostr_helper hlp_init;
template <typename T>
string_type itostr(T o)
typedef itostr_helper hlp;
#ifdef RDTSC
long first_clock = __rdtsc();
#endif
unsigned blocks[3], *b = blocks + 2;
blocks[0] = o < 0 ? ~o + 1 : o;
blocks[2] = blocks[0] % 10000; blocks[0] /= 10000;
blocks[2] = hlp::out[blocks[2]];
if (blocks[0])
blocks[1] = blocks[0] % 10000; blocks[0] /= 10000;
blocks[1] = hlp::out[blocks[1]];
blocks[2] |= 0x30303030;
b--;
if (blocks[0])
blocks[0] = hlp::out[blocks[0] % 10000];
blocks[1] |= 0x30303030;
b--;
char* f = ((char*)b);
f += 3 - (*f >> 4);
char* str = (char*)blocks;
if (o < 0) *--f = '-';
str += 12;
#ifdef RDTSC
cpu_cycles += __rdtsc() - first_clock;
#endif
return string_type(f, str);
unsigned long cycles() return cpu_cycles;
void reset() cpu_cycles = 0;
#endif
namespace ak
#ifdef AK_UNROLLED
namespace unrolled
static unsigned long cpu_cycles = 0;
template <typename value_type> class Proxy
static const size_t MaxValueSize = 16;
static inline char* generate(int value, char* buffer)
union char* pc; unsigned short* pu; b = buffer + MaxValueSize ;
unsigned u, v = value < 0 ? unsigned(~value) + 1 : value;
*--b.pu = detail::table[v % 100].u; u = v;
if ((v /= 100))
*--b.pu = detail::table[v % 100].u; u = v;
if ((v /= 100))
*--b.pu = detail::table[v % 100].u; u = v;
if ((v /= 100))
*--b.pu = detail::table[v % 100].u; u = v;
if ((v /= 100))
*--b.pu = detail::table[v % 100].u; u = v;
*(b.pc -= (u >= 10)) = '-';
return b.pc + (value >= 0);
static inline char* generate(unsigned value, char* buffer)
union char* pc; unsigned short* pu; b = buffer + MaxValueSize ;
unsigned u, v = value;
*--b.pu = detail::table[v % 100].u; u = v;
if ((v /= 100))
*--b.pu = detail::table[v % 100].u; u = v;
if ((v /= 100))
*--b.pu = detail::table[v % 100].u; u = v;
if ((v /= 100))
*--b.pu = detail::table[v % 100].u; u = v;
if ((v /= 100))
*--b.pu = detail::table[v % 100].u; u = v;
return b.pc + (u < 10);
public:
static inline string_type convert(value_type v)
char buf[MaxValueSize];
#ifdef RDTSC
long first_clock = __rdtsc();
#endif
char* p = generate(v, buf);
char* e = buf + MaxValueSize;
#ifdef RDTSC
cpu_cycles += __rdtsc() - first_clock;
#endif
return string_type(p, e);
;
string_type itostr(int i) return Proxy<int>::convert(i);
string_type itostr(unsigned i) return Proxy<unsigned>::convert(i);
unsigned long cycles() return cpu_cycles;
void reset() cpu_cycles = 0;
#endif
#if defined(AK_BW)
namespace bw
static unsigned long cpu_cycles = 0;
typedef uint64_t u_type;
template <typename value_type> class Proxy
static inline void generate(unsigned v, size_t len, char* buffer)
u_type u = v;
switch(len)
default: u = (v * 1374389535ULL) >> 37; *(uint16_t*)(buffer + 8) = detail::table[v -= 100 * u].u;
case 8: v = (u * 1374389535ULL) >> 37; *(uint16_t*)(buffer + 6) = detail::table[u -= 100 * v].u;
case 6: u = (v * 1374389535ULL) >> 37; *(uint16_t*)(buffer + 4) = detail::table[v -= 100 * u].u;
case 4: v = (u * 167773) >> 24; *(uint16_t*)(buffer + 2) = detail::table[u -= 100 * v].u;
case 2: *(uint16_t*)buffer = detail::table[v].u;
case 0: return;
case 9: u = (v * 1374389535ULL) >> 37; *(uint16_t*)(buffer + 7) = detail::table[v -= 100 * u].u;
case 7: v = (u * 1374389535ULL) >> 37; *(uint16_t*)(buffer + 5) = detail::table[u -= 100 * v].u;
case 5: u = (v * 1374389535ULL) >> 37; *(uint16_t*)(buffer + 3) = detail::table[v -= 100 * u].u;
case 3: v = (u * 167773) >> 24; *(uint16_t*)(buffer + 1) = detail::table[u -= 100 * v].u;
case 1: *buffer = v + 0x30;
public:
static inline string_type convert(bool neg, unsigned val)
char buf[16];
#ifdef RDTSC
long first_clock = __rdtsc();
#endif
size_t len = detail::num_digits(val);
buf[0] = '-';
char* e = buf + neg;
generate(val, len, e);
e += len;
#ifdef RDTSC
cpu_cycles += __rdtsc() - first_clock;
#endif
return string_type(buf, e);
;
string_type itostr(int i) return Proxy<int>::convert(i < 0, i < 0 ? unsigned(~i) + 1 : i);
string_type itostr(unsigned i) return Proxy<unsigned>::convert(false, i);
unsigned long cycles() return cpu_cycles;
void reset() cpu_cycles = 0;
#endif
#if defined(AK_FW)
namespace fw
static unsigned long cpu_cycles = 0;
typedef uint32_t u_type;
template <typename value_type> class Proxy
static inline void generate(unsigned v, size_t len, char* buffer)
#if defined(__GNUC__) && defined(__x86_64__)
uint16_t w;
uint32_t u;
__asm__ __volatile__ (
"jmp %*T%=(,%3,8) \n\t"
"T%=: .quad L0%= \n\t"
" .quad L1%= \n\t"
" .quad L2%= \n\t"
" .quad L3%= \n\t"
" .quad L4%= \n\t"
" .quad L5%= \n\t"
" .quad L6%= \n\t"
" .quad L7%= \n\t"
" .quad L8%= \n\t"
" .quad L9%= \n\t"
" .quad L10%= \n\t"
"L10%=: \n\t"
" imulq $1441151881, %q0, %q1\n\t"
" shrq $57, %q1 \n\t"
" movw %c5(,%q1,2), %w2 \n\t"
" imull $100000000, %1, %1 \n\t"
" subl %1, %0 \n\t"
" movw %w2, (%4) \n\t"
"L8%=: \n\t"
" imulq $1125899907, %q0, %q1\n\t"
" shrq $50, %q1 \n\t"
" movw %c5(,%q1,2), %w2 \n\t"
" imull $1000000, %1, %1 \n\t"
" subl %1, %0 \n\t"
" movw %w2, -8(%4,%3) \n\t"
"L6%=: \n\t"
" imulq $429497, %q0, %q1 \n\t"
" shrq $32, %q1 \n\t"
" movw %c5(,%q1,2), %w2 \n\t"
" imull $10000, %1, %1 \n\t"
" subl %1, %0 \n\t"
" movw %w2, -6(%4,%3) \n\t"
"L4%=: \n\t"
" imull $167773, %0, %1 \n\t"
" shrl $24, %1 \n\t"
" movw %c5(,%q1,2), %w2 \n\t"
" imull $100, %1, %1 \n\t"
" subl %1, %0 \n\t"
" movw %w2, -4(%4,%3) \n\t"
"L2%=: \n\t"
" movw %c5(,%q0,2), %w2 \n\t"
" movw %w2, -2(%4,%3) \n\t"
"L0%=: jmp 1f \n\t"
"L9%=: \n\t"
" imulq $1801439851, %q0, %q1\n\t"
" shrq $54, %q1 \n\t"
" movw %c5(,%q1,2), %w2 \n\t"
" imull $10000000, %1, %1 \n\t"
" subl %1, %0 \n\t"
" movw %w2, (%4) \n\t"
"L7%=: \n\t"
" imulq $43980466, %q0, %q1 \n\t"
" shrq $42, %q1 \n\t"
" movw %c5(,%q1,2), %w2 \n\t"
" imull $100000, %1, %1 \n\t"
" subl %1, %0 \n\t"
" movw %w2, -7(%4,%3) \n\t"
"L5%=: \n\t"
" imulq $268436, %q0, %q1 \n\t"
" shrq $28, %q1 \n\t"
" movw %c5(,%q1,2), %w2 \n\t"
" imull $1000, %1, %1 \n\t"
" subl %1, %0 \n\t"
" movw %w2, -5(%4,%3) \n\t"
"L3%=: \n\t"
" imull $6554, %0, %1 \n\t"
" shrl $15, %1 \n\t"
" andb $254, %b1 \n\t"
" movw %c5(,%q1), %w2 \n\t"
" leal (%1,%1,4), %1 \n\t"
" subl %1, %0 \n\t"
" movw %w2, -3(%4,%3) \n\t"
"L1%=: \n\t"
" addl $48, %0 \n\t"
" movb %b0, -1(%4,%3) \n\t"
"1: \n\t"
: "+r"(v), "=&q"(u), "=&r"(w)
: "r"(len), "r"(buffer), "i"(detail::table)
: "memory", "cc"
);
#else
u_type u;
switch(len)
default: u = (v * 1441151881ULL) >> 57; *(uint16_t*)(buffer) = detail::table[u].u; v -= u * 100000000;
case 8: u = (v * 1125899907ULL) >> 50; *(uint16_t*)(buffer + len - 8) = detail::table[u].u; v -= u * 1000000;
case 6: u = (v * 429497ULL) >> 32; *(uint16_t*)(buffer + len - 6) = detail::table[u].u; v -= u * 10000;
case 4: u = (v * 167773) >> 24; *(uint16_t*)(buffer + len - 4) = detail::table[u].u; v -= u * 100;
case 2: *(uint16_t*)(buffer + len - 2) = detail::table[v].u;
case 0: return;
case 9: u = (v * 1801439851ULL) >> 54; *(uint16_t*)(buffer) = detail::table[u].u; v -= u * 10000000;
case 7: u = (v * 43980466ULL) >> 42; *(uint16_t*)(buffer + len - 7) = detail::table[u].u; v -= u * 100000;
case 5: u = (v * 268436ULL) >> 28; *(uint16_t*)(buffer + len - 5) = detail::table[u].u; v -= u * 1000;
case 3: u = (v * 6554) >> 16; *(uint16_t*)(buffer + len - 3) = detail::table[u].u; v -= u * 10;
case 1: *(buffer + len - 1) = v + 0x30;
#endif
public:
static inline string_type convert(bool neg, unsigned val)
char buf[16];
#ifdef RDTSC
long first_clock = __rdtsc();
#endif
size_t len = detail::num_digits(val);
if (neg) buf[0] = '-';
char* e = buf + len + neg;
generate(val, len, buf + neg);
#ifdef RDTSC
cpu_cycles += __rdtsc() - first_clock;
#endif
return string_type(buf, e);
;
string_type itostr(int i) return Proxy<int>::convert(i < 0, i < 0 ? unsigned(~i) + 1 : i);
string_type itostr(unsigned i) return Proxy<unsigned>::convert(false, i);
unsigned long cycles() return cpu_cycles;
void reset() cpu_cycles = 0;
#endif
// ak
namespace wm
#ifdef WM_VEC
#if defined(__GNUC__) && defined(__x86_64__)
namespace vec
static unsigned long cpu_cycles = 0;
template <typename value_type> class Proxy
static inline unsigned generate(unsigned v, char* buf)
static struct
unsigned short mul_10[8];
unsigned short div_const[8];
unsigned short shl_const[8];
unsigned char to_ascii[16];
ALIGN(64) bits =
// mul_10
10, 10, 10, 10, 10, 10, 10, 10
,
// div_const
8389, 5243, 13108, 0x8000, 8389, 5243, 13108, 0x8000
,
// shl_const
1 << (16 - (23 + 2 - 16)),
1 << (16 - (19 + 2 - 16)),
1 << (16 - 1 - 2),
1 << (15),
1 << (16 - (23 + 2 - 16)),
1 << (16 - (19 + 2 - 16)),
1 << (16 - 1 - 2),
1 << (15)
,
// to_ascii
'0', '0', '0', '0', '0', '0', '0', '0',
'0', '0', '0', '0', '0', '0', '0', '0'
;
unsigned x, y, l;
x = (v * 1374389535ULL) >> 37;
y = v;
l = 0;
if (x)
unsigned div = 0xd1b71759;
unsigned mul = 55536;
__m128i z, m, a, o;
y -= 100 * x;
z = _mm_cvtsi32_si128(x);
m = _mm_load_si128((__m128i*)bits.mul_10);
o = _mm_mul_epu32( z, _mm_cvtsi32_si128(div));
z = _mm_add_epi32( z, _mm_mul_epu32( _mm_cvtsi32_si128(mul), _mm_srli_epi64( o, 45) ) );
z = _mm_slli_epi64( _mm_shuffle_epi32( _mm_unpacklo_epi16(z, z), 5 ), 2 );
a = _mm_load_si128((__m128i*)bits.to_ascii);
z = _mm_mulhi_epu16( _mm_mulhi_epu16( z, *(__m128i*)bits.div_const ), *(__m128i*)bits.shl_const );
z = _mm_sub_epi16( z, _mm_slli_epi64( _mm_mullo_epi16( m, z ), 16 ) );
z = _mm_add_epi8( _mm_packus_epi16( z, _mm_xor_si128(o, o) ), a );
x = __builtin_ctz( ~_mm_movemask_epi8( _mm_cmpeq_epi8( a, z ) ) );
l = 8 - x;
uint64_t q = _mm_cvtsi128_si64(z) >> (x * 8);
*(uint64_t*)buf = q;
buf += l;
x = 1;
v = (y * 6554) >> 16;
l += 1 + (x | (v != 0));
*(unsigned short*)buf = 0x30 + ((l > 1) ? ((0x30 + y - v * 10) << 8) + v : y);
return l;
public:
static inline string_type convert(bool neg, unsigned val)
char buf[16];
#ifdef RDTSC
long first_clock = __rdtsc();
#endif
buf[0] = '-';
unsigned len = generate(val, buf + neg);
char* e = buf + len + neg;
#ifdef RDTSC
cpu_cycles += __rdtsc() - first_clock;
#endif
return string_type(buf, e);
;
inline string_type itostr(int i) return Proxy<int>::convert(i < 0, i < 0 ? unsigned(~i) + 1 : i);
inline string_type itostr(unsigned i) return Proxy<unsigned>::convert(false, i);
unsigned long cycles() return cpu_cycles;
void reset() cpu_cycles = 0;
#endif
#endif
// wm
namespace tmn
#ifdef TM_CPP
namespace cpp
static unsigned long cpu_cycles = 0;
template <typename value_type> class Proxy
static inline void generate(unsigned v, char* buffer)
unsigned const f1_10000 = (1 << 28) / 10000;
unsigned tmplo, tmphi;
unsigned lo = v % 100000;
unsigned hi = v / 100000;
tmplo = lo * (f1_10000 + 1) - (lo >> 2);
tmphi = hi * (f1_10000 + 1) - (hi >> 2);
unsigned mask = 0x0fffffff;
unsigned shift = 28;
for(size_t i = 0; i < 5; i++)
buffer[i + 0] = '0' + (char)(tmphi >> shift);
buffer[i + 5] = '0' + (char)(tmplo >> shift);
tmphi = (tmphi & mask) * 5;
tmplo = (tmplo & mask) * 5;
mask >>= 1;
shift--;
public:
static inline string_type convert(bool neg, unsigned val)
#ifdef RDTSC
long first_clock = __rdtsc();
#endif
char buf[16];
size_t len = detail::num_digits(val);
char* e = buf + 11;
generate(val, buf + 1);
buf[10 - len] = '-';
len += neg;
char* b = e - len;
#ifdef RDTSC
cpu_cycles += __rdtsc() - first_clock;
#endif
return string_type(b, e);
;
string_type itostr(int i) return Proxy<int>::convert(i < 0, i < 0 ? unsigned(~i) + 1 : i);
string_type itostr(unsigned i) return Proxy<unsigned>::convert(false, i);
unsigned long cycles() return cpu_cycles;
void reset() cpu_cycles = 0;
#endif
#ifdef TM_VEC
namespace vec
static unsigned long cpu_cycles = 0;
template <typename value_type> class Proxy
static inline unsigned generate(unsigned val, char* buffer)
static struct
unsigned char mul_10[16];
unsigned char to_ascii[16];
unsigned char gather[16];
unsigned char shift[16];
ALIGN(64) bits =
10,0,0,0,10,0,0,0,10,0,0,0,10,0,0,0 ,
'0','0','0','0','0','0','0','0','0','0','0','0','0','0','0','0' ,
3,5,6,7,9,10,11,13,14,15,0,0,0,0,0,0 ,
0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
;
unsigned u = val / 1000000;
unsigned l = val - u * 1000000;
__m128i x, h, f, m, n;
n = _mm_load_si128((__m128i*)bits.mul_10);
x = _mm_set_epi64x( l, u );
h = _mm_mul_epu32( x, _mm_set1_epi32(4294968) );
x = _mm_sub_epi64( x, _mm_srli_epi64( _mm_mullo_epi32( h, _mm_set1_epi32(1000) ), 32 ) );
f = _mm_set1_epi32((1 << 28) / 1000 + 1);
m = _mm_srli_epi32( _mm_cmpeq_epi32(m, m), 4 );
x = _mm_shuffle_epi32( _mm_blend_epi16( x, h, 204 ), 177 );
f = _mm_sub_epi32( _mm_mullo_epi32(f, x), _mm_srli_epi32(x, 2) );
h = _mm_load_si128((__m128i*)bits.to_ascii);
x = _mm_srli_epi32(f, 28);
f = _mm_mullo_epi32( _mm_and_si128( f, m ), n );
x = _mm_or_si128( x, _mm_slli_epi32(_mm_srli_epi32(f, 28), 8) );
f = _mm_mullo_epi32( _mm_and_si128( f, m ), n );
x = _mm_or_si128( x, _mm_slli_epi32(_mm_srli_epi32(f, 28), 16) );
f = _mm_mullo_epi32( _mm_and_si128( f, m ), n );
x = _mm_or_si128( x, _mm_slli_epi32(_mm_srli_epi32(f, 28), 24) );
x = _mm_add_epi8( _mm_shuffle_epi8(x, *(__m128i*)bits.gather), h );
l = __builtin_ctz( ~_mm_movemask_epi8( _mm_cmpeq_epi8( h, x ) ) | (1 << 9) );
x = _mm_shuffle_epi8( x, _mm_add_epi8(*(__m128i*)bits.shift, _mm_set1_epi8(l) ) );
_mm_store_si128( (__m128i*)buffer, x );
return 10 - l;
public:
static inline string_type convert(bool neg, unsigned val)
#ifdef RDTSC
long first_clock = __rdtsc();
#endif
char arena[32];
char* buf = (char*)((uintptr_t)(arena + 16) & ~(uintptr_t)0xf);
*(buf - 1)= '-';
unsigned len = generate(val, buf) + neg;
buf -= neg;
char* end = buf + len;
#ifdef RDTSC
cpu_cycles += __rdtsc() - first_clock;
#endif
return string_type(buf, end);
;
string_type itostr(int i) return Proxy<int>::convert(i < 0, i < 0 ? unsigned(~i) + 1 : i);
string_type itostr(unsigned i) return Proxy<unsigned>::convert(false, i);
unsigned long cycles() return cpu_cycles;
void reset() cpu_cycles = 0;
#endif
bool fail(string in, string_type out)
cout << "failure: " << in << " => " << out << endl;
return false;
#define TEST(x, n) \
stringstream ss; \
string_type s = n::itostr(x); \
ss << (long long)x; \
if (::strcmp(ss.str().c_str(), s.c_str())) \
passed = fail(ss.str(), s); \
break; \
#define test(x) \
passed = true; \
if (0 && passed) \
char c = CHAR_MIN; \
do \
TEST(c, x); \
while (c++ != CHAR_MAX); \
if (!passed) cout << #x << " failed char!!!" << endl; \
\
if (0 && passed) \
short c = numeric_limits<short>::min(); \
do \
TEST(c, x); \
while (c++ != numeric_limits<short>::max()); \
if (!passed) cout << #x << " failed short!!!" << endl; \
\
if (passed) \
int c = numeric_limits<int>::min(); \
do \
TEST(c, x); \
while ((c += 100000) < numeric_limits<int>::max() - 100000); \
if (!passed) cout << #x << " failed int!!!" << endl; \
\
if (passed) \
unsigned c = numeric_limits<unsigned>::max(); \
do \
TEST(c, x); \
while ((c -= 100000) > 100000); \
if (!passed) cout << #x << " failed unsigned int!!!" << endl; \
\
#define time(x, N) \
if (passed) \
static const int64_t limits[] = \
0, 10, 100, 1000, 10000, 100000, \
1000000, 10000000, 100000000, 1000000000, 10000000000ULL ; \
long passes = 0; \
cout << #x << ": "; \
progress_timer t; \
uint64_t s = 0; \
if (do_time) \
for (int n = 0; n < N1; n++) \
int i = 0; \
while (i < N2) \
int v = ((NM - i) % limits[N]) | (limits[N] / 10); \
int w = x::itostr(v).size() + \
x::itostr(-v).size(); \
i += w * mult; \
passes++; \
\
s += i / mult; \
\
\
k += s; \
cout << N << " digits: " \
<< s / double(t.elapsed()) * CLOCKS_PER_SEC/1000000 << " MB/sec, " << (x::cycles() / passes >> 1) << " clocks per pass "; \
x::reset(); \
#define series(n) \
if (do_test) test(n); if (do_time) time(n, 1); if (do_time) time(n, 2); \
if (do_time) time(n, 3); if (do_time) time(n, 4); if (do_time) time(n, 5); \
if (do_time) time(n, 6); if (do_time) time(n, 7); if (do_time) time(n, 8); \
if (do_time) time(n, 9); if (do_time) time(n, 10);
int N1 = 1, N2 = 500000000, NM = INT_MAX;
int mult = 1; // used to stay under timelimit on ideone
unsigned long long k = 0;
int main(int argc, char** argv)
bool do_time = 1, do_test = 1;
bool passed = true;
#ifdef HOPMAN_FAST
series(hopman_fast)
#endif
#ifdef WM_VEC
series(wm::vec)
#endif
#ifdef TM_CPP
series(tmn::cpp)
#endif
#ifdef TM_VEC
series(tmn::vec)
#endif
#ifdef AK_UNROLLED
series(ak::unrolled)
#endif
#if defined(AK_BW)
series(ak::bw)
#endif
#if defined(AK_FW)
series(ak::fw)
#endif
return k;
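[Discussion]:
[Answer 8]: Here is my small attempt at this fun puzzle.
Rather than using lookup tables, I wanted the compiler to figure it all out. In this case in particular: if you read Hacker's Delight, you will see how divide and modulo work, which makes it very feasible to optimize this using SSE/AVX instructions.
Performance benchmark
As for speed, my benchmark tells me it is 1.5 times faster than Timo's work (on my Intel Haswell it runs at approximately 1 GB/s).
Things you could consider cheating
As for the not-making-a-std-string cheat that I use: of course I took that into account in my benchmark of Timo's method as well.
I do use one intrinsic: BSR. If you like, you can use DeBruijn tables instead, which is one of the things I wrote quite a bit about in my "fastest 2log" post. Of course, this does carry a performance penalty (*well... if you are doing a lot of itoa operations you can actually make a faster BSR, but I guess that's not fair...).
The way it works
The first thing to do is figure out how much memory we need. This is basically a 10log, which can be implemented in a number of smart ways. See the frequently quoted "Bit Twiddling Hacks" for details.
The next thing to do is produce the digits. I use template recursion for this, so the compiler will figure it out.
I use 'modulo' and 'div' right next to each other. If you read Hacker's Delight, you will notice that the two are closely related, so if you have one answer, you probably have the other as well. I figured the compiler can work out the details... :-)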
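The close relationship between the quotient and the remainder can be sketched as follows. This is only an illustration, not the answer's code: the names `DivMod10` and `divmod10` are hypothetical, and the constant 0xCCCCCCCD with a shift of 35 is the well-known reciprocal that compilers commonly emit for unsigned 32-bit division by 10.

```cpp
#include <cstdint>

// Hypothetical helper type: quotient and remainder of v / 10.
struct DivMod10 {
    std::uint32_t quot;
    std::uint32_t rem;
};

inline DivMod10 divmod10(std::uint32_t v) {
    // 0xCCCCCCCD is ceil(2^35 / 10). The 64-bit product shifted right by 35
    // equals v / 10 for every 32-bit v, so no divide instruction is needed.
    std::uint32_t q = static_cast<std::uint32_t>(
        (static_cast<std::uint64_t>(v) * 0xCCCCCCCDull) >> 35);
    // The remainder falls out of the quotient with one multiply and subtract.
    std::uint32_t r = v - q * 10u;
    return {q, r};
}
```

Writing `value / 10` and `value % 10` side by side, as the answer does, usually lets an optimizing compiler share the multiply in the same way.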
The code
Getting the number of digits using a (modified) log10:
#include <intrin.h> // _BitScanReverse (MSVC intrinsic)

struct logarithm
{
    static inline int log2(unsigned int value)
    {
        unsigned long index;
        if (!_BitScanReverse(&index, value))
        {
            return 0;
        }

        // add 1 if x is NOT a power of 2 (to do the ceil)
        return index + (value&(value - 1) ? 1 : 0);
    }

    static inline int numberDigits(unsigned int v)
    {
        static unsigned int const PowersOf10[] =
            { 0, 10, 100, 1000, 10000, 100000, 1000000, 10000000, 100000000, 1000000000 };

        int t = (logarithm::log2(v) + 1) * 1233 >> 12; // (use a lg2 method from above)
        return 1 + t - (v < PowersOf10[t]);
    }
};
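For readers without the MSVC intrinsic, the digit count can be sketched portably. This follows the "Bit Twiddling Hacks" formulation (floor of log2 rather than the ceiling used above) and is an illustration under assumptions: `floor_log2` and `count_digits` are hypothetical names, and the loop-based log2 is a slow stand-in for BSR.

```cpp
#include <cstdint>

// Portable floor(log2(v)) for v > 0; stands in for the BSR intrinsic.
inline int floor_log2(std::uint32_t v) {
    int r = 0;
    while (v >>= 1) ++r;
    return r;
}

// Number of decimal digits needed to print v (1 for v == 0).
inline int count_digits(std::uint32_t v) {
    static const std::uint32_t kPow10[] = {
        1u, 10u, 100u, 1000u, 10000u, 100000u,
        1000000u, 10000000u, 100000000u, 1000000000u};
    if (v == 0) return 1;
    // 1233 / 4096 approximates log10(2), so t is either floor(log10(v))
    // or one more than that.
    int t = ((floor_log2(v) + 1) * 1233) >> 12;
    // One comparison against the power-of-ten table settles which.
    return t + (v >= kPow10[t] ? 1 : 0);
}
```

The table lookup replaces a chain of comparisons, which is the point of computing the logarithm first.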
Getting the string:
template <int count>
struct WriteHelper
{
    inline static void WriteChar(char* buf, unsigned int value)
    {
        unsigned int div = value / 10;
        unsigned int rem = value % 10;
        buf[count - 1] = rem + '0';

        WriteHelper<count - 1>::WriteChar(buf, div);
    }
};

template <>
struct WriteHelper<1>
{
    inline static void WriteChar(char* buf, unsigned int value)
    {
        buf[0] = '0' + value;
    }
};

// Boring code that converts a length into a switch.
// TODO: Test if recursion with an 'if' is faster.
static inline void WriteNumber(char* data, int len, unsigned int val)
{
    switch (len)
    {
    case 1: WriteHelper<1>::WriteChar(data, val); break;
    case 2: WriteHelper<2>::WriteChar(data, val); break;
    case 3: WriteHelper<3>::WriteChar(data, val); break;
    case 4: WriteHelper<4>::WriteChar(data, val); break;
    case 5: WriteHelper<5>::WriteChar(data, val); break;
    case 6: WriteHelper<6>::WriteChar(data, val); break;
    case 7: WriteHelper<7>::WriteChar(data, val); break;
    case 8: WriteHelper<8>::WriteChar(data, val); break;
    case 9: WriteHelper<9>::WriteChar(data, val); break;
    case 10: WriteHelper<10>::WriteChar(data, val); break;
    }
}

// The main method you want to call...
static int Write(char* data, int val)
{
    int len;
    if (val >= 0)
    {
        len = logarithm::numberDigits(val);
        WriteNumber(data, len, static_cast<unsigned int>(val));
        return len;
    }
    else
    {
        // Negate in unsigned arithmetic so that INT_MIN does not overflow.
        unsigned int v = 0u - static_cast<unsigned int>(val);
        len = logarithm::numberDigits(v);
        WriteNumber(data + 1, len, v);
        data[0] = '-';
        return len + 1;
    }
}
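Since the challenge asks for a std::string result and correct handling of INT_MIN, the two steps described in this answer (size first, digits second) could be wrapped roughly like this. It is a minimal sketch with a hypothetical name, `to_std_string`, and it uses plain division instead of the template recursion above.

```cpp
#include <cstddef>
#include <string>

// Hypothetical wrapper: size the string first, then fill it back to front.
inline std::string to_std_string(int value) {
    const bool neg = value < 0;
    // Negate in unsigned arithmetic: 0u - u is well defined even when value
    // is INT_MIN, whereas -value on the signed int would overflow.
    unsigned int u = static_cast<unsigned int>(value);
    if (neg) u = 0u - u;

    // Step 1: count the characters needed (sign plus digits).
    std::size_t len = neg ? 1 : 0;
    unsigned int tmp = u;
    do {
        ++len;
        tmp /= 10u;
    } while (tmp != 0);

    // Step 2: every slot starts as '-', and the digits overwrite all but
    // the leading sign slot (if there is one).
    std::string out(len, '-');
    std::size_t pos = len;
    do {
        out[--pos] = static_cast<char>('0' + u % 10u);
        u /= 10u;
    } while (u != 0);
    return out;
}
```

Constructing the string at its final size avoids a second copy from a scratch buffer, at the cost of touching each character twice.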
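[Discussion]:
Interestingly, I recently gave a colleague a copy of Hacker's Delight. Any particular sections? Of course, note that although modulo and div are both returned from a single divide instruction, they won't be obtained that way, because division by a constant is implemented much faster using hardware multiply than divide.
@BenVoigt Actually, if you run 'disassemble' on VS2013, you get exactly the code you would expect after reading H's Delight. The chapter you are looking for is chapter 10.
Yes, that is the implementation using hardware multiply that I was referring to.
@BenVoigt Yes, of course, that is what I meant. Both modulo and multiply (by a constant) use the same magic number and shifts (arithmetic and normal). My assumption here was that the compiler is able to figure out that it is emitting the same instructions multiple times and optimize that. Since all operations can be vectorized, it might figure that out as well (let's call that a bonus :-). My point about H's Delight was that if you know how these operations are compiled (integer multiply, shift), you can make these assumptions.
[Answer 9]: I believe I have created the fastest integer-to-string algorithm. It is a variation of the Modulo 100 algorithm that is about 33% faster and, most importantly, is faster for both small and large numbers. It is called the Script ItoS Algorithm. To read the paper that explains how I engineered the algorithm, see https://github.com/kabuki-starship/kabuki-toolkit/wiki/Engineering-a-Faster-Integer-to-String-Algorithm. You may use the algorithm, but please consider contributing back to the Kabuki VM and check out Script, especially if you are interested in AMIL-NLP and/or software-defined networking protocols.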
/** Kabuki Toolkit
@version 0.x
@file ~/source/crabs/print_itos.cc
@author Cale McCollough <cale.mccollough@gmail.com>
@license Copyright (C) 2017-2018 Cale McCollough <calemccollough@gmail.com>;
All right reserved (R). Licensed under the Apache License, Version
2.0 (the "License"); you may not use this file except in
compliance with the License. You may obtain a copy of the License
[here](http://www.apache.org/licenses/LICENSE-2.0). Unless
required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied. See the License for the specific language governing
permissions and limitations under the License.
*/
#include <stdafx.h>
#include "print_itos.h"
#if MAJOR_SEAM >= 1 && MINOR_SEAM >= 1
#if MAJOR_SEAM == 1 && MINOR_SEAM == 1
#define DEBUG 1
#define PRINTF(format, ...) printf(format, __VA_ARGS__);
#define PUTCHAR(c) putchar(c);
#define PRINT_PRINTED\
sprintf_s (buffer, 24, "%u", value); *text_end = 0;\
printf ("\n Printed \"%s\" leaving value:\"%s\":%u",\
begin, buffer, (uint)strlen (buffer));
#define PRINT_BINARY PrintBinary (value);
#define PRINT_BINARY_TABLE PrintBinaryTable (value);
#else
#define PRINTF(x, ...)
#define PUTCHAR(c)
#define PRINT_PRINTED
#define PRINT_BINARY
#define PRINT_BINARY_TABLE
#endif
namespace _
void PrintLine (char c)
std::cout << '\n';
for (int i = 80; i > 0; --i)
std::cout << c;
char* Print (uint32_t value, char* text, char* text_end)
// Lookup table for powers of 10.
static const uint32_t k10ToThe[]
1, 10, 100, 1000, 10000, 100000, 1000000, 10000000, 100000000,
1000000000, ~(uint32_t)0 ;
/** Lookup table of ASCII char pairs for 00, 01, ..., 99.
To convert this algorithm to big-endian, flip the digit pair bytes. */
static const uint16_t kDigits00To99[100] =
0x3030, 0x3130, 0x3230, 0x3330, 0x3430, 0x3530, 0x3630, 0x3730, 0x3830,
0x3930, 0x3031, 0x3131, 0x3231, 0x3331, 0x3431, 0x3531, 0x3631, 0x3731,
0x3831, 0x3931, 0x3032, 0x3132, 0x3232, 0x3332, 0x3432, 0x3532, 0x3632,
0x3732, 0x3832, 0x3932, 0x3033, 0x3133, 0x3233, 0x3333, 0x3433, 0x3533,
0x3633, 0x3733, 0x3833, 0x3933, 0x3034, 0x3134, 0x3234, 0x3334, 0x3434,
0x3534, 0x3634, 0x3734, 0x3834, 0x3934, 0x3035, 0x3135, 0x3235, 0x3335,
0x3435, 0x3535, 0x3635, 0x3735, 0x3835, 0x3935, 0x3036, 0x3136, 0x3236,
0x3336, 0x3436, 0x3536, 0x3636, 0x3736, 0x3836, 0x3936, 0x3037, 0x3137,
0x3237, 0x3337, 0x3437, 0x3537, 0x3637, 0x3737, 0x3837, 0x3937, 0x3038,
0x3138, 0x3238, 0x3338, 0x3438, 0x3538, 0x3638, 0x3738, 0x3838, 0x3938,
0x3039, 0x3139, 0x3239, 0x3339, 0x3439, 0x3539, 0x3639, 0x3739, 0x3839,
0x3939, ;
static const char kMsbShift[] = 4, 7, 11, 14, 17, 21, 24, 27, 30, ;
if (!text)
return nullptr;
if (text >= text_end)
return nullptr;
uint16_t* text16;
char digit;
uint32_t scalar;
uint16_t digits1and2,
digits3and4,
digits5and6,
digits7and8;
uint32_t comparator;
#if MAJOR_SEAM == 1 && MINOR_SEAM == 1
// Write a bunches of xxxxxx to the buffer for debug purposes.
for (int i = 0; i <= 21; ++i)
*(text + i) = 'x';
*(text + 21) = 0;
char* begin = text;
char buffer[256];
#endif
if (value < 10)
PRINTF ("\n Range:[0, 9] length:1 ")
if (text + 1 >= text_end)
return nullptr;
*text++ = '0' + (char)value;
PRINT_PRINTED
return text;
if (value < 100)
PRINTF ("\n Range:[10, 99] length:2 ")
if (text + 2 >= text_end)
return nullptr;
*reinterpret_cast<uint16_t*> (text) = kDigits00To99[value];
PRINT_PRINTED
return text + 2;
if (value >> 14)
if (value >> 27)
if (value >> 30)
PRINTF ("\n Range:[1073741824, 4294967295] length:10")
Print10:
if (text + 10 >= text_end)
return nullptr;
comparator = 100000000;
digits1and2 = (uint16_t)(value / comparator);
PRINTF ("\n digits1and2:%u", digits1and2)
value -= digits1and2 * comparator;
*reinterpret_cast<uint16_t*> (text) = kDigits00To99[digits1and2];
PRINT_PRINTED
text += 2;
goto Print8;
else
comparator = 1000000000;
if (value >= comparator)
PRINTF ("\n Range:[100000000, 1073741823] length:10")
goto Print10;
PRINTF ("\n Range:[134217727, 999999999] length:9")
if (text + 9 >= text_end)
return nullptr;
comparator = 100000000;
digit = (char)(value / comparator);
*text++ = digit + '0';
PRINT_PRINTED
value -= comparator * digit;
goto Print8;
else if (value >> 24)
comparator = k10ToThe[8];
if (value >= comparator)
PRINTF ("\n Range:[100000000, 134217728] length:9")
if (text + 9 >= text_end)
return nullptr;
*text++ = '1';
PRINT_PRINTED
value -= comparator;
PRINTF ("\n Range:[16777216, 9999999] length:8")
if (text + 8 >= text_end)
return nullptr;
Print8:
PRINTF ("\n Print8:")
scalar = 10000;
digits5and6 = (uint16_t)(value / scalar);
digits1and2 = value - scalar * digits5and6;
digits7and8 = digits5and6 / 100;
digits3and4 = digits1and2 / 100;
digits5and6 -= 100 * digits7and8;
digits1and2 -= 100 * digits3and4;
*reinterpret_cast<uint16_t*> (text + 6) =
kDigits00To99[digits1and2];
PRINT_PRINTED
*reinterpret_cast<uint16_t*> (text + 4) =
kDigits00To99[digits3and4];
PRINT_PRINTED
*reinterpret_cast<uint16_t*> (text + 2) =
kDigits00To99[digits5and6];
PRINT_PRINTED
*reinterpret_cast<uint16_t*> (text) =
kDigits00To99[digits7and8];
PRINT_PRINTED
return text + 8;
else if (value >> 20)
comparator = 10000000;
if (value >= comparator)
PRINTF ("\n Range:[10000000, 16777215] length:8")
if (text + 8 >= text_end)
return nullptr;
*text++ = '1';
PRINT_PRINTED
value -= comparator;
else
PRINTF ("\n Range:[1048576, 9999999] length:7")
if (text + 7 >= text_end)
return nullptr;
scalar = 10000;
digits5and6 = (uint16_t)(value / scalar);
digits1and2 = value - scalar * digits5and6;
digits7and8 = digits5and6 / 100;
digits3and4 = digits1and2 / 100;
digits5and6 -= 100 * digits7and8;
digits1and2 -= 100 * digits3and4;;
*reinterpret_cast<uint16_t*> (text + 5) =
kDigits00To99[digits1and2];
PRINT_PRINTED
*reinterpret_cast<uint16_t*> (text + 3) =
kDigits00To99[digits3and4];
PRINT_PRINTED
*reinterpret_cast<uint16_t*> (text + 1) =
kDigits00To99[digits5and6];
PRINT_PRINTED
*text = (char)digits7and8 + '0';
return text + 7;
else if (value >> 17)
comparator = 1000000;
if (value >= comparator)
PRINTF ("\n Range:[100000, 1048575] length:7")
if (text + 7 >= text_end)
return nullptr;
*text++ = '1';
PRINT_PRINTED
value -= comparator;
else
PRINTF ("\n Range:[131072, 999999] length:6")
if (text + 6 >= text_end)
return nullptr;
Print6:
scalar = 10000;
digits5and6 = (uint16_t)(value / scalar);
digits1and2 = value - scalar * digits5and6;
digits7and8 = digits5and6 / 100;
digits3and4 = digits1and2 / 100;
digits5and6 -= 100 * digits7and8;
digits1and2 -= 100 * digits3and4;
text16 = reinterpret_cast<uint16_t*> (text + 6);
*reinterpret_cast<uint16_t*> (text + 4) = kDigits00To99[digits1and2];
PRINT_PRINTED
*reinterpret_cast<uint16_t*> (text + 2) = kDigits00To99[digits3and4];
PRINT_PRINTED
*reinterpret_cast<uint16_t*> (text ) = kDigits00To99[digits5and6];
PRINT_PRINTED
return text + 6;
else // (value >> 14)
if (value >= 100000)
PRINTF ("\n Range:[65536, 131071] length:6")
goto Print6;
PRINTF ("\n Range:[10000, 65535] length:5")
if (text + 5 >= text_end)
return nullptr;
digits5and6 = 10000;
digit = (uint8_t)(value / digits5and6);
value -= digits5and6 * digit;
*text = digit + '0';
PRINT_PRINTED
digits1and2 = (uint16_t)value;
digits5and6 = 100;
digits3and4 = digits1and2 / digits5and6;
digits1and2 -= digits3and4 * digits5and6;
*reinterpret_cast<uint16_t*> (text + 1) =
kDigits00To99[digits3and4];
PRINT_PRINTED
PRINTF ("\n digits1and2:%u", digits1and2)
*reinterpret_cast<uint16_t*> (text + 3) =
kDigits00To99[digits1and2];
PRINT_PRINTED
return text + 5;
digits1and2 = (uint16_t)value;
if (value >> 10)
digits5and6 = 10000;
if (digits1and2 >= digits5and6)
if (text + 5 >= text_end)
return nullptr;
PRINTF ("\n Range:[10000, 16383] length:5")
*text++ = '1';
PRINT_PRINTED
digits1and2 -= digits5and6;
else
PRINTF ("\n Range:[1024, 9999] length:4")
if (text + 4 >= text_end)
return nullptr;
digits5and6 = 100;
digits3and4 = digits1and2 / digits5and6;
digits1and2 -= digits3and4 * digits5and6;
*reinterpret_cast<uint16_t*> (text ) = kDigits00To99[digits3and4];
PRINT_PRINTED
*reinterpret_cast<uint16_t*> (text + 2) = kDigits00To99[digits1and2];
PRINT_PRINTED
return text + 4;
else
if (text + 4 >= text_end)
return nullptr;
digits3and4 = 1000;
if (digits1and2 >= digits3and4)
PRINTF ("\n Range:[1000, 1023] length:4")
digits1and2 -= digits3and4;
text16 = reinterpret_cast<uint16_t*> (text + 2);
*text16-- = kDigits00To99[digits1and2];
PRINT_PRINTED
*text16 = (((uint16_t)'1') | (((uint16_t)'0') << 8));
PRINT_PRINTED
return text + 4;
PRINTF ("\n Range:[100, 999] length:3")
digits1and2 = (uint16_t)value;
digits3and4 = 100;
digit = (char)(digits1and2 / digits3and4);
digits1and2 -= digit * digits3and4;
*text = digit + '0';
PRINT_PRINTED
*reinterpret_cast<uint16_t*> (text + 1) = kDigits00To99[digits1and2];
PRINT_PRINTED
return text + 3;
//< namespace _
#undef PRINTF
#undef PRINT_PRINTED
#endif //< MAJOR_SEAM >= 1 && MINOR_SEAM >= 1
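Author
Cale McCollough
[Discussion]:
FYI: By posting this on Stack Overflow, you have irrevocably published it under CC BY-SA 3.0 (per the Stack Exchange terms of use). Your statement that it is published under GPL 3 constitutes an additional license that a user may optionally use as an alternative to CC BY-SA 3.0. Which license to use is at the discretion of the user copying the code. If this is an issue for you, I suggest you get competent legal advice. (IANAL) Note that there is nothing inherently wrong with this, but I thought it should be brought to your attention.
Very nice. However, it needs to return a std::string for the comparison with the other methods listed here to be valid. At first I could not figure out why the binary search tree uses the shift operator, since a comparison is already very fast, but now I realize it would be useful for precomputing the shifted value if you needed it. You don't use it, though. On the other hand, you don't end up with large literals encoded inside the instructions, so maybe that is reason enough by itself.
I forgot to do that. It is just another wrapper function. All my stuff is Apache licensed, but I thought I would try GNU, but yeah... that doesn't make any sense.
Okay, I changed the license back and added the string functions. Script is a socket-based family of languages for distributed computing, to do my IGEEK on supercomputers with the Chinese Room. My string class is a ring buffer. :-)-+=
I just updated the algorithm and significantly improved the performance for larger numbers.
[Answer 10]: A modification of user434507's solution. Modified to use a character array instead of a C++ string. Runs a bit faster. I also moved the check for 0 lower in the code, since it never happens in my particular case. Move it back up if zero is more common in your case.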
// Int2Str.cpp : Defines the entry point for the console application.
//
#include <stdio.h>
#include <iostream>
#include "StopWatch.h"
using namespace std;
const char digit_pairs[201] =
"00010203040506070809"
"10111213141516171819"
"20212223242526272829"
"30313233343536373839"
"40414243444546474849"
"50515253545556575859"
"60616263646566676869"
"70717273747576777879"
"80818283848586878889"
"90919293949596979899"
;
void itostr(int n, char* c)
int sign = -(n<0);
unsigned int val = (n^sign)-sign;
int size;
if(val>=10000)
if(val>=10000000)
if(val>=1000000000)
size=10;
else if(val>=100000000)
size=9;
else size=8;
else
if(val>=1000000)
size=7;
else if(val>=100000)
size=6;
else size=5;
else
if(val>=100)
if(val>=1000)
size=4;
else size=3;
else
if(val>=10)
size=2;
else if(n==0)
c[0]='0';
c[1] = '\0';
return;
else size=1;
size -= sign;
if(sign)
*c='-';
c += size-1;
while(val>=100)
int pos = val % 100;
val /= 100;
*(short*)(c-1)=*(short*)(digit_pairs+2*pos);
c-=2;
while(val>0)
*c--='0' + (val % 10);
val /= 10;
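// For negative n (sign == -1) c stops on the '-' itself rather than one
// byte before the output, so the terminator offset is one smaller.
c[size+1+sign] = '\0';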
void itostr(unsigned val, char* c)
int size;
if(val>=10000)
if(val>=10000000)
if(val>=1000000000)
size=10;
else if(val>=100000000)
size=9;
else
size=8;
else
if(val>=1000000)
size=7;
else if(val>=100000)
size=6;
else
size=5;
else
if(val>=100)
if(val>=1000)
size=4;
else
size=3;
else
if(val>=10)
size=2;
else if (val==0)
c[0]='0';
c[1] = '\0';
return;
else
size=1;
c += size-1;
while(val>=100)
int pos = val % 100;
val /= 100;
*(short*)(c-1)=*(short*)(digit_pairs+2*pos);
c-=2;
while(val>0)
*c--='0' + (val % 10);
val /= 10;
c[size+1] = '\0';
void test()
bool foundmismatch = false;
char str[16];
char compare[16];
for(int i = -1000000; i < 1000000; i++)
int random = rand();
itostr(random, str);
itoa(random, compare, 10);
if(strcmp(str, compare) != 0)
cout << "Mismatch found: " << endl;
cout << "Generated: " << str << endl;
cout << "Reference: " << compare << endl;
foundmismatch = true;
if(!foundmismatch)
cout << "No mismatch found!" << endl;
cin.get();
void benchmark()
StopWatch stopwatch;
stopwatch.setup("Timer");
stopwatch.reset();
stopwatch.start();
char str[16];
for(unsigned int i = 0; i < 2000000; i++)
itostr(i, str);
stopwatch.stop();
cin.get();
int main( int argc, const char* argv[])
benchmark();
[Discussion]:
I tested it from 0x80000000 to 0x7FFFFFFF and already got invalid values at -999999999 (I stopped after a few mismatches).
Mismatch found: Generated: -9999999990 Reference: -999999999
Mismatch found: Generated: -9999999980 Reference: -999999998
Mismatch found: Generated: -9999999970 Reference: -999999997
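Output such as `-9999999990` has a stale character sitting where the terminator should be. A sketch that makes the terminator position hard to get wrong builds the digits backwards in a scratch buffer and copies them out in one go. The names `kDigitPairs` and `itostr_sketch` are hypothetical, a 32-bit int is assumed, and `memcpy` replaces the `short*` casts to avoid alignment and aliasing concerns.

```cpp
#include <cstddef>
#include <cstring>

// Same layout as the digit_pairs table: "00", "01", ..., "99".
static const char kDigitPairs[201] =
    "00010203040506070809"
    "10111213141516171819"
    "20212223242526272829"
    "30313233343536373839"
    "40414243444546474849"
    "50515253545556575859"
    "60616263646566676869"
    "70717273747576777879"
    "80818283848586878889"
    "90919293949596979899";

// Writes the decimal form of n plus a terminating '\0' into out (at least
// 12 bytes) and returns the length without the terminator.
inline int itostr_sketch(int n, char* out) {
    char tmp[11];                           // sign + 10 digits
    char* const end = tmp + sizeof tmp;
    char* p = end;
    unsigned int val = static_cast<unsigned int>(n);
    if (n < 0) val = 0u - val;              // well defined for INT_MIN too

    while (val >= 100u) {                   // two digits per iteration
        unsigned int pos = val % 100u;
        val /= 100u;
        p -= 2;
        std::memcpy(p, kDigitPairs + 2 * pos, 2);
    }
    if (val >= 10u) {                       // two leading digits remain
        p -= 2;
        std::memcpy(p, kDigitPairs + 2 * val, 2);
    } else {                                // one leading digit, including n == 0
        *--p = static_cast<char>('0' + val);
    }
    if (n < 0) *--p = '-';

    std::size_t len = static_cast<std::size_t>(end - p);
    std::memcpy(out, p, len);
    out[len] = '\0';                        // terminator directly after the digits
    return static_cast<int>(len);
}
```

The extra copy costs a little compared with writing in place, so this trades some speed for a single, obvious place where the length is computed.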
[Answer 11]:
We use the following code (for MSVC):
Templated tBitScanReverse:
#include <intrin.h>
namespace intrin
#pragma intrinsic(_BitScanReverse)
#pragma intrinsic(_BitScanReverse64)
template<typename TIntegerValue>
__forceinline auto tBitScanReverse(DWORD * out_index, TIntegerValue mask)
-> std::enable_if_t<(std::is_integral<TIntegerValue>::value && sizeof(TIntegerValue) == 4), unsigned char>
return _BitScanReverse(out_index, mask);
template<typename TIntegerValue>
__forceinline auto tBitScanReverse(DWORD * out_index, TIntegerValue mask)
-> std::enable_if_t<(std::is_integral<TIntegerValue>::value && sizeof(TIntegerValue) == 8), unsigned char>
#if !(_M_IA64 || _M_AMD64)
auto res = _BitScanReverse(out_index, (unsigned long)(mask >> 32));
if (res)
*out_index += 32; // adjust the stored bit index, not the pointer
return res;
return _BitScanReverse(out_index, (unsigned long)mask);
#else
return _BitScanReverse64(out_index, mask);
#endif
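The 32-bit fallback path above can be illustrated without the MSVC intrinsics. This is a sketch under assumptions: `bit_scan_reverse32` and `bit_scan_reverse64` are hypothetical stand-ins that mimic the documented contract of `_BitScanReverse` (return 0 for a zero mask, otherwise store the index of the highest set bit), implemented with a plain loop.

```cpp
#include <cstdint>

// Portable stand-in for _BitScanReverse: stores the index of the highest set
// bit of mask in *index and returns 1, or returns 0 when mask is zero.
inline unsigned char bit_scan_reverse32(unsigned long* index, std::uint32_t mask) {
    if (mask == 0) return 0;
    unsigned long i = 0;
    while (mask >>= 1) ++i;
    *index = i;
    return 1;
}

// 64-bit scan built from two 32-bit scans, as on a 32-bit target.
inline unsigned char bit_scan_reverse64(unsigned long* index, std::uint64_t mask) {
    // Try the upper half first; a hit there is offset by 32 bit positions.
    if (bit_scan_reverse32(index, static_cast<std::uint32_t>(mask >> 32))) {
        *index += 32;
        return 1;
    }
    return bit_scan_reverse32(index, static_cast<std::uint32_t>(mask));
}
```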
char/wchar_t helpers:
template<typename TChar> inline constexpr TChar ascii_0();
template<> inline constexpr char ascii_0() return '0';
template<> inline constexpr wchar_t ascii_0() return L'0';
template<typename TChar, typename TInt> inline constexpr TChar ascii_DEC(TInt d) return (TChar)(ascii_0<TChar>() + d);
Powers of 10 tables:
static uint32 uint32_powers10[] =
1,
10,
100,
1000,
10000,
100000,
1000000,
10000000,
100000000,
1000000000
// 123456789
;
static uint64 uint64_powers10[] =
1ULL,
10ULL,
100ULL,
1000ULL,
10000ULL,
100000ULL,
1000000ULL,
10000000ULL,
100000000ULL,
1000000000ULL,
10000000000ULL,
100000000000ULL,
1000000000000ULL,
10000000000000ULL,
100000000000000ULL,
1000000000000000ULL,
10000000000000000ULL,
100000000000000000ULL,
1000000000000000000ULL,
10000000000000000000ULL
// 1234567890123456789
;
template<typename TUint> inline constexpr const TUint * powers10();
template<> inline constexpr const uint32 * powers10() return uint32_powers10;
template<> inline constexpr const uint64 * powers10() return uint64_powers10;
The actual printing:
template<typename TChar, typename TUInt>
__forceinline auto
print_dec(
TUInt u,
TChar * & buffer) -> typename std::enable_if_t<std::is_unsigned<TUInt>::value>
if (u < 10) // 1-digit, including 0
*buffer++ = ascii_DEC<TChar>(u);
else
DWORD log2u;
intrin::tBitScanReverse(&log2u, u); // log2u [3,31] (u >= 10)
DWORD log10u = ((log2u + 1) * 77) >> 8; // log10u [1,9] 77/256 = ln(2) / ln(10)
DWORD digits = log10u + (u >= powers10<TUInt>()[log10u]); // digits [2,10]
buffer += digits;
auto p = buffer;
for (--digits; digits; --digits)
auto x = u / 10, d = u - x * 10;
*--p = ascii_DEC<TChar>(d);
u = x;
*--p = ascii_DEC<TChar>(u);
The final loop can be unrolled:
switch (digits)
case 10: auto x = u / 10, d = u - x * 10; *--p = ascii_DEC<TChar>(d); u = x;
case 9: auto x = u / 10, d = u - x * 10; *--p = ascii_DEC<TChar>(d); u = x;
case 8: auto x = u / 10, d = u - x * 10; *--p = ascii_DEC<TChar>(d); u = x;
case 7: auto x = u / 10, d = u - x * 10; *--p = ascii_DEC<TChar>(d); u = x;
case 6: auto x = u / 10, d = u - x * 10; *--p = ascii_DEC<TChar>(d); u = x;
case 5: auto x = u / 10, d = u - x * 10; *--p = ascii_DEC<TChar>(d); u = x;
case 4: auto x = u / 10, d = u - x * 10; *--p = ascii_DEC<TChar>(d); u = x;
case 3: auto x = u / 10, d = u - x * 10; *--p = ascii_DEC<TChar>(d); u = x;
case 2: auto x = u / 10, d = u - x * 10; *--p = ascii_DEC<TChar>(d); u = x; *--p = ascii_DEC<TChar>(u); break;
default: __assume(0);
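The main idea is the same as the one @atlaste suggested earlier: https://***.com/a/29039967/2204001
[Discussion]:
[Answer 12]: I only found this because of recent activity. I don't really have time to add benchmarks, but I wanted to add what I wrote in the past for when I need a fast integer-to-string conversion...
https://github.com/CarloWood/ai-utils/blob/master/itoa.h
https://github.com/CarloWood/ai-utils/blob/master/itoa.cxx
The trick used here is that the user must provide a std::array that is large enough (on their stack). The code writes the string into it backwards, starting at the units, and then returns a pointer into the array, offset to where the result actually starts.
Hence this does not allocate or move memory, but it still requires a division and a modulo per resulting digit (which I believe to be fast enough, since that is just code running inside the CPU; memory access is usually the problem, imho).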
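[Discussion]:
[Answer 13]: Why is nobody using the div function from the standard library when both the quotient and the remainder are needed? Starting from Timo's source code, I ended up with something like this: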
if(val >= 0)
div_t d2 = div(val,100);
while(d2.quot)
COPYPAIR(it,2 * d2.rem);
it-=2;
d2 = div(d2.quot,100);
COPYPAIR(it,2*d2.rem);
if(d2.quot<10)
it++;
else
div_t d2 = div(val,100);
while(d2.quot)
COPYPAIR(it,-2 * d2.rem);
it-=2;
d2 = div(d2.quot,100);
COPYPAIR(it,-2*d2.rem);
if(d2.quot<=-10)
it--;
*it = '-';
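OK, the div function cannot be used for unsigned ints, but unsigned values can be handled separately. I defined the COPYPAIR macro as follows to test variants of how to copy the 2 characters from digit_pairs (I found no obvious advantage in any of these methods):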
#define COPYPAIR0(_p,_i) memcpy((_p), &digit_pairs[(_i)], 2);
#define COPYPAIR1(_p,_i) (_p)[0] = digit_pairs[(_i)]; (_p)[1] = digit_pairs[(_i)+1];
#define COPYPAIR2(_p,_i) unsigned short * d = (unsigned short *)(_p); unsigned short * s = (unsigned short *)&digit_pairs[(_i)]; *d = *s;
#define COPYPAIR COPYPAIR2
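The negated remainders in the negative branch follow from how `std::div` rounds: it truncates toward zero, so the remainder takes the sign of the dividend. A minimal sketch of that behaviour, one digit at a time rather than in pairs and with a hypothetical name `to_string_div`, assuming a 32-bit int:

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper: converts with std::div and never negates the input,
// so INT_MIN needs no special case.
inline std::string to_string_div(int value) {
    char buf[12];                          // sign + 10 digits, with a spare byte
    char* end = buf + sizeof buf;
    char* p = end;
    const bool neg = value < 0;
    std::div_t d = std::div(value, 10);
    for (;;) {
        // For negative input d.rem lies in [-9, 0]; flip its sign to get the digit.
        int digit = neg ? -d.rem : d.rem;
        *--p = static_cast<char>('0' + digit);
        if (d.quot == 0) break;
        d = std::div(d.quot, 10);
    }
    if (neg) *--p = '-';
    return std::string(p, end);
}
```

Whether `std::div` is inlined into a multiply-based sequence or remains a library call depends on the compiler, which may explain differing speed reports.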
[Discussion]:
Because this challenge is about speed, not the fewest lines of code. PS: For those who want to use this in my solution: (1) it is much slower, and (2) because div works on signed integers, it breaks abs(INT32_MIN).