std::vector performance regression when enabling C++11

Posted 2023-02-14


【Title】std::vector performance regression when enabling C++11 【Asked】2014-01-25 11:54:47 【Question】:

I have found an interesting performance regression in a small C++ snippet when I enable C++11:

#include <cstddef>
#include <vector>

struct Item
{
  int a;
  int b;
};

int main()
{
  const std::size_t num_items = 10000000;
  std::vector<Item> container;
  container.reserve(num_items);
  for (std::size_t i = 0; i < num_items; ++i) {
    container.push_back(Item());
  }
  return 0;
}
Using g++ (GCC) 4.8.2 20131219 (prerelease) and C++03 I get:

milian:/tmp$ g++ -O3 main.cpp && perf stat -r 10 ./a.out

Performance counter stats for './a.out' (10 runs):

        35.206824 task-clock                #    0.988 CPUs utilized            ( +-  1.23% )
                4 context-switches          #    0.116 K/sec                    ( +-  4.38% )
                0 cpu-migrations            #    0.006 K/sec                    ( +- 66.67% )
              849 page-faults               #    0.024 M/sec                    ( +-  6.02% )
       95,693,808 cycles                    #    2.718 GHz                      ( +-  1.14% ) [49.72%]
  <not supported> stalled-cycles-frontend 
  <not supported> stalled-cycles-backend  
       95,282,359 instructions              #    1.00  insns per cycle          ( +-  0.65% ) [75.27%]
       30,104,021 branches                  #  855.062 M/sec                    ( +-  0.87% ) [77.46%]
            6,038 branch-misses             #    0.02% of all branches          ( +- 25.73% ) [75.53%]

      0.035648729 seconds time elapsed                                          ( +-  1.22% )

With C++11 enabled, on the other hand, the performance degrades significantly:

milian:/tmp$ g++ -std=c++11 -O3 main.cpp && perf stat -r 10 ./a.out

Performance counter stats for './a.out' (10 runs):

        86.485313 task-clock                #    0.994 CPUs utilized            ( +-  0.50% )
                9 context-switches          #    0.104 K/sec                    ( +-  1.66% )
                2 cpu-migrations            #    0.017 K/sec                    ( +- 26.76% )
              798 page-faults               #    0.009 M/sec                    ( +-  8.54% )
      237,982,690 cycles                    #    2.752 GHz                      ( +-  0.41% ) [51.32%]
  <not supported> stalled-cycles-frontend 
  <not supported> stalled-cycles-backend  
      135,730,319 instructions              #    0.57  insns per cycle          ( +-  0.32% ) [75.77%]
       30,880,156 branches                  #  357.057 M/sec                    ( +-  0.25% ) [75.76%]
            4,188 branch-misses             #    0.01% of all branches          ( +-  7.59% ) [74.08%]

    0.087016724 seconds time elapsed                                          ( +-  0.50% )

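Can someone explain this? So far my experience has been that the STL gets faster when C++11 is enabled, especially thanks to move semantics.

Edit: As suggested, using container.emplace_back(); instead brings the performance on par with the C++03 version. How can the C++03 version achieve the same for push_back?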

milian:/tmp$ g++ -std=c++11 -O3 main.cpp && perf stat -r 10 ./a.out

Performance counter stats for './a.out' (10 runs):

        36.229348 task-clock                #    0.988 CPUs utilized            ( +-  0.81% )
                4 context-switches          #    0.116 K/sec                    ( +-  3.17% )
                1 cpu-migrations            #    0.017 K/sec                    ( +- 36.85% )
              798 page-faults               #    0.022 M/sec                    ( +-  8.54% )
       94,488,818 cycles                    #    2.608 GHz                      ( +-  1.11% ) [50.44%]
  <not supported> stalled-cycles-frontend 
  <not supported> stalled-cycles-backend  
       94,851,411 instructions              #    1.00  insns per cycle          ( +-  0.98% ) [75.22%]
       30,468,562 branches                  #  840.991 M/sec                    ( +-  1.07% ) [76.71%]
            2,723 branch-misses             #    0.01% of all branches          ( +-  9.84% ) [74.81%]

   0.036678068 seconds time elapsed                                          ( +-  0.80% )
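To make the two variants from the edit easy to compare side by side, here is a minimal sketch. The function names fill_push_back, fill_emplace_back and the time_ms helper are hypothetical and not part of the original post, and the std::chrono timing is only a rough stand-in for the perf stat measurements shown above.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

struct Item {
  int a;
  int b;
};

// Variant from the question: build a temporary Item and push it.
std::vector<Item> fill_push_back(std::size_t n) {
  std::vector<Item> v;
  v.reserve(n);
  for (std::size_t i = 0; i < n; ++i) {
    v.push_back(Item());
  }
  return v;
}

// Variant from the edit: value-initialize each Item in place (C++11).
std::vector<Item> fill_emplace_back(std::size_t n) {
  std::vector<Item> v;
  v.reserve(n);
  for (std::size_t i = 0; i < n; ++i) {
    v.emplace_back();
  }
  return v;
}

// Hypothetical helper: wall-clock milliseconds for one call of f(n).
template <typename F>
double time_ms(F f, std::size_t n) {
  auto start = std::chrono::steady_clock::now();
  std::vector<Item> v = f(n);
  auto stop = std::chrono::steady_clock::now();
  volatile std::size_t sink = v.size();  // keep the result observable
  (void)sink;
  return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Calling time_ms(fill_push_back, 10000000) and time_ms(fill_emplace_back, 10000000) from a small main gives a quick comparison. Both variants produce the same container contents, so any timing difference comes from the generated code, and the numbers will depend on compiler version and flags.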

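【Comments】:

If you compile to assembly, you can see what is going on under the hood. See also ***.com/questions/8021874/…

What happens if you change push_back(Item()) to emplace_back() in the C++11 version?

See above, that "fixes" the regression. I still wonder why push_back regresses in performance between C++03 and C++11, though.

@milianw It turns out I compiled the wrong program. Ignore my comments.

With clang3.4 the C++11 version is faster: 0.047s vs. 0.058s for the C++98 version.

【Answer 1】: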

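I can reproduce your results on my machine with the options you wrote in your post.

However, if I also enable link time optimization (I also pass the -flto flag to gcc 4.7.2), the results are identical:

(I am compiling your original code, with container.push_back(Item());)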

$ g++ -std=c++11 -O3 -flto regr.cpp && perf stat -r 10 ./a.out 

 Performance counter stats for './a.out' (10 runs):

         35.426793 task-clock                #    0.986 CPUs utilized            ( +-  1.75% )
                 4 context-switches          #    0.116 K/sec                    ( +-  5.69% )
                 0 CPU-migrations            #    0.006 K/sec                    ( +- 66.67% )
            19,801 page-faults               #    0.559 M/sec                  
        99,028,466 cycles                    #    2.795 GHz                      ( +-  1.89% ) [77.53%]
        50,721,061 stalled-cycles-frontend   #   51.22% frontend cycles idle     ( +-  3.74% ) [79.47%]
        25,585,331 stalled-cycles-backend    #   25.84% backend  cycles idle     ( +-  4.90% ) [73.07%]
       141,947,224 instructions              #    1.43  insns per cycle        
                                             #    0.36  stalled cycles per insn  ( +-  0.52% ) [88.72%]
        37,697,368 branches                  # 1064.092 M/sec                    ( +-  0.52% ) [88.75%]
            26,700 branch-misses             #    0.07% of all branches          ( +-  3.91% ) [83.64%]

       0.035943226 seconds time elapsed                                          ( +-  1.79% )



$ g++ -std=c++98 -O3 -flto regr.cpp && perf stat -r 10 ./a.out 

 Performance counter stats for './a.out' (10 runs):

         35.510495 task-clock                #    0.988 CPUs utilized            ( +-  2.54% )
                 4 context-switches          #    0.101 K/sec                    ( +-  7.41% )
                 0 CPU-migrations            #    0.003 K/sec                    ( +-100.00% )
            19,801 page-faults               #    0.558 M/sec                    ( +-  0.00% )
        98,463,570 cycles                    #    2.773 GHz                      ( +-  1.09% ) [77.71%]
        50,079,978 stalled-cycles-frontend   #   50.86% frontend cycles idle     ( +-  2.20% ) [79.41%]
        26,270,699 stalled-cycles-backend    #   26.68% backend  cycles idle     ( +-  8.91% ) [74.43%]
       141,427,211 instructions              #    1.44  insns per cycle        
                                             #    0.35  stalled cycles per insn  ( +-  0.23% ) [87.66%]
        37,366,375 branches                  # 1052.263 M/sec                    ( +-  0.48% ) [88.61%]
            26,621 branch-misses             #    0.07% of all branches          ( +-  5.28% ) [83.26%]

       0.035953916 seconds time elapsed

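As for the reasons, one needs to look at the generated assembly code (g++ -std=c++11 -O3 -S regr.cpp). In C++11 mode the generated code is significantly more cluttered than in C++98 mode, and inlining the function void std::vector<Item,std::allocator<Item>>::_M_emplace_back_aux<Item>(Item&&) fails in C++11 mode with the default inline-limit.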

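This failed inlining has a domino effect. Not because this function is being called (it is not even called!) but because we have to be prepared: if it is called, the function arguments (Item.a and Item.b) must already be in the right place. This leads to pretty messy code.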

Here is the relevant part of the generated code for the case where inlining succeeds:

.L42:
    testq   %rbx, %rbx  # container$D13376$_M_impl$_M_finish
    je  .L3 #,
    movl    $0, (%rbx)  #, container$D13376$_M_impl$_M_finish_136->a
    movl    $0, 4(%rbx) #, container$D13376$_M_impl$_M_finish_136->b
.L3:
    addq    $8, %rbx    #, container$D13376$_M_impl$_M_finish
    subq    $1, %rbp    #, ivtmp.106
    je  .L41    #,
.L14:
    cmpq    %rbx, %rdx  # container$D13376$_M_impl$_M_finish, container$D13376$_M_impl$_M_end_of_storage
    jne .L42    #,

This is a nice and compact for loop. Now, let's compare this to the case where inlining fails:

.L49:
    testq   %rax, %rax  # D.15772
    je  .L26    #,
    movq    16(%rsp), %rdx  # D.13379, D.13379
    movq    %rdx, (%rax)    # D.13379, *D.15772_60
.L26:
    addq    $8, %rax    #, tmp75
    subq    $1, %rbx    #, ivtmp.117
    movq    %rax, 40(%rsp)  # tmp75, container.D.13376._M_impl._M_finish
    je  .L48    #,
.L28:
    movq    40(%rsp), %rax  # container.D.13376._M_impl._M_finish, D.15772
    cmpq    48(%rsp), %rax  # container.D.13376._M_impl._M_end_of_storage, D.15772
    movl    $0, 16(%rsp)    #, D.13379.a
    movl    $0, 20(%rsp)    #, D.13379.b
    jne .L49    #,
    leaq    16(%rsp), %rsi  #,
    leaq    32(%rsp), %rdi  #,
    call    _ZNSt6vectorI4ItemSaIS0_EE19_M_emplace_back_auxIIS0_EEEvDpOT_   #

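This code is cluttered, and there is a lot more going on in the loop than in the previous case. Before the function call (the last line shown), the arguments must be placed appropriately: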

leaq    16(%rsp), %rsi  #,
leaq    32(%rsp), %rdi  #,
call    _ZNSt6vectorI4ItemSaIS0_EE19_M_emplace_back_auxIIS0_EEEvDpOT_   #

Even though this is never actually executed, the loop arranges things beforehand:

movl    $0, 16(%rsp)    #, D.13379.a
movl    $0, 20(%rsp)    #, D.13379.b

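This leads to the messy code. If there is no function call because inlining succeeded, we have only 2 move instructions in the loop and there is no messing with %rsp (the stack pointer). However, if the inlining fails, we get 6 moves and we mess with %rsp a lot.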
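The fast-path/slow-path shape discussed above can be sketched with a toy container. This is only an illustration under stated assumptions: TinyVec, push_back, grow_and_append, fill and release are hypothetical names, not libstdc++ internals, and __attribute__((noinline)) is a GCC/Clang extension used here to imitate the failed inlining. Because the slow path takes its argument by reference, the temporary Item has to live in memory at the call site, which is roughly what the stores to 16(%rsp) and 20(%rsp) in the listing above prepare for.

```cpp
#include <cstddef>
#include <cstdlib>
#include <new>

struct Item {
  int a;
  int b;
};

// Hypothetical toy container with the same three pointers as the
// _M_impl members seen in the assembly comments.
struct TinyVec {
  Item* begin = nullptr;
  Item* finish = nullptr;
  Item* end_of_storage = nullptr;
};

// Slow path: grow the buffer, then append. Deliberately kept out of
// line (GCC/Clang attribute) to mimic the call that was not inlined.
__attribute__((noinline)) void grow_and_append(TinyVec& v, const Item& x) {
  std::size_t size = static_cast<std::size_t>(v.finish - v.begin);
  std::size_t cap = size ? 2 * size : 1;
  // realloc is acceptable here only because Item is trivially copyable.
  Item* p = static_cast<Item*>(std::realloc(v.begin, cap * sizeof(Item)));
  if (!p) std::abort();
  v.begin = p;
  v.finish = p + size;
  v.end_of_storage = p + cap;
  ::new (static_cast<void*>(v.finish)) Item(x);
  ++v.finish;
}

// Fast path: one compare and one in-place construction when there is
// spare capacity; otherwise fall back to the out-of-line slow path.
inline void push_back(TinyVec& v, const Item& x) {
  if (v.finish != v.end_of_storage) {
    ::new (static_cast<void*>(v.finish)) Item(x);
    ++v.finish;
  } else {
    grow_and_append(v, x);
  }
}

// Appends n value-initialized Items and returns the resulting size.
std::size_t fill(TinyVec& v, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) {
    push_back(v, Item());
  }
  return static_cast<std::size_t>(v.finish - v.begin);
}

// Frees the buffer and resets the container to empty.
void release(TinyVec& v) {
  std::free(v.begin);
  v.begin = v.finish = v.end_of_storage = nullptr;
}
```

Compiling this with -O3 -S, with and without the noinline attribute, is one way to look for a similar difference in the loop body, though the exact output will vary by compiler version.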

Just to confirm my theory (note the -finline-limit), both in C++11 mode:

 $ g++ -std=c++11 -O3 -finline-limit=105 regr.cpp && perf stat -r 10 ./a.out

 Performance counter stats for './a.out' (10 runs):

         84.739057 task-clock                #    0.993 CPUs utilized            ( +-  1.34% )
                 8 context-switches          #    0.096 K/sec                    ( +-  2.22% )
                 1 CPU-migrations            #    0.009 K/sec                    ( +- 64.01% )
            19,801 page-faults               #    0.234 M/sec                  
       266,809,312 cycles                    #    3.149 GHz                      ( +-  0.58% ) [81.20%]
       206,804,948 stalled-cycles-frontend   #   77.51% frontend cycles idle     ( +-  0.91% ) [81.25%]
       129,078,683 stalled-cycles-backend    #   48.38% backend  cycles idle     ( +-  1.37% ) [69.49%]
       183,130,306 instructions              #    0.69  insns per cycle        
                                             #    1.13  stalled cycles per insn  ( +-  0.85% ) [85.35%]
        38,759,720 branches                  #  457.401 M/sec                    ( +-  0.29% ) [85.43%]
            24,527 branch-misses             #    0.06% of all branches          ( +-  2.66% ) [83.52%]

       0.085359326 seconds time elapsed                                          ( +-  1.31% )

 $ g++ -std=c++11 -O3 -finline-limit=106 regr.cpp && perf stat -r 10 ./a.out

 Performance counter stats for './a.out' (10 runs):

         37.790325 task-clock                #    0.990 CPUs utilized            ( +-  2.06% )
                 4 context-switches          #    0.098 K/sec                    ( +-  5.77% )
                 0 CPU-migrations            #    0.011 K/sec                    ( +- 55.28% )
            19,801 page-faults               #    0.524 M/sec                  
       104,699,973 cycles                    #    2.771 GHz                      ( +-  2.04% ) [78.91%]
        58,023,151 stalled-cycles-frontend   #   55.42% frontend cycles idle     ( +-  4.03% ) [78.88%]
        30,572,036 stalled-cycles-backend    #   29.20% backend  cycles idle     ( +-  5.31% ) [71.40%]
       140,669,773 instructions              #    1.34  insns per cycle        
                                             #    0.41  stalled cycles per insn  ( +-  1.40% ) [88.14%]
        38,117,067 branches                  # 1008.646 M/sec                    ( +-  0.65% ) [89.38%]
            27,519 branch-misses             #    0.07% of all branches          ( +-  4.01% ) [86.16%]

       0.038187580 seconds time elapsed                                          ( +-  2.05% )

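Indeed, if we ask the compiler to try just a little bit harder to inline that function, the difference in performance goes away.

So what is the takeaway from this story? Failed inlining can cost you a lot, and you should make full use of the compiler's capabilities: I can only recommend link time optimization. It gave a significant performance boost to my programs (up to 2.5x) and all I needed to do was pass the -flto flag. That's a pretty good deal! ;)

However, I do not recommend littering your code with the inline keyword; let the compiler decide what to do. (The optimizer is allowed to treat the inline keyword as white space anyway.)

Great question, +1!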

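【Comments】:

Note: inline has nothing to do with function inlining; it means "defined inline", not "please inline this". If you want to actually ask for inlining, use __attribute__((always_inline)) or similar.

@JonPurdy Not quite; for example, class member functions are implicitly inline. inline is also a request to the compiler that you would like the function to be inlined; for example, the Intel C++ Compiler used to give a performance warning if it did not fulfill your request. (I haven't checked icc recently to see whether it still does.) Unfortunately, I have seen people litter their code with inline and wait for a miracle to happen. I would not use __attribute__((always_inline)); chances are the compiler developers know better what to inline and what not to. (Despite the counterexample here.)

@JonPurdy On the other hand, if you define a function inline that is not a member function of a class, then you indeed have no choice but to mark it inline; otherwise you will get multiple definition errors from the linker. If that is what you meant, then OK.

Yes, that is what I meant. The standard does say "The inline specifier indicates to the implementation that inline substitution of the function body at the point of call is to be preferred to the usual function call mechanism." (§7.1.2.2) However, implementations are not required to perform that optimization, and it is largely a coincidence that inline functions often happen to be good candidates for inlining. So it is better to be explicit and use a compiler pragma.

@JonPurdy As for the first half: yes, that is what I meant by "The optimizer is allowed to treat the inline keyword as white space anyway." As for the compiler pragma, I would not use it; I would leave it up to link time optimization whether to inline or not. It does a pretty good job; it also automatically solves the issue discussed in the answer.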
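The three notions that come up in these comments can be put next to each other in a short sketch. The names twice, Counter and thrice are hypothetical, and __attribute__((always_inline)) is a GCC/Clang extension, so other compilers may need a different spelling.

```cpp
// Imagine this block lives in a header included from several .cpp files.

// `inline` on a free function: the same definition may appear in every
// translation unit without a multiple-definition error at link time.
// Whether calls are actually inlined is still up to the optimizer.
inline int twice(int x) { return 2 * x; }

struct Counter {
  int value = 0;
  // Defined inside the class body: implicitly inline, no keyword needed.
  void bump() { ++value; }
};

// GCC/Clang-specific request to inline regardless of the usual
// heuristics; the answer above advises against reaching for this.
__attribute__((always_inline)) inline int thrice(int x) { return 3 * x; }
```

Only the last form asks the compiler to override its own cost model; the first two mainly affect linkage rules, which matches the point made in the comments.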
