gcc -Ofast -pg 错误的 gprof 结果
Posted
技术标签:
【中文标题】gcc -Ofast -pg 错误的 gprof 结果【英文标题】:wrong gprof results with gcc -Ofast -pg 【发布时间】:2016-08-16 03:54:50 【问题描述】:我看到 gcc -pg
和 gprof
的分析结果很奇怪。
我找不到更好/更小的重现方法,所以我将实际代码链接到我看到问题的地方。
我正在使用代码 here,使用 make prof
构建 gprof 分析结果
我注意到几个错误,主要是:
-
一个根本不应该被调用的函数
shape_stream_test
显示为被 ai_best_move_rec
大量调用,并且在自身时间中占用了总运行时间的 15%
ai_best_move_rec
合法调用的其他函数,例如 grid_block_remove
、grid_cpy
、grid_new
甚至不会作为子函数出现
实际教授统计数据:
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls ms/call ms/call name
53.92 4.00 4.00 4231936 0.00 0.00 grid_eval
13.62 5.01 1.01 4467368 0.00 0.00 shape_stream_test
6.88 5.52 0.51 4690742 0.00 0.00 grid_block_center_top
6.07 5.97 0.45 4521016 0.00 0.00 grid_block_valid
5.80 6.40 0.43 4467368 0.00 0.00 grid_block_add
3.37 6.65 0.25 4467368 0.00 0.00 grid_block_drop
2.02 6.80 0.15 35 4.29 209.20 ai_best_move_rec
1.95 6.95 0.15 235469 0.00 0.00 grid_init
1.75 7.08 0.13 9212461 0.00 0.00 block_extreme
1.48 7.19 0.11 4467402 0.00 0.00 block_move
0.94 7.26 0.07 1654899 0.00 0.00 block_get
0.54 7.30 0.04 block_crust_get
0.40 7.33 0.03 grid_block_set_color
0.27 7.35 0.02 4467368 0.00 0.00 grid_block_remove
0.27 7.37 0.02 235501 0.00 0.00 block_new
0.27 7.39 0.02 grid_block_intersects
0.20 7.40 0.02 235469 0.00 0.00 grid_new
0.13 7.41 0.01 235467 0.00 0.00 shape_stream_peek
0.13 7.42 0.01 drop_amount
0.00 7.42 0.00 621 0.00 0.00 grid_clear_lines
0.00 7.42 0.00 587 0.00 0.00 grid_cpy
0.00 7.42 0.00 35 0.00 0.00 grid_print
0.00 7.42 0.00 34 0.00 0.00 block_init
0.00 7.42 0.00 34 0.00 0.00 block_print
0.00 7.42 0.00 34 0.00 0.00 game_move_print
0.00 7.42 0.00 34 0.00 0.00 grid_apply_moves
0.00 7.42 0.00 34 0.00 0.00 shape_stream_pop
0.00 7.42 0.00 7 0.00 0.00 shape_new
0.00 7.42 0.00 1 0.00 0.00 shape_stream_new
% the percentage of the total running time of the
time program used by this function.
cumulative a running sum of the number of seconds accounted
seconds for by this function and those listed above it.
self the number of seconds accounted for by this
seconds function alone. This is the major sort for this
listing.
calls the number of times this function was invoked, if
this function is profiled, else blank.
self the average number of milliseconds spent in this
ms/call function per call, if this function is profiled,
else blank.
total the average number of milliseconds spent in this
ms/call function and its descendents per call, if this
function is profiled, else blank.
name the name of the function. This is the minor sort
for this listing. The index shows the location of
the function in the gprof listing. If the index is
in parenthesis it shows where it would appear in
the gprof listing if it were to be printed.
Copyright (C) 2012-2014 Free Software Foundation, Inc.
Copying and distribution of this file, with or without modification,
are permitted in any medium without royalty provided the copyright
notice and this notice are preserved.
Call graph (explanation follows)
granularity: each sample hit covers 2 byte(s) for 0.13% of 7.42 seconds
index % time self children called name
<spontaneous>
[1] 98.7 0.00 7.32 ai_test [1]
0.15 7.17 35/35 ai_best_move_rec [2]
0.00 0.00 34/34 grid_apply_moves [21]
0.00 0.00 2/235469 grid_new [9]
0.00 0.00 35/35 grid_print [24]
0.00 0.00 34/34 game_move_print [27]
0.00 0.00 34/34 shape_stream_pop [28]
0.00 0.00 1/1 shape_stream_new [30]
-----------------------------------------------
235432 ai_best_move_rec [2]
0.15 7.17 35/35 ai_test [1]
[2] 98.7 0.15 7.17 35+235432 ai_best_move_rec [2]
4.00 0.00 4231936/4231936 grid_eval [3]
1.01 0.00 4467368/4467368 shape_stream_test [4]
0.51 0.11 4690742/4690742 grid_block_center_top [5]
0.45 0.09 4521016/4521016 grid_block_valid [6]
0.43 0.00 4467368/4467368 grid_block_add [7]
0.25 0.00 4467368/4467368 grid_block_drop [8]
0.02 0.15 235467/235469 grid_new [9]
0.11 0.00 4467368/4467402 block_move [12]
0.02 0.00 4467368/4467368 grid_block_remove [16]
0.02 0.00 235467/235501 block_new [17]
0.01 0.00 235467/235467 shape_stream_peek [19]
0.00 0.00 587/587 grid_cpy [23]
0.00 0.00 587/621 grid_clear_lines [22]
235432 ai_best_move_rec [2]
-----------------------------------------------
4.00 0.00 4231936/4231936 ai_best_move_rec [2]
[3] 53.9 4.00 0.00 4231936 grid_eval [3]
-----------------------------------------------
1.01 0.00 4467368/4467368 ai_best_move_rec [2]
[4] 13.6 1.01 0.00 4467368 shape_stream_test [4]
-----------------------------------------------
0.51 0.11 4690742/4690742 ai_best_move_rec [2]
[5] 8.4 0.51 0.11 4690742 grid_block_center_top [5]
0.07 0.00 4690742/9212461 block_extreme [11]
0.04 0.00 1056160/1654899 block_get [13]
-----------------------------------------------
0.45 0.09 4521016/4521016 ai_best_move_rec [2]
[6] 7.3 0.45 0.09 4521016 grid_block_valid [6]
0.06 0.00 4521016/9212461 block_extreme [11]
0.03 0.00 598739/1654899 block_get [13]
-----------------------------------------------
0.43 0.00 4467368/4467368 ai_best_move_rec [2]
[7] 5.8 0.43 0.00 4467368 grid_block_add [7]
-----------------------------------------------
0.25 0.00 4467368/4467368 ai_best_move_rec [2]
[8] 3.4 0.25 0.00 4467368 grid_block_drop [8]
0.00 0.00 635/9212461 block_extreme [11]
-----------------------------------------------
0.00 0.00 2/235469 ai_test [1]
0.02 0.15 235467/235469 ai_best_move_rec [2]
[9] 2.2 0.02 0.15 235469 grid_new [9]
0.15 0.00 235469/235469 grid_init [10]
-----------------------------------------------
0.15 0.00 235469/235469 grid_new [9]
[10] 2.0 0.15 0.00 235469 grid_init [10]
-----------------------------------------------
0.00 0.00 68/9212461 grid_apply_moves [21]
0.00 0.00 635/9212461 grid_block_drop [8]
0.06 0.00 4521016/9212461 grid_block_valid [6]
0.07 0.00 4690742/9212461 grid_block_center_top [5]
[11] 1.8 0.13 0.00 9212461 block_extreme [11]
-----------------------------------------------
0.00 0.00 34/4467402 grid_apply_moves [21]
0.11 0.00 4467368/4467402 ai_best_move_rec [2]
[12] 1.5 0.11 0.00 4467402 block_move [12]
-----------------------------------------------
0.03 0.00 598739/1654899 grid_block_valid [6]
0.04 0.00 1056160/1654899 grid_block_center_top [5]
[13] 0.9 0.07 0.00 1654899 block_get [13]
-----------------------------------------------
<spontaneous>
[14] 0.5 0.04 0.00 block_crust_get [14]
-----------------------------------------------
<spontaneous>
[15] 0.4 0.03 0.00 grid_block_set_color [15]
-----------------------------------------------
0.02 0.00 4467368/4467368 ai_best_move_rec [2]
[16] 0.3 0.02 0.00 4467368 grid_block_remove [16]
-----------------------------------------------
0.00 0.00 34/235501 grid_apply_moves [21]
0.02 0.00 235467/235501 ai_best_move_rec [2]
[17] 0.3 0.02 0.00 235501 block_new [17]
-----------------------------------------------
<spontaneous>
[18] 0.3 0.02 0.00 grid_block_intersects [18]
-----------------------------------------------
0.01 0.00 235467/235467 ai_best_move_rec [2]
[19] 0.1 0.01 0.00 235467 shape_stream_peek [19]
-----------------------------------------------
<spontaneous>
[20] 0.1 0.01 0.00 drop_amount [20]
-----------------------------------------------
0.00 0.00 34/34 ai_test [1]
[21] 0.0 0.00 0.00 34 grid_apply_moves [21]
0.00 0.00 34/235501 block_new [17]
0.00 0.00 68/9212461 block_extreme [11]
0.00 0.00 34/4467402 block_move [12]
0.00 0.00 34/34 block_init [25]
0.00 0.00 34/621 grid_clear_lines [22]
-----------------------------------------------
0.00 0.00 34/621 grid_apply_moves [21]
0.00 0.00 587/621 ai_best_move_rec [2]
[22] 0.0 0.00 0.00 621 grid_clear_lines [22]
-----------------------------------------------
0.00 0.00 587/587 ai_best_move_rec [2]
[23] 0.0 0.00 0.00 587 grid_cpy [23]
-----------------------------------------------
0.00 0.00 35/35 ai_test [1]
[24] 0.0 0.00 0.00 35 grid_print [24]
-----------------------------------------------
0.00 0.00 34/34 grid_apply_moves [21]
[25] 0.0 0.00 0.00 34 block_init [25]
-----------------------------------------------
0.00 0.00 34/34 game_move_print [27]
[26] 0.0 0.00 0.00 34 block_print [26]
-----------------------------------------------
0.00 0.00 34/34 ai_test [1]
[27] 0.0 0.00 0.00 34 game_move_print [27]
0.00 0.00 34/34 block_print [26]
-----------------------------------------------
0.00 0.00 34/34 ai_test [1]
[28] 0.0 0.00 0.00 34 shape_stream_pop [28]
-----------------------------------------------
0.00 0.00 7/7 shapes_read [68]
[29] 0.0 0.00 0.00 7 shape_new [29]
-----------------------------------------------
0.00 0.00 1/1 ai_test [1]
[30] 0.0 0.00 0.00 1 shape_stream_new [30]
-----------------------------------------------
This table describes the call tree of the program, and was sorted by
the total amount of time spent in each function and its children.
Each entry in this table consists of several lines. The line with the
index number at the left hand margin lists the current function.
The lines above it list the functions that called this function,
and the lines below it list the functions this one called.
This line lists:
index A unique number given to each element of the table.
Index numbers are sorted numerically.
The index number is printed next to every function name so
it is easier to look up where the function is in the table.
% time This is the percentage of the `total' time that was spent
in this function and its children. Note that due to
different viewpoints, functions excluded by options, etc,
these numbers will NOT add up to 100%.
self This is the total amount of time spent in this function.
children This is the total amount of time propagated into this
function by its children.
called This is the number of times the function was called.
If the function called itself recursively, the number
only includes non-recursive calls, and is followed by
a `+' and the number of recursive calls.
name The name of the current function. The index number is
printed after it. If the function is a member of a
cycle, the cycle number is printed between the
function's name and the index number.
For the function's parents, the fields have the following meanings:
self This is the amount of time that was propagated directly
from the function into this parent.
children This is the amount of time that was propagated from
the function's children into this parent.
called This is the number of times this parent called the
function `/' the total number of times the function
was called. Recursive calls to the function are not
included in the number after the `/'.
name This is the name of the parent. The parent's index
number is printed after it. If the parent is a
member of a cycle, the cycle number is printed between
the name and the index number.
If the parents of the function cannot be determined, the word
`<spontaneous>' is printed in the `name' field, and all the other
fields are blank.
For the function's children, the fields have the following meanings:
self This is the amount of time that was propagated directly
from the child into the function.
children This is the amount of time that was propagated from the
child's children to the function.
called This is the number of times the function called
this child `/' the total number of times the child
was called. Recursive calls by the child are not
listed in the number after the `/'.
name This is the name of the child. The child's index
number is printed after it. If the child is a
member of a cycle, the cycle number is printed
between the name and the index number.
If there are any cycles (circles) in the call graph, there is an
entry for the cycle-as-a-whole. This entry shows who called the
cycle (as parents) and the members of the cycle (as children.)
The `+' recursive calls entry shows the number of function calls that
were internal to the cycle, and the calls entry for each member shows,
for that member, how many times it was called from other members of
the cycle.
Copyright (C) 2012-2014 Free Software Foundation, Inc.
Copying and distribution of this file, with or without modification,
are permitted in any medium without royalty provided the copyright
notice and this notice are preserved.
Index by function name
[2] ai_best_move_rec [21] grid_apply_moves [3] grid_eval
[14] block_crust_get [7] grid_block_add [10] grid_init
[11] block_extreme [5] grid_block_center_top [9] grid_new
[13] block_get [8] grid_block_drop [24] grid_print
[25] block_init [18] grid_block_intersects [29] shape_new
[12] block_move [16] grid_block_remove [30] shape_stream_new
[17] block_new [15] grid_block_set_color [19] shape_stream_peek
[26] block_print [6] grid_block_valid [28] shape_stream_pop
[20] drop_amount [22] grid_clear_lines [4] shape_stream_test
[27] game_move_print [23] grid_cpy
后来我发现我正在组合 -Ofast 和 -pg 标志,删除 -Ofast 可以解决上述问题。
我的问题是:
这两个标志不兼容吗?我在文档中找不到任何此类警告
如果它们不是不兼容的,这是 gcc 错误吗?
如何解决此问题以查看与最激进的优化匹配的分析结果,以免浪费时间优化错误的代码?
【问题讨论】:
【参考方案1】:你想要速度,对吧? -Ofast
仅在调用堆栈的底部很重要。它不能解决只有你能解决的问题。它只会让他们更难找到。首先关闭优化器并修复您可以修复的内容。 Here's an example of how some people do it. 当你尽可能地使用它时,打开优化器并让它发挥它的魔力。
Small Flame :) 很多人说分析未优化的代码是浪费时间,但他们从不说原因。 他们在演讲厅听到它,在博客上或从 *** 上阅读它,从某个在该主题上具有明确权威的人无需给出理由。
我认为这是一厢情愿的情况:我的代码基本上是完美的,因此可以加速它的唯一方法是运行编译器的优化器。然后当分析器显示无法加快速度时 - 这显然是正确的,我为自己欢呼!
为回应 cmets 添加:让我离题一点(抱歉)。我知道人们可以仔细检查代码,在数据表示上投入精力,内联一些函数,进行其他高级优化,并测量这样做的加速。伟大的。 但是,如果您每次更改代码的原因是查看并只是考虑代码,那么尽管您的想法受过教育,但它仍然是一个猜测。 猜测可能是正确的,但您应该问的问题是“我错过了什么?”和“我怎么找到它?”
我和很多人使用的方法是random pausing。 它需要的唯一工具是调试器,如 GDB 或任何 IDE。 它与分析的不同之处在于它直接定位不必要的计算,而不是进行测量并依靠您来破译它们、过滤掉噪音、查看特定的例程并在其中搜索。 它会告诉您具体的代码行和特定的数据、程序在做什么以及为什么要这样做。 然后你可以问是否有一种不那么浪费的方法来做到这一点,而且很有可能。 它不做的是告诉你精确的时间分数——你只能得到一个非常粗略的测量结果。它所做 所做的是识别问题。 您可能认为您当然不能依赖这种不准确的方法,如果您只在 one 示例中看到问题,那是真的。 但是,如果您在多个样本上看到它,您就会知道它很大,而且您为多次查看它而获取的整体样本越少,它就越大。 Here are the statistics behind it. 关键在于:这将发现任何分析器会发现的任何问题,等等。
然后,当您发现并解决了一个问题时,您可以重新进行所有操作,因为解决一个问题会放大剩余的问题。
通过这种方式,您可以“阶梯式”加速,直到无法再加速为止。
这就是你如何获得最大速度as in this example。
然后,一定要使用-Ofast
。
【讨论】:
我认为“只有你可以解决的事情”是指更高级别的优化。但在这种情况下,我对手动进行低级优化很感兴趣。这就是为什么没有 -Ofast 我认为我无法准确表示代码的哪些部分值得我关注。我也很想了解为什么这个不相关的、未被调用的函数出现在统计数据中,标志是否不兼容或者这是否是一个错误。 @erjoalgo:天哪,多点。首先,即使 gprof 没有错误,it has many problems,所以人们可能想知道您为什么要费心与它搏斗。其次,如果速度是您的目标,为什么高级优化不值得您关注?第三,你可以做的任何事情来加速低级代码在未优化的代码中是显而易见的。如果编译器优化了一些低级代码,那不会使 you 可以做的事情更加明显(即使它确实 显示了所有函数)。如果您愿意,我会尝试更详细地解释这一点。 在这个特定的项目中,我确实在数据表示和算法上投入了足够的精力,以最大限度地降低复杂性。所以我已经在高级优化上花费了很多精力,我想抓住低垂的果实。例如,通过手动内联一些计算、远离排序库等,我已经获得了显着的加速。 我同意“你可以做的任何事情来加速低级代码在未优化的代码中是显而易见的”,但我不认为反过来是正确的,优化代码的分析器可能会声明函数 X真正优化好的时候才值得关注,不值得关注。例如,未优化程序图表中的grid_block_add
花费的总程序时间几乎是优化版本的两倍(7% 对 13%)。
我愿意接受您的回答,以便继续进行此操作。你介意添加你将使用的东西而不是 gprof 来分析积极优化的代码吗?以上是关于gcc -Ofast -pg 错误的 gprof 结果的主要内容,如果未能解决你的问题,请参考以下文章