ftrace在实际问题中的应用
Posted 程序猿Ricky的日常干货
tags:
篇首语:本文由小常识网(cha138.com)小编为大家整理,主要介绍了ftrace在实际问题中的应用相关的知识,希望对你有一定的参考价值。
我在自己的其他博文中有介绍ftrace的介绍和使用方法,那么在实际的工作中,ftrace可以用来做什么呢?
在实际问题场景中ftrace主要用来跟踪延时和调用流程,分析性能问题。
function/function_graph 分析内核函数调用流程
function/function_graph是利用的GCC的编译选项来完成对函数的插桩的,
ftrace自带的function tracer和function_graph这两个tracer内核预定义的连个tracer,一般我们很少用他们去debug具体的性能问题,但是可以用来分析内核函数调用流程的。
function tracer使用如下:
echo function > /sys/kernel/debug/tracing/current_tracer
echo 1 > /sys/kernel/debug/tracing/tracing_on
#test work
echo 0 > /sys/kernel/debug/tracing/tracing_on
cat /sys/kernel/debug/tracing/trace
结果如下:
# tracer: function
#
# entries-in-buffer/entries-written: 261028/11606988 #P:8
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / delay
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
# | | | |||| | |
<idle>-0 [006] dn.2 20756.326744: post_ttbr_update_workaround <-__cpu_suspend_exit
<idle>-0 [006] dn.2 20756.326747: uao_thread_switch <-__cpu_suspend_exit
<idle>-0 [006] dn.2 20756.326748: hw_breakpoint_reset <-__cpu_suspend_exit
<idle>-0 [006] dn.2 20756.326749: write_wb_reg <-hw_breakpoint_reset
<idle>-0 [006] dn.2 20756.326749: write_wb_reg <-hw_breakpoint_reset
<idle>-0 [006] dn.2 20756.326750: write_wb_reg <-hw_breakpoint_reset
<idle>-0 [006] dn.2 20756.326750: write_wb_reg <-hw_breakpoint_reset
<idle>-0 [006] dn.2 20756.326751: write_wb_reg <-hw_breakpoint_reset
<idle>-0 [006] dn.2 20756.326751: write_wb_reg <-hw_breakpoint_reset
<idle>-0 [006] dn.2 20756.326752: write_wb_reg <-hw_breakpoint_reset
<idle>-0 [006] dn.2 20756.326752: write_wb_reg <-hw_breakpoint_reset
[...]
function_graph tracer的使用说明如下:
echo function_graph > /sys/kernel/debug/tracing/current_tracer
echo 1 > /sys/kernel/debug/tracing/tracing_on
#test work
echo 0 > /sys/kernel/debug/tracing/tracing_on
cat /sys/kernel/debug/tracing/trace
结果如下:
# tracer: function_graph
#
# CPU DURATION FUNCTION CALLS
# | | | | | | |
6) | finish_task_switch()
6) | _raw_spin_unlock_irq()
6) 0.052 us | do_raw_spin_unlock();
6) 0.104 us | preempt_count_sub();
6) 2.240 us |
6) 3.542 us |
6) 0.156 us | preempt_count_sub();
6) | _raw_spin_lock_irq()
6) 0.364 us | preempt_count_add();
6) 0.052 us | do_raw_spin_lock();
6) 2.656 us |
6) | process_one_work()
6) 0.260 us | find_worker_executing_work();
6) | _raw_spin_lock_irqsave()
6) 0.104 us | preempt_count_add();
6) 0.052 us | do_raw_spin_lock();
6) 1.406 us |
6) | _raw_spin_unlock_irqrestore()
6) 0.052 us | do_raw_spin_unlock();
6) 0.052 us | preempt_count_sub();
6) 1.823 us |
6) 0.053 us | set_work_pool_and_clear_pending();
6) | _raw_spin_unlock_irq()
6) 0.104 us | do_raw_spin_unlock();
6) 0.156 us | preempt_count_sub();
6) 1.979 us |
[...]
上面介绍的两者方式都可以用来查看代码运行流程,但是这两个tracer默认都是跟踪所有的函数,这对于内核来说是一个庞大的性能消耗,因此实际使用时很少会直接这么使用,内核支持使用dynamic ftrace插桩,针对每个函数可以通过debugfs节点选择是否要使能它。如果我们想只跟踪特定函数,可以把需要配置跟踪的函数写入到节点:
/sys/kernel/debug/tracing/set_ftrace_filter ------- function tracer
/sys/kernel/debug/tracing/set_graph_function ------ function_graph tracer
或者想把一些函数排除在外,可以写入如下节点:
/sys/kernel/debug/tracing/set_ftrace_notrace ------- function tracer
/sys/kernel/debug/tracing/set_graph_notrace ------ function_graph tracer
比如我们想要跟踪scheduler_tick函数的运行流程,可以使用如下方式:
echo function_graph > /sys/kernel/debug/tracing/current_tracer
echo scheduler_tick > /sys/kernel/debug/tracing/set_graph_function
echo 1 > /sys/kernel/debug/tracing/tracing_on
#test work
echo 0 > /sys/kernel/debug/tracing/tracing_on
cat /sys/kernel/debug/tracing/trace
抓出来的trace数据如下:
3) | scheduler_tick()
3) | _raw_spin_lock()
3) 0.313 us | preempt_count_add();
3) 0.260 us | do_raw_spin_lock();
3) 5.572 us |
3) 0.261 us | set_window_start();
3) | sched_ktime_clock()
3) | ktime_get()
3) 0.261 us | arch_counter_read();
3) 2.604 us |
3) 5.417 us |
3) | update_task_ravg()
3) | clk_osm_get_cpu_cycle_counter(
3) 0.208 us | logical_cpu_to_clk();
3) 0.208 us | clk_hw_get_parent();
3) | _raw_spin_lock_irqsave()
3) 0.208 us | preempt_count_add();
3) 0.260 us | do_raw_spin_lock();
3) 3.959 us |
[...]
这一段log将只会包含scheduler_tick函数对应的内容,其中还会打印出每个函数消耗的时间,这个时间是包含了sleep time的,所以可以用来定位系统延迟发生在哪个函数中。
这样看这个log很长,也很难分析,可以采用一个vim插件来尝试优化这个问题。
" Enable folding for ftrace function_graph traces.
"
" To use, :source this file while viewing a function_graph trace, or use vim's
" -S option to load from the command-line together with a trace. You can then
" use the usual vim fold commands, such as "za", to open and close nested
" functions. While closed, a fold will show the total time taken for a call,
" as would normally appear on the line with the closing brace. Folded
" functions will not include finish_task_switch(), so folding should remain
" relatively sane even through a context switch.
"
" Note that this will almost certainly only work well with a
" single-CPU trace (e.g. trace-cmd report --cpu 1).
function! FunctionGraphFoldExpr(lnum)
let line = getline(a:lnum)
if line[-1:] == ''
if line =~ 'finish_task_switch() $'
return '>1'
endif
return 'a1'
elseif line[-1:] == ''
return 's1'
else
return '='
endif
endfunction
function! FunctionGraphFoldText()
let s = split(getline(v:foldstart), '|', 1)
if getline(v:foldend+1) =~ 'finish_task_switch() $'
let s[2] = ' task switch '
else
let e = split(getline(v:foldend), '|', 1)
let s[2] = e[2]
endif
return join(s, '|')
endfunction
setlocal foldexpr=FunctionGraphFoldExpr(v:lnum)
setlocal foldtext=FunctionGraphFoldText()
setlocal foldcolumn=12
setlocal foldmethod=expr
把它保存为function-graph-fold.vim,使用如下命令打开trace文件:
vim -S function-graph-fold.vim trace
这一段vim配置可以让function trace进行折叠,利用za可以展开折叠。
- 502 3) | scheduler_tick()
|+ 503 3) | _raw_spin_lock() ---------------------------------
| 507 3) 0.468 us | set_window_start();
|+ 508 3) | sched_ktime_clock() ------------------------------
|+ 513 3) | update_task_ravg() -------------------------------
| 534 3) 0.625 us | update_rq_clock();
|+ 535 3) | task_tick_fair() ---------------------------------
|+ 561 3) | cpu_load_update_active() -------------------------
| 571 3) 0.521 us | calc_global_load_tick();
| 572 3) 0.521 us | early_detection_notify();
|+ 573 3) | _raw_spin_unlock() -------------------------------
| 577 3) 0.417 us | idle_cpu();
|+ 578 3) | trigger_load_balance() ---------------------------
| 586 3) 0.364 us | __rcu_read_lock();
| 587 3) 0.521 us | update_preferred_cluster();
| 588 3) 0.417 us | __rcu_read_unlock();
| 589 3) 0.573 us | check_for_migration();
| 590 3) ! 298.594 us |
event trace/trace point 分析子系统latency
event trace是进行性能分析时最常用的功能,它虽然是ftrace的一部分,但是和上面提到的function trace是不一样的,它只是利用了ftrace框架的ring buffer和debugfs来实现的。它的关键是tracepoint,插桩并不是利用的gcc的编译选项来做的,而是在代码中需要插桩的位置,开发者自行实现的tracepoint。
我们可以通过如下方式跟踪event或者叫tracepoint:
echo nop > /sys/kernel/debug/tracing/current_tracer
echo 1 > /sys/kernel/debug/tracing/events/enable
这会默认跟踪内核中所有的tracepoint,打印很多信息,一般很少会这么用,我们会指定特定的一个或者几个tracepoint进行trace操作,系统中存在的所有tracepoint可以通过
cat /sys/kernel/debug/tracing/available_events
进行查看,举例比如:
echo 0 > /sys/kernel/debug/tracing/trace ## clear trace buffer
echo nop > /sys/kernel/debug/tracing/current_tracer
echo 1 > /sys/kernel/debug/tracing/events/enable
echo i2c:i2c_write > /sys/kernel/debug/tracing/set_event
echo 1 > /sys/kernel/debug/tracing/tracing_on
#test work
echo 0 > /sys/kernel/debug/tracing/tracing_on
cat /sys/kernel/debug/tracing/trace
抓取得结果如下:
#
# entries-in-buffer/entries-written: 86/86 #P:8
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / delay
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
# | | | |||| | |
kworker/0:0-9794 [000] .... 28497.642973: i2c_write: i2c-0 #0 a=00c f=0000 l=3 [10-70-c8]
Binder:865_A-3050 [000] .... 28497.667486: i2c_write: i2c-5 #0 a=026 f=0000 l=2 [1e-01]
Binder:865_A-3050 [000] .... 28497.684999: i2c_write: i2c-5 #0 a=026 f=0000 l=2 [1e-01]
Binder:865_A-3050 [002] .... 28497.705814: i2c_write: i2c-5 #0 a=026 f=0000 l=2 [1e-01]
Binder:865_A-3050 [000] .... 28497.725763: i2c_write: i2c-5 #0 a=026 f=0000 l=2 [07-01]
kworker/u16:0-6 [002] .... 28497.781856: i2c_write: i2c-5 #0 a=026 f=0000 l=2 [07-00]
nfc@1.1-service-3257 [001] .... 28497.812347: i2c_write: i2c-2 #0 a=028 f=0000 l=4 [21-06-01-00]
kworker/u16:0-6 [002] .... 28497.814925: i2c_write: i2c-5 #0 a=026 f=0000 l=1 [00]
[...]
各个子系统tracepoint的结果是各不相同的,是可以自定义print format的,那么抓出如上的数据,要如何分析呢?可以通过查看如下节点来看print的内容表示什么含义,比如本例:
cat /sys/kernel/debug/tracing/events/i2c/i2c_write/format
name: i2c_write
ID: 805
format:
field:unsigned short common_type; offset:0; size:2; signed:0;
field:unsigned char common_flags; offset:2; size:1; signed:0;
field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
field:int common_pid; offset:4; size:4; signed:1;
field:int adapter_nr; offset:8; size:4; signed:1;
field:__u16 msg_nr; offset:12; size:2; signed:0;
field:__u16 addr; offset:14; size:2; signed:0;
field:__u16 flags; offset:16; size:2; signed:0;
field:__u16 len; offset:18; size:2; signed:0;
field:__data_loc __u8[] buf; offset:20; size:4; signed:0;
print fmt: "i2c-%d #%u a=%03x f=%04x l=%u [%*phD]", REC->adapter_nr, REC->msg_nr, REC->addr, REC->flags, REC->len, REC->len, __get_dynamic_array(buf)
由于这个event trace在日常工作中是非常关键的功能,我再举一个例子,比如遇到io导致的延迟问题,那么可以利用block:block_rq_insert这个tracepoint去跟踪io的情况:
echo 0 > /sys/kernel/debug/tracing/trace ## clear trace buffer
echo nop > /sys/kernel/debug/tracing/current_tracer
echo 1 > /sys/kernel/debug/tracing/events/enable
echo block:block_rq_insert > /sys/kernel/debug/tracing/set_event
echo 1 > /sys/kernel/debug/tracing/tracing_on
#test work
echo 0 > /sys/kernel/debug/tracing/tracing_on
cat /sys/kernel/debug/tracing/trace
运行结果如下:
# tracer: nop
#
# entries-in-buffer/entries-written: 26/26 #P:8
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / delay
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
# | | | |||| | |
kworker/3:0-9167 [003] d..1 29854.965993: block_rq_insert: 0,0 N 0 () 0 + 0 [kworker/3:0]
kworker/3:0-9167 [003] d..1 29854.966260: block_rq_insert: 0,0 N 0 () 0 + 0 [kworker/3:0]
kworker/3:1H-347 [003] d..1 29854.966367: block_rq_insert: 0,0 N 0 () 0 + 0 [kworker/3:1H]
kworker/3:0-9167 [003] d..1 29854.969788: block_rq_insert: 0,0 N 0 () 0 + 0 [kworker/3:0]
SettingsProvide-1641 [006] d..1 29855.789272: block_rq_insert: 8,0 WS 20480 () 116198528 + 40 [SettingsProvide]
kworker/6:2-6717 [006] d..1 29855.794193: block_rq_insert: 0,0 N 0 () 0 + 0 [kworker/6:2]
jbd2/dm-0-8-631 [002] d..1 29855.801276: block_rq_insert: 8,0 WS 53248 () 32244024 + 104 [jbd2/dm-0-8]
jbd2/dm-0-8-631 [001] d..1 29855.804316: block_rq_insert: 8,0 FWS 0 () 0 + 0 [jbd2/dm-0-8]
kworker/0:1-9897 [000] d..1 29855.805400: block_rq_insert: 8,0 WFS 4096 () 32244128 + 8 [kworker/0:1]
kworker/u16:3-8416 [001] d..1 29856.410411: block_rq_insert: 8,0 WM 4096 () 15194696 + 8 [kworker/u16:3]
kworker/u16:3-8416 [001] d..1 29856.410470: block_rq_insert: 8,0 WM 8192 () 15469120 + 16 [kworker/u16:3]
kworker/u16:3-8416 [001] d..1 29856.410481: block_rq_insert: 8,0 WM 4096 () 44554824 + 8 [kworker/u16:3]
kworker/u16:3-8416 [001] d..1 29856.410487: block_rq_insert: 8,0 WM 4096 () 44555368 + 8 [kworker/u16:3]
kworker/u16:3-8416 [001] d..1 29856.410493: block_rq_insert: 8,0 WM 4096 () 115857992 + 8 [kworker/u16:3]
kworker/u16:3-8416 [001] d..1 29856.410498: block_rq_insert: 8,0 WM 4096 () 115858544 + 8 [kworker/u16:3]
[...]
更加详细的可以去跟踪一下调用栈:
echo 0 > /sys/kernel/debug/tracing/trace ## clear trace buffer
echo nop > /sys/kernel/debug/tracing/current_tracer
echo 1 > /sys/kernel/debug/tracing/events/enable
echo block:block_rq_insert > /sys/kernel/debug/tracing/set_event
echo 1 > /sys/kernel/debug/tracing/options/stacktrace
echo 1 > /sys/kernel/debug/tracing/tracing_on
#test work
echo 0 > /sys/kernel/debug/tracing/tracing_on
cat /sys/kernel/debug/tracing/trace
它的打印是这样的:
# tracer: nop
#
# entries-in-buffer/entries-written: 450/450 #P:8
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / delay
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
# | | | |||| | |
kworker/0:0-9953 [000] d..1 30109.045297: block_rq_insert: 0,0 N 0 () 0 + 0 [kworker/0:0]
kworker/0:0-9953 [000] d..1 30109.045320: <stack trace>
=> blk_execute_rq
=> scsi_execute
=> sd_sync_cache
=> sd_suspend_common
=> sd_suspend_runtime
=> scsi_runtime_suspend
=> __rpm_callback
=> rpm_callback
=> rpm_suspend
=> pm_runtime_work
=> process_one_work
=> worker_thread
=> kthread
=> ret_from_fork
kworker/0:0-9953 [000] d..1 30109.045429: block_rq_insert: 0,0 N 0 () 0 + 0 [kworker/0:0]
kworker/0:0-9953 [000] d..1 30109.045435: <stack trace>
=> blk_requeue_request
=> __scsi_queue_insert
=> scsi_queue_insert
此处我们也可以采用function graph去进一步查看某个函数调用信息:
echo 0 > /sys/kernel/debug/tracing/events/enable
echo 0 > /sys/kernel/debug/tracing/options/stacktrace
echo function_graph > /sys/kernel/debug/tracing/current_tracer
echo blk_requeue_request > /sys/kernel/debug/tracing/set_graph_function
echo 1000 > /sys/kernel/debug/tracing/tracing_thresh ## latency threhold us
echo 1 > /sys/kernel/debug/tracing/options/funcgraph-duration
echo 1 > /sys/kernel/debug/tracing/options/funcgraph-abstime
echo 1 > /sys/kernel/debug/tracing/options/funcgraph-proc
echo 1 > /sys/kernel/debug/tracing/tracing_on
echo 0 > /sys/kernel/debug/tracing/tracing_on
cat /sys/kernel/debug/tracing/trace
设置latency threhold可以过滤掉很多低延时的函数:
# tracer: function_graph
#
# TIME CPU TASK/PID DURATION FUNCTION CALLS
# | | | | | | | | | |
33224.935535 | 6) kworker-6717 | $ 486047429 us | /* schedule */
33224.965690 | 6) kworker-6717 | * 30072.18 us | /* schedule */
33225.047661 | 4) kworker-399 | $ 1006941657 us | /* schedule */
33225.846533 | 6) kworker-6717 | @ 880788.9 us | /* schedule */
33225.929458 | 6) kworker-6717 | * 82793.12 us | /* schedule */
33234.967422 | 6) kworker-6717 | $ 9037820 us | /* schedule */
ftrace应用层工具
上面的功能都是通过操作ftrace框架的接口来抓取trace数据的,那么实际上性能大神Brendan Gregg已经实现了一个应用层的perf-tools:https://github.com/brendangregg/perf-tools.git. 使用这个perf tools,可以快速的通过脚本完成上面所介绍的功能,并且比上面介绍的功能更加强大。
参考:
Documentation/trace/function-graph-fold.vim
http://lwn.net/Articles/608497
以上是关于ftrace在实际问题中的应用的主要内容,如果未能解决你的问题,请参考以下文章
Debugging the kernel using Ftrace
Debugging the kernel using Ftrace