内核追踪技术之 ftrace
Posted 为了维护世界和平_
tags:
篇首语:本文由小常识网(cha138.com)小编为大家整理,主要介绍了内核追踪技术之 ftrace相关的知识,希望对你有一定的参考价值。
目录
一、/sys/kernel/debug/trace 目录下的文件 常用的选项
一、/sys/kernel/debug/trace 目录下的文件 常用的选项
- available_tracers 当前系统所支持的跟踪器
- available_events 当前系统所支持的事件
- current_tracer 设置和显示当前正在使用的跟踪器,默认nop
- trace 读取跟踪信息
- tracing_on 用于开始或暂停跟踪
- trace_points 设置ftrace的一些相关选项
查看系统支持哪些跟踪器available_tracers
#:/sys/kernel/debug/tracing# cat available_tracers
hwlat blk mmiotrace function_graph wakeup_dl wakeup_rt wakeup function nop
- hwlat 与硬件相关的延时
- blk 跟踪块设备函数
- mmiotrace 跟踪内存映射I/O操作
- function_graph 函数调用关系图
- wakeup 跟踪普通优先级的进程 从获得调度到被唤醒的最长延迟时间
- wakeup_rt 跟踪RT类型的任务从获得调度到被唤醒的最长延迟时间
- function 跟踪内核函数执行情况
1、function_graph跟踪器
function_graph > current_tracer
echo 1 > tracing_on
...
echo 0 > tracing_on
cat trace
函数调用关系图
1) 0.081 us | get_xsave_addr();
1) 0.079 us | finish_task_switch();
1) + 25.319 us | /* schedule */
1) + 25.472 us | /* exit_to_usermode_loop */
1) 0.076 us | fpregs_assert_state_consistent();
1) 0.121 us | switch_fpu_return();
1) + 35.360 us |
1) | do_syscall_64()
1) | __x64_sys_sendmsg()
1) | __sys_sendmsg()
1) | sockfd_lookup_light()
1) | __fdget()
1) | __fget_light()
1) 0.110 us | __fget();
1) 0.262 us |
1) 0.414 us |
1) 0.579 us |
1) | ___sys_sendmsg()
1) | copy_msghdr_from_user()
1) | rw_copy_check_uvector()
1) | __check_object_size()
1) 0.092 us | check_stack_object();
1) 0.246 us |
1) 0.413 us |
1) 0.655 us |
1) | ____sys_sendmsg()
1) | sock_sendmsg()
1) | security_socket_sendmsg()
1) | apparmor_socket_sendmsg()
1) 0.074 us | aa_unix_msg_perm();
1) 0.224 us |
1) 0.405 us |
1) | unix_stream_sendmsg()
1) | wait_for_unix_gc()
1) | _cond_resched()
1) 0.075 us | rcu_all_qs();
1) 0.241 us |
1) 0.417 us |
2、function跟踪器
function会跟踪当前系统中所有的函数,如果想跟踪单个进程则使用set_ftrace_pid
#cat set_ftrace_pid
no pid
#echo 108432 > set_ftrace_pid
#cat set_ftrace_pid
108432
#echo function > current_tracer
#cat trace
二、查看系统支持哪些事件
cat /sys/kernel/debug/tracing/available_events
initcall:initcall_level
initcall:initcall_start
initcall:initcall_finish
raw_syscalls:sys_enter
raw_syscalls:sys_exit
syscalls:sys_enter_rt_sigreturn
syscalls:sys_exit_rt_sigreturn
syscalls:sys_enter_mmap
syscalls:sys_exit_mmap
...
1、事件跟踪
两种跟踪机制:函数和跟踪点,前者属于简单的操作,后者可以输出开发者想要的参数、局部变量等信息。
跟踪点的位置比较固定,一般是内核开发者添加上去的,如果在运行时没有开启DEBUG,那么不占用任何系统开销的。
在源码中以trace_开头的函数,是ftrace的跟踪点。如:update_curr
static void update_curr(struct cfs_rq *cfs_rq)
if (entity_is_task(curr))
struct task_struct *curtask = task_of(curr);
trace_sched_stat_runtime(curtask, delta_exec, curr->vruntime);
cgroup_account_cputime(curtask, delta_exec);
account_group_exec_runtime(curtask, delta_exec);
trace_sched_stat_runtime 是使用了sched_stat_runtime跟踪点,在命令行下也可以查询到
筛选 grep sched_stat_runtime
benshushu:tracing# cat /sys/kernel/debug/tracing/available_events | grep sched_stat_runtime
sched:sched_stat_runtime
找到跟踪点后如何跟踪这个事件
# echo sched:sched_stat_runtime > /sys/kernel/debug/tracing/set_event
# [ 4664.128181] Scheduler tracepoints stat_sleep, stat_iowait, stat_blocked and stat_runtime require the kernel parameter schedstats=enable or kernel.sched_schedstats=1
# echo 1 > tracing_on
# cat trace
# tracer: nop
#
# entries-in-buffer/entries-written: 217/217 #P:4
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / delay
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
# | | | |||| | |
bash-678 [003] d... 4749.608750: sched_stat_runtime: comm=bash pid=678 runtime=3010992 [ns] vruntime=3326653851 [ns]
kworker/3:2-365 [003] d... 4749.611590: sched_stat_runtime: comm=kworker/3:2 pid=365 runtime=336064 [ns] vruntime=35834438828 [ns]
rcu_sched-10 [002] d... 4749.611633: sched_stat_runtime: comm=rcu_sched pid=10 runtime=142128 [ns] vruntime=54819089300 [ns]
rcu_sched-10 [002] d... 4749.618566: sched_stat_runtime: comm=rcu_sched pid=10 runtime=234512 [ns] vruntime=54819323812 [ns]
rcu_sched-10 [002] d... 4749.625277: sched_stat_runtime: comm=rcu_sched pid=10 runtime=244384 [ns] vruntime=54819568196 [ns]
kworker/0:1-786 [000] d... 4750.670649: sched_stat_runtime: comm=kworker/0:1 pid=786 runtime=274624 [ns] vruntime=33254509938 [ns]
kworker/u8:1-47 [002] d... 4751.269334: sched_stat_runtime: comm=kworker/u8:1 pid=47 runtime=323152 [ns] vruntime=54819552788 [ns]
解释标头
benshushu:tracing# cat trace
# tracer: nop
#
# entries-in-buffer/entries-written: 217/217 #P:4
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / delay
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
# | | | |||| | |
bash-678 [003] d... 4749.608750: sched_stat_runtime: comm=bash pid=678 runtime=3010992 [ns] vruntime=3326653851 [ns]
TASK_PID 任务的PID
irq-off:d表示中断已经关闭;若设置"." 表示中断没有关闭
need-resched: 是否需要调度
N 表示设置了TIF_NEED_RESCHED和PREEMPT_NEED_RESCHED标志位,需要被调度
n 表示设置了TIF_NEED_RESCHED
p表示设置了PREEMPT_NEED_RESCHED
. 表示不需要调度
hardirq/softirq:表示是否发生了软中断或硬中断
H 表示在一次硬中断中发生了一次硬中断
h 表示硬中断发生
s 表示软中断
.表示没有中断发生
preempt-depth 表示抢占关闭的嵌套层级
delay 表示延迟的时间长度
$ 长于1s
@长于100ms
* 长于10ms
# 长于1000us
! 长于100us
+ 长于10us
设置跟踪条件
#:/sys/kernel/debug/tracing/events/sched/sched_stat_runtime# cat format
name: sched_stat_runtime
ID: 311
format:
field:unsigned short common_type; offset:0; size:2; signed:0;
field:unsigned char common_flags; offset:2; size:1; signed:0;
field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
field:int common_pid; offset:4; size:4; signed:1;
field:char comm[16]; offset:8; size:16; signed:1;
field:pid_t pid; offset:24; size:4; signed:1;
field:u64 runtime; offset:32; size:8; signed:0;
field:u64 vruntime; offset:40; size:8; signed:0;
print fmt: "comm=%s pid=%d runtime=%Lu [ns] vruntime=%Lu [ns]", REC->comm, REC->pid, (unsigned long long)REC->runtime, (unsigned long long)REC->vruntime
跟踪点支持8个域,前四个是通用域,后4个是跟踪点支持的域。而comm是一个字符串域,其他的域是数字域。
设置事件后,还可以对事件进行过滤
//切换到事件下
/sys/kernel/debug/tracing/events/sched/sched_stat_runtime#
//跟踪sh开头的进程
echo 'comm ~ "sh*" ' > filter
2、动态ftrace
在实际调试过程中,所需要的信息会被大量的ftrace覆盖,所以动态过滤方法很有用。
set_ftrace_filter和set_ftrace_notrace选项配对使用。前者设置要跟踪的函数,后者设置不要跟踪的函数。
available_filter_function文件可以列出当前系统支持的所有函数
#cat available_filter_functions
....
__ip6_flush_pending_frames
ip6_flush_pending_frames
ip6_autoflowlabel.part.0
ip6_finish_output2
ip6_copy_metadata
ip6_fraglist_prepare
ip6_frag_next
ip6_fraglist_init
ip6_setup_cork
ip6_sk_dst_lookup_flow
ip6_xmit
....
如 只关注hrtimer_interrupt函数
#echo hrtimer_interrupt > set_ftrace_filter
# echo 1 > tracing_on
...
# echo 0 > tracing_on
# cat trace
三、添加跟踪点
内核中的跟踪点不能满足要求,可手动添加。
以update_curr()为例,观察cfs_rq就绪队列中min_vruntime成员变化情况。
首先需要在include/trace/events/sched.h 的头文件中添加一个名为sched_stat_minvruntime的跟踪点。
按照宏填写参数
#define TRACE_EVENT(name,proto,args,struct,assign,print) \\
DECLARE_TRACE(name,PARAMS(proto),PATAMS(args))
TRACE_EVENT(sched_stat_minvruntime,
TP_PROTO(struct task_struct *tsk,u64 minvruntime),
TP_ARGS(tsk,minvruntime),
TP_STRUCT__entry(
__array( char, comm, TASK_COMM_LEN )
__field( pid_t, pid )
__field( u64, vruntime)
),
TP_fast_assign(
memcpy(__entry->comm, t->comm, TASK_COMM_LEN);
__entry->pid = t->pid;
__entry->vruntime =minruntime;
),
TP_printk("comm=%s pid=%d vruntime=%Lu [ns]", __entry->comm, __entry->pid,
(unsigned long long)__entry->vruntime)
);
- name: 跟踪点名字
- proto:该跟踪点的原型
- args:参数
- struct:定义跟踪器内部使用的_entry数据结构
- assign:把参数复制到__entry数据结构中
- print:输出的格式
把trace_sched_stat_minvruntime()添加到update_curr()函数中
static void update_curr(struct cfs_rq *cfs_rq)
if (entity_is_task(curr))
struct task_struct *curtask = task_of(curr);
trace_sched_stat_runtime(curtask, delta_exec, curr->vruntime);
trace_sched_stat_minvruntime(curtask,cfs_rq->mim_vruntime);
重新编译内核,在sys节点上看到添加跟踪节点。
cd /sys/kernel/debug/tracing/events/sched/sched_stat_minvruntime/
# ls
enable filter format id trigger
# cat format
name: sched_stat_minvruntime
ID: 171
format:
field:unsigned short common_type; offset:0; size:2; signed:0;
field:unsigned char common_flags; offset:2; size:1; signed:0;
field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
field:int common_pid; offset:4; size:4; signed:1;
field:char comm[16]; offset:8; size:16; signed:0;
field:pid_t pid; offset:24; size:4; signed:1;
field:u64 vruntime; offset:32; size:8; signed:0;
print fmt: "comm=%s pid=%d vruntime=%Lu [ns]", REC->comm, REC->pid, (unsigned long long)REC->vruntime
查看添加的事件
tracing# echo sched:sched_stat_minvruntime > set_event
tracing# echo 1 > tracing_on
tracing# cat trace
# tracer: nop
#
# entries-in-buffer/entries-written: 587/587 #P:4
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / delay
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
# | | | |||| | |
bash-555 [003] d.h. 236.982374: sched_stat_minvruntime: comm=bash pid=555 vruntime=1669227760 [ns]
bash-555 [003] d.h. 236.985426: sched_stat_minvruntime: comm=bash pid=555 vruntime=1672976144 [ns]
bash-555 [003] d... 236.986699: sched_stat_minvruntime: comm=bash pid=555 vruntime=1674223536 [ns]
rcu_sched-10 [002] d... 236.986838: sched_stat_minvruntime: comm=rcu_sched pid=10 vruntime=23631132868 [ns]
rcu_sched-10 [002] d... 236.994794: sched_stat_minvruntime: comm=rcu_sched
四、跟踪标记
trace_marker,可以跟踪用户程序。trace_marker是一个文件节点,允许用户程序写入字符串,ftrace会记录该写入动作的时间戳。
可以观察用户程序在两个跟踪标记之间的内核空间中发生的事情
trace_marker所在的文件路径
/sys/kernel/debug/tracing/trace_marker
int main()
int fd_mark = open("/sys/kernel/debug/tracing/trace_marker", O_CREAT|O_RDWR, 0666);
int fd_trace = open("/sys/kernel/debug/tracing/tracing_on", O_CREAT|O_RDWR, 0666);
write(fd_trace, "1", 2);
//添加marker
write(fd_mark, "start time...", 11);
/* to do something */
write(fd_mark, "end time...", 11);
write(fd_trace, "0", 2);
close(fd_mark);
close(fd_trace);
内核提供的测试模块
sample/trace_events目录
五、trace-cmd, kerneshark
trace-cmd 遵循reset->record->stop->report,使用report收集数据,Ctrl+C终止,生成trace.data文件。
追踪
trace-cmd record -e 'sched_wakeup*' -e sched_switch -e 'sched_migrate*'
解析 使用trace-cmd report解析data.dat文件 文字形式
benshushu:~# trace-cmd report trace.dat
cpus=4
trace-cmd-6338 [003] 599.589634: sched_wakeup: rcu_sched:10 [120] success=1 CPU:003
trace-cmd-6320 [000] 599.589635: sched_switch: trace-cmd:6320 [120] R ==> trace-cmd:6337 [120]
trace-cmd-6338 [003] 599.590964: sched_wakeup: ksoftirqd/3:26 [120] success=1 CPU:003
trace-cmd-6338 [003] 599.591186: sched_switch: trace-cmd:6338 [120] R ==> rcu_sched:10 [120]
rcu_sched-10 [003] 599.591324: sched_switch: rcu_sched:10 [120] W ==> ksoftirqd/3:26 [120]
ksoftirqd/3-26 [003] 599.591380: sched_switch: ksoftirqd/3:26 [120] S ==> trace-cmd:6338 [120]
trace-cmd-6336 [002] 599.592586: sched_switch: trace-cmd:6336 [120] S ==> swapper/2:0 [120]
<idle>-0 [002] 599.592718: sched_wakeup: dhcpcd:733 [120] success=1 CPU:002
trace-cmd-6335 [001] 599.592781: sched_switch: trace-cmd:6335 [120] S ==> trace-cmd:6320 [120]
<idle>-0 [002] 599.592887: sched_wakeup: rcu_sched:10 [120] success=1 CPU:003
<idle>-0 [002] 599.592928: sched_switch: swapper/2:0 [120] R ==> dhcpcd:733 [120]
dhcpcd-733 [002] 599.594394: sched_migrate_task: comm=kworker/u8:2 pid=151 prio=120 orig_cpu=0 dest_cpu=2
dhcpcd-733 [002] 599.594628: sched_wakeup: kworker/u8:2:151 [120] success=1 CPU:002
dhcpcd-733 [002] 599.594854: sched_switch: dhcpcd:733 [120] D ==> kworker/u8:2:151 [120]
kworker/u8:2-151 [002] 599.595194: sched_wakeup_new: kworker/u8:2:6339 [120] success=1 CPU:002
kworker/u8:2-151 [002] 599.595248: sched_switch: kworker/u8:2:151 [120] S ==> kworker/u8:2:6339 [120]
kernelshark是以图形的形式解析data.dat文件
kernelshark trace.dat
待补充
以上是关于内核追踪技术之 ftrace的主要内容,如果未能解决你的问题,请参考以下文章
Debugging the kernel using Ftrace
Debugging the kernel using Ftrace