do_fork Implementation -- Part 1
The previous sections described how a process or a thread is created with fork, vfork, and pthread_create. That analysis showed that all three end up creating the new task through the clone system call.
In this section we focus on what happens after clone; everything so far was groundwork for it.
Since fork, vfork, and pthread_create all funnel into the clone system call, we continue the analysis along that path.
SYSCALL_DEFINE5(clone, unsigned long, clone_flags, unsigned long, newsp,
                int __user *, parent_tidptr,
                int __user *, child_tidptr,
                unsigned long, tls)
{
    return _do_fork(clone_flags, newsp, 0, parent_tidptr, child_tidptr, tls);
}
The implementation of the clone system call lives in fork.c; as the code shows, clone simply forwards to _do_fork.
/*
* Ok, this is the main fork-routine.
*
* It copies the process, and if successful kick-starts
* it and waits for it to finish using the VM if required.
*/
long _do_fork(unsigned long clone_flags,
              unsigned long stack_start,
              unsigned long stack_size,
              int __user *parent_tidptr,
              int __user *child_tidptr,
              unsigned long tls)
{
    struct completion vfork;
    struct pid *pid;
    struct task_struct *p;
    int trace = 0;
    long nr;

    /*
     * Determine whether and which event to report to ptracer. When
     * called from kernel_thread or CLONE_UNTRACED is explicitly
     * requested, no event is reported; otherwise, report if the event
     * for the type of forking is enabled.
     */
    if (!(clone_flags & CLONE_UNTRACED)) {
        if (clone_flags & CLONE_VFORK)
            trace = PTRACE_EVENT_VFORK;
        else if ((clone_flags & CSIGNAL) != SIGCHLD)
            trace = PTRACE_EVENT_CLONE;
        else
            trace = PTRACE_EVENT_FORK;

        if (likely(!ptrace_event_enabled(current, trace)))
            trace = 0;
    }

    p = copy_process(clone_flags, stack_start, stack_size,
                     child_tidptr, NULL, trace, tls, NUMA_NO_NODE);
    add_latent_entropy();

    if (IS_ERR(p))
        return PTR_ERR(p);

    /*
     * Do this prior waking up the new thread - the thread pointer
     * might get invalid after that point, if the thread exits quickly.
     */
    trace_sched_process_fork(current, p);

    pid = get_task_pid(p, PIDTYPE_PID);
    nr = pid_vnr(pid);  /* nr is the return value: 0 in the child, the child's pid in the parent */

    if (clone_flags & CLONE_PARENT_SETTID)
        put_user(nr, parent_tidptr);

    if (clone_flags & CLONE_VFORK) {    /* CLONE_VFORK set: the child was created via vfork(), so set up a completion */
        p->vfork_done = &vfork;
        init_completion(&vfork);
        get_task_struct(p);
    }

    wake_up_new_task(p);    /* wake up the child */

    /* forking complete and child started to run, tell ptracer */
    if (unlikely(trace))
        ptrace_event_pid(trace, pid);

    if (clone_flags & CLONE_VFORK) {
        if (!wait_for_vfork_done(p, &vfork))
            ptrace_event_pid(PTRACE_EVENT_VFORK_DONE, pid);
    }

    put_pid(pid);
    return nr;
}
OK! As the function comment says, _do_fork is the main fork routine, and it is the function we analyze in this section.
- copy_process is the main worker of process creation; it is responsible for copying the parent's resources to the child. We analyze it in detail shortly.
- get_task_pid(p, PIDTYPE_PID): once copy_process returns successfully, the child exists; this takes a reference to its struct pid, and pid_vnr turns it into the numeric pid that is returned to user space.
- wake_up_new_task puts the child on a run queue; when it actually gets to run is up to the scheduler.
- wait_for_vfork_done(p, &vfork): if the CLONE_VFORK flag is set, the child must run first and the parent waits here.
- This is why, when a process is created with vfork, the child runs first and the parent only resumes afterwards (see the userspace sketch after this list).
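As a quick userspace illustration of the last two points (the value of nr and the vfork ordering), here is a minimal sketch; it is ordinary application code, not part of the kernel source discussed here:

    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static const char child_msg[] = "child runs first\n";

    int main(void)
    {
        pid_t pid = vfork();    /* the parent is blocked until the child calls _exit() or exec */

        if (pid == 0) {
            /* child: the "nr == 0" return path; stick to async-signal-safe calls here */
            write(STDOUT_FILENO, child_msg, sizeof(child_msg) - 1);
            _exit(0);
        }

        /* parent: nr is the child's pid; this line prints only after the child has exited */
        printf("parent resumes, child pid was %d\n", (int)pid);
        waitpid(pid, NULL, 0);
        return 0;
    }

With vfork, "child runs first" is always printed before the parent's line, because the parent sleeps in wait_for_vfork_done() until the child exits.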
Analysis of copy_process
copy_process is fairly long, so we go through it in pieces.
static __latent_entropy struct task_struct *copy_process(
                    unsigned long clone_flags,
                    unsigned long stack_start,
                    unsigned long stack_size,
                    int __user *child_tidptr,
                    struct pid *pid,
                    int trace,
                    unsigned long tls,
                    int node)
{
    int retval;
    struct task_struct *p;
    struct multiprocess_signals delayed;

    /*
     * Don't allow sharing the root directory with processes in a different
     * namespace
     */
    if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
        return ERR_PTR(-EINVAL);

    if ((clone_flags & (CLONE_NEWUSER|CLONE_FS)) == (CLONE_NEWUSER|CLONE_FS))
        return ERR_PTR(-EINVAL);

    /*
     * Thread groups must share signals as well, and detached threads
     * can only be started up within the thread group.
     */
    if ((clone_flags & CLONE_THREAD) && !(clone_flags & CLONE_SIGHAND))
        return ERR_PTR(-EINVAL);

    /*
     * Shared signal handlers imply shared VM. By way of the above,
     * thread groups also imply shared VM. Blocking this case allows
     * for various simplifications in other code.
     */
    if ((clone_flags & CLONE_SIGHAND) && !(clone_flags & CLONE_VM))
        return ERR_PTR(-EINVAL);

    /*
     * Siblings of global init remain as zombies on exit since they are
     * not reaped by their parent (swapper). To solve this and to avoid
     * multi-rooted process trees, prevent global and container-inits
     * from creating siblings.
     */
    if ((clone_flags & CLONE_PARENT) &&
            current->signal->flags & SIGNAL_UNKILLABLE)
        return ERR_PTR(-EINVAL);

    /*
     * If the new process will be in a different pid or user namespace
     * do not allow it to share a thread group with the forking task.
     */
    if (clone_flags & CLONE_THREAD) {
        if ((clone_flags & (CLONE_NEWUSER | CLONE_NEWPID)) ||
            (task_active_pid_ns(current) !=
                current->nsproxy->pid_ns_for_children))
            return ERR_PTR(-EINVAL);
    }
The function starts by checking the flags for conflicting combinations. A few of them briefly, not all (a small userspace demonstration follows this list):
- CLONE_NEWNS together with CLONE_FS: as the comment explains, filesystem information (the root directory and so on) must not be shared with a process in a different mount namespace; namespaces exist precisely to isolate resources between processes.
- CLONE_THREAD requires CLONE_SIGHAND: threads of one thread group must share their signal handlers.
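As a sanity check from user space, the following sketch deliberately passes CLONE_THREAD without CLONE_SIGHAND and should get EINVAL back from the check above. It uses the raw clone syscall; the argument order shown matches the x86-64 convention (the same order as the SYSCALL_DEFINE5 above), so treat the call site as illustrative rather than portable:

    #define _GNU_SOURCE
    #include <errno.h>
    #include <sched.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    int main(void)
    {
        /*
         * CLONE_THREAD without CLONE_SIGHAND trips the check in
         * copy_process(), so the call should fail with EINVAL and
         * no child is ever created.
         */
        long ret = syscall(SYS_clone, (unsigned long)CLONE_THREAD,
                           0UL,   /* newsp: 0 keeps the parent's stack layout */
                           NULL,  /* parent_tidptr */
                           NULL,  /* child_tidptr */
                           0UL);  /* tls */

        if (ret == -1)
            printf("clone failed as expected: %s\n", strerror(errno));
        else
            printf("unexpected success, child pid %ld\n", ret);
        return 0;
    }

Continuing in copy_process: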
    struct task_struct *p;

    if (signal_pending(current))
        goto fork_out;

    p = dup_task_struct(current, node);
- current is the parent. If the parent has a pending signal, the fork is abandoned via fork_out; for example, if the parent has just received kill -9, it clearly must not go on to create the child.
- dup_task_struct allocates a new task_struct for the child and copies the parent's task_struct into it.
static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
{
    struct task_struct *tsk;
    unsigned long *stack;
    struct vm_struct *stack_vm_area __maybe_unused;
    int err;

    if (node == NUMA_NO_NODE)
        node = tsk_fork_get_node(orig);
    tsk = alloc_task_struct_node(node);

    stack = alloc_thread_stack_node(tsk, node);
    stack_vm_area = task_stack_vm_area(tsk);

    err = arch_dup_task_struct(tsk, orig);

    tsk->stack = stack;

    setup_thread_stack(tsk, orig);
    clear_user_return_notifier(tsk);
    clear_tsk_need_resched(tsk);
    set_task_stack_end_magic(tsk);

    return tsk;
}
The dup_task_struct code above is slightly trimmed; some of the error handling has been removed.
- alloc_task_struct_node allocates the child's task_struct.
- alloc_thread_stack_node allocates the child's kernel stack. Both allocators are backed by slab caches, which speeds up process creation.
- arch_dup_task_struct copies the contents of the parent's task_struct into the child's.
- tsk->stack = stack: points the child's task_struct at its new kernel stack.
- setup_thread_stack ties thread_info to the kernel stack; on ARM64 thread_info is stored inside task_struct, so there this function does nothing.
- clear_tsk_need_resched clears the child's need-resched flag.
- set_task_stack_end_magic writes a magic value at the end of the child's kernel stack, used for debugging kernel stack overruns (shown below for reference).
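For reference, set_task_stack_end_magic itself is tiny; in kernels of roughly this generation it looks like the following (quoted from memory of kernel/fork.c, so check your own tree for the exact form):

    void set_task_stack_end_magic(struct task_struct *tsk)
    {
        unsigned long *stackend;

        stackend = end_of_stack(tsk);   /* lowest address of the stack area; the stack grows down toward it */
        *stackend = STACK_END_MAGIC;    /* for overflow detection */
    }

If a kernel stack overrun overwrites this word, later checks (for instance in the scheduler) can detect the corruption. Back in copy_process, the next steps are: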
    retval = copy_creds(p, clone_flags);
    if (retval < 0)
        goto bad_fork_free;

    /*
     * If multiple threads are within copy_process(), then this check
     * triggers too late. This doesn't hurt, the check is only there
     * to stop root fork bombs.
     */
    retval = -EAGAIN;
    if (nr_threads >= max_threads)
        goto bad_fork_cleanup_count;
int nr_threads; /* The idle threads do not count.. */
int max_threads; /* tunable limit on nr_threads */
- copy_creds is used to copy the parent's credentials to the child.
- nr_threads is the current number of tasks and max_threads is the tunable upper limit; if the current count reaches the limit, the fork fails with -EAGAIN and we bail out (reading the limit from user space is shown below).
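max_threads is exposed to user space as the kernel.threads-max sysctl; a minimal sketch for reading it (the /proc path is standard, the rest is just illustration):

    #include <stdio.h>

    int main(void)
    {
        /* max_threads is exported as the kernel.threads-max sysctl */
        FILE *f = fopen("/proc/sys/kernel/threads-max", "r");
        unsigned long threads_max;

        if (f && fscanf(f, "%lu", &threads_max) == 1)
            printf("kernel.threads-max = %lu\n", threads_max);
        if (f)
            fclose(f);
        return 0;
    }

After these checks, a long stretch of initialization follows: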
    INIT_LIST_HEAD(&p->children);
    INIT_LIST_HEAD(&p->sibling);
    rcu_copy_process(p);
    p->vfork_done = NULL;
    spin_lock_init(&p->alloc_lock);

    init_sigpending(&p->pending);

    p->utime = p->stime = p->gtime = 0;
#ifdef CONFIG_ARCH_HAS_SCALED_CPUTIME
    p->utimescaled = p->stimescaled = 0;
#endif
    prev_cputime_init(&p->prev_cputime);

#ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
    seqcount_init(&p->vtime.seqcount);
    p->vtime.starttime = 0;
    p->vtime.state = VTIME_INACTIVE;
#endif
    ......
The long block above (re)initializes the child's task_struct. Since the structure was just copied wholesale from the parent, the fields that must not be inherited are reset to fresh values here.
    /* Perform scheduler related setup. Assign this task to a CPU. */
    retval = sched_fork(clone_flags, p);
    if (retval)
        goto bad_fork_cleanup_policy;

    retval = perf_event_init_task(p);
    if (retval)
        goto bad_fork_cleanup_policy;
    retval = audit_alloc(p);
    if (retval)
        goto bad_fork_cleanup_perf;
    /* copy all the process information */
    shm_init_task(p);
    retval = security_task_alloc(p, clone_flags);
    if (retval)
        goto bad_fork_cleanup_audit;
    retval = copy_semundo(clone_flags, p);
    if (retval)
        goto bad_fork_cleanup_security;
    retval = copy_files(clone_flags, p);
    if (retval)
        goto bad_fork_cleanup_semundo;
    retval = copy_fs(clone_flags, p);
    if (retval)
        goto bad_fork_cleanup_files;
    retval = copy_sighand(clone_flags, p);
    if (retval)
        goto bad_fork_cleanup_fs;
    retval = copy_signal(clone_flags, p);
    if (retval)
        goto bad_fork_cleanup_sighand;
    retval = copy_mm(clone_flags, p);
    if (retval)
        goto bad_fork_cleanup_signal;
    retval = copy_namespaces(clone_flags, p);
    if (retval)
        goto bad_fork_cleanup_mm;
    retval = copy_io(clone_flags, p);
    if (retval)
        goto bad_fork_cleanup_namespaces;
    retval = copy_thread_tls(clone_flags, stack_start, stack_size, p, tls);
    if (retval)
        goto bad_fork_cleanup_io;
Next comes a long chain of copy operations. As explained when fork was introduced, fork duplicates the parent's resources for the child, and these copy_* calls are where that duplication actually happens. There are many of them and they all follow the same pattern, so we take a closer look at the ones we run into most often: sched_fork, copy_fs, copy_mm, plus a few related helpers below.
A newly created process has to take part in scheduling, so the scheduler-related parts of the child are set up first.
int sched_fork(unsigned long clone_flags, struct task_struct *p)
{
    unsigned long flags;

    __sched_fork(clone_flags, p);
    /*
     * We mark the process as NEW here. This guarantees that
     * nobody will actually run it, and a signal or other external
     * event cannot wake it up and insert it on the runqueue either.
     */
    p->state = TASK_NEW;

    /*
     * Make sure we do not leak PI boosting priority to the child.
     */
    p->prio = current->normal_prio;

    /*
     * Revert to default priority/policy on fork if requested.
     */
    if (unlikely(p->sched_reset_on_fork)) {
        if (task_has_dl_policy(p) || task_has_rt_policy(p)) {
            p->policy = SCHED_NORMAL;
            p->static_prio = NICE_TO_PRIO(0);
            p->rt_priority = 0;
        } else if (PRIO_TO_NICE(p->static_prio) < 0)
            p->static_prio = NICE_TO_PRIO(0);

        p->prio = p->normal_prio = __normal_prio(p);
        set_load_weight(p, false);

        /*
         * We don't need the reset flag anymore after the fork. It has
         * fulfilled its duty:
         */
        p->sched_reset_on_fork = 0;
    }

    if (dl_prio(p->prio))
        return -EAGAIN;
    else if (rt_prio(p->prio))
        p->sched_class = &rt_sched_class;
    else
        p->sched_class = &fair_sched_class;

    init_entity_runnable_average(&p->se);

    /*
     * The child is not yet in the pid-hash so no cgroup attach races,
     * and the cgroup is pinned to this child due to cgroup_fork()
     * is ran before sched_fork().
     *
     * Silence PROVE_RCU.
     */
    raw_spin_lock_irqsave(&p->pi_lock, flags);
    /*
     * We're setting the CPU for the first time, we don't migrate,
     * so use __set_task_cpu().
     */
    __set_task_cpu(p, smp_processor_id());
    if (p->sched_class->task_fork)
        p->sched_class->task_fork(p);
    raw_spin_unlock_irqrestore(&p->pi_lock, flags);

    init_task_preempt_count(p);

    return 0;
}
- __sched_fork initializes the scheduling-related fields of task_struct; for example, the scheduling entity is reset as shown below (se is short for sched_entity). Scheduling itself is covered in detail in the scheduling chapter:

    p->on_rq = 0;
    p->se.on_rq = 0;
    p->se.exec_start = 0;
    p->se.sum_exec_runtime = 0;
    p->se.prev_sum_exec_runtime = 0;
    p->se.nr_migrations = 0;
    p->se.vruntime = 0;
    INIT_LIST_HEAD(&p->se.group_node);

- p->state = TASK_NEW: marks this as a brand-new task that cannot be run or woken up yet.
- p->prio = current->normal_prio: the child's priority follows the current task's normal priority, so a priority-inheritance boost is not leaked to the child.
- sched_reset_on_fork: if set, the current task had called sched_setscheduler with SCHED_RESET_ON_FORK to change its priority and scheduling policy; the child's priority and policy are reverted to the defaults here (see the userspace sketch after this list).
- if (dl_prio(p->prio)): a DL (deadline) task is not allowed to fork a deadline child this way, so -EAGAIN is returned.
- else if (rt_prio(p->prio)): for an RT (real-time) task, the child's scheduling class is set with p->sched_class = &rt_sched_class.
- Otherwise the child is a normal task and gets p->sched_class = &fair_sched_class; "fair" is CFS, the completely fair scheduler used in current kernels.
- init_entity_runnable_average initializes the load tracking of the child's scheduling entity.
- p->sched_class->task_fork(p) invokes the task_fork hook of the chosen scheduling class; every class implements it to do its own per-fork setup for the new child.
- init_task_preempt_count initializes the child's preempt_count field, which encodes the preemption and interrupt nesting state.
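To observe sched_reset_on_fork from user space, a sketch along these lines works (it needs root or CAP_SYS_NICE to switch to SCHED_FIFO; SCHED_RESET_ON_FORK and the sched_setscheduler API are standard, the printed policy numbers are only for illustration):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        struct sched_param sp = { .sched_priority = 10 };

        /* make the parent SCHED_FIFO, but ask the kernel to reset the child on fork */
        if (sched_setscheduler(0, SCHED_FIFO | SCHED_RESET_ON_FORK, &sp) == -1) {
            perror("sched_setscheduler (needs CAP_SYS_NICE)");
            return 1;
        }

        if (fork() == 0) {
            /* sched_fork() reverted the child to SCHED_NORMAL: expect 0 (SCHED_OTHER) */
            printf("child policy  = %d\n", sched_getscheduler(0));
            _exit(0);
        }

        printf("parent policy = %d (SCHED_FIFO is %d)\n",
               sched_getscheduler(0) & ~SCHED_RESET_ON_FORK, SCHED_FIFO);
        wait(NULL);
        return 0;
    }

The next helper worth a look is copy_files: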
static int copy_files(unsigned long clone_flags, struct task_struct *tsk)
{
    struct files_struct *oldf, *newf;
    int error = 0;

    /*
     * A background process may not have any files ...
     */
    oldf = current->files;
    if (!oldf)
        goto out;

    if (clone_flags & CLONE_FILES) {
        atomic_inc(&oldf->count);
        goto out;
    }

    newf = dup_fd(oldf, &error);
    if (!newf)
        goto out;

    tsk->files = newf;
    error = 0;
out:
    return error;
}
- oldf = current->files: take the current (parent) task's open-file information; if it is NULL, the task may be a background task without any files, and there is nothing to copy.
- if (clone_flags & CLONE_FILES): the new task shares the parent's file table, so only the count reference counter is incremented. This is the pthread_create path.
- dup_fd(oldf, &error): not expanded here; it copies the parent's fd table, i.e. the open file descriptors, into the newly created task (a visible consequence is shown in the sketch below).
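One visible consequence of dup_fd, shown as a userspace sketch (the file path is just any readable file, nothing from the kernel source): the descriptor table is copied per process, but the copied descriptors still refer to the same open file description, so the file offset is shared between parent and child after fork:

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        char buf[4];
        int fd = open("/etc/hostname", O_RDONLY);   /* any readable file works */

        if (fd < 0)
            return 1;

        if (fork() == 0) {
            if (read(fd, buf, sizeof(buf)) < 0)      /* child advances the shared offset */
                _exit(1);
            _exit(0);
        }
        wait(NULL);

        /* dup_fd copied the descriptor, but the struct file (and its offset) is shared */
        printf("offset seen by parent: %ld\n", (long)lseek(fd, 0, SEEK_CUR));
        return 0;
    }

copy_fs is similar: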
static int copy_fs(unsigned long clone_flags, struct task_struct *tsk)
{
    struct fs_struct *fs = current->fs;
    if (clone_flags & CLONE_FS) {
        /* tsk->fs is already what we want */
        spin_lock(&fs->lock);
        if (fs->in_exec) {
            spin_unlock(&fs->lock);
            return -EAGAIN;
        }
        fs->users++;
        spin_unlock(&fs->lock);
        return 0;
    }
    tsk->fs = copy_fs_struct(fs);
    if (!tsk->fs)
        return -ENOMEM;
    return 0;
}
- If CLONE_FS is set, only fs->users is incremented, meaning one more task now shares the current filesystem information (see the sketch below).
- Otherwise copy_fs_struct allocates a fresh fs_struct for the new task and copies the current task's values into it, such as the current root and pwd directories.
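The difference is easy to see from user space (again just an illustrative sketch): a plain fork does not pass CLONE_FS, so the child gets its own fs_struct and a chdir in the child does not move the parent; threads created with CLONE_FS would share the change:

    #include <limits.h>
    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        char cwd[PATH_MAX];

        if (fork() == 0) {
            /* only the child's own (copied) fs_struct is updated here */
            if (chdir("/tmp") != 0)
                _exit(1);
            _exit(0);
        }
        wait(NULL);

        /* the parent's cwd is untouched because fork() does not pass CLONE_FS */
        if (getcwd(cwd, sizeof(cwd)))
            printf("parent cwd unchanged: %s\n", cwd);
        return 0;
    }

Next, copy_sighand: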
static int copy_sighand(unsigned long clone_flags, struct task_struct *tsk)
{
    struct sighand_struct *sig;

    if (clone_flags & CLONE_SIGHAND) {
        atomic_inc(&current->sighand->count);
        return 0;
    }
    sig = kmem_cache_alloc(sighand_cachep, GFP_KERNEL);
    rcu_assign_pointer(tsk->sighand, sig);
    if (!sig)
        return -ENOMEM;
    atomic_set(&sig->count, 1);
    spin_lock_irq(&current->sighand->siglock);
    memcpy(sig->action, current->sighand->action, sizeof(sig->action));
    spin_unlock_irq(&current->sighand->siglock);
    return 0;
}
- If CLONE_SIGHAND is set, the new task shares the parent's signal handlers, so only current->sighand->count, the reference count of the parent's handler table, is incremented.
- Otherwise a new sighand_struct is allocated and the parent's handler table (the action array) is copied into it (see the sketch below).
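A userspace sketch of the same distinction (illustrative only): a forked child works on its own copy of the handler table, so installing a handler in the child leaves the parent's disposition untouched, whereas a thread created with CLONE_SIGHAND would change it for the whole thread group:

    #include <signal.h>
    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static void handler(int sig) { (void)sig; }

    int main(void)
    {
        struct sigaction sa, cur;

        sa.sa_handler = handler;
        sigemptyset(&sa.sa_mask);
        sa.sa_flags = 0;

        if (fork() == 0) {
            sigaction(SIGUSR1, &sa, NULL);  /* changes only the child's copied sighand_struct */
            _exit(0);
        }
        wait(NULL);

        sigaction(SIGUSR1, NULL, &cur);     /* query the parent's disposition */
        printf("parent's SIGUSR1 handler still SIG_DFL: %s\n",
               cur.sa_handler == SIG_DFL ? "yes" : "no");
        return 0;
    }

copy_signal handles the process-wide signal_struct: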
static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
{
    struct signal_struct *sig;

    if (clone_flags & CLONE_THREAD)
        return 0;

    sig = kmem_cache_zalloc(signal_cachep, GFP_KERNEL);
    tsk->signal = sig;
    if (!sig)
        return -ENOMEM;

    sig->nr_threads = 1;
    atomic_set(&sig->live, 1);
    atomic_set(&sig->sigcnt, 1);

    /* list_add(thread_node, thread_head) without INIT_LIST_HEAD() */
    sig->thread_head = (struct list_head)LIST_HEAD_INIT(tsk->thread_node);
    tsk->thread_node = (struct list_head)LIST_HEAD_INIT(sig->thread_head);

    init_waitqueue_head(&sig->wait_chldexit);
    sig->curr_target = tsk;
    init_sigpending(&sig->shared_pending);
    INIT_HLIST_HEAD(&sig->multiprocess);
    seqlock_init(&sig->stats_lock);
    prev_cputime_init(&sig->prev_cputime);

#ifdef CONFIG_POSIX_TIMERS
    INIT_LIST_HEAD(&sig->posix_timers);
    hrtimer_init(&sig->real_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
    sig->real_timer.function = it_real_fn;
#endif

    task_lock(current->group_leader);
    memcpy(sig->rlim, current->signal->rlim, sizeof sig->rlim);
    task_unlock(current->group_leader);

    posix_cpu_timers_init_group(sig);

    tty_audit_fork(sig);
    sched_autogroup_fork(sig);

    sig->oom_score_adj = current->signal->oom_score_adj;
    sig->oom_score_adj_min = current->signal->oom_score_adj_min;

    mutex_init(&sig->cred_guard_mutex);

    return 0;
}
- If CLONE_THREAD is set, i.e. a thread is being created, the function returns right away: threads share the process-wide signal_struct.
- Otherwise a new signal_struct is allocated and then initialized.
- Some of its members are copied from the parent, for example the rlimits and the oom_score_adj values.
static int copy_mm(unsigned long clone_flags, struct task_struct *tsk)
{
    struct mm_struct *mm, *oldmm;
    int retval;

    tsk->mm = NULL;
    tsk->active_mm = NULL;

    /*
     * Are we cloning a kernel thread?
     *
     * We need to steal a active VM for that..
     */
    oldmm = current->mm;
    if (!oldmm)
        return 0;

    /* initialize the new vmacache entries */
    vmacache_flush(tsk);

    if (clone_flags & CLONE_VM) {
        mmget(oldmm);
        mm = oldmm;
        goto good_mm;
    }

    retval = -ENOMEM;
    mm = dup_mm(tsk);
    if (!mm)
        goto fail_nomem;

good_mm:
    tsk->mm = mm;
    tsk->active_mm = mm;
    return 0;

fail_nomem:
    return retval;
}
- If current->mm is NULL, the current task is a kernel thread. A kernel thread has no mm of its own and usually borrows another task's mm, so there is nothing to copy.
- If CLONE_VM is set, the new task shares the current task's mm; mmget only bumps the reference count.
- Otherwise dup_mm allocates a new mm_struct and copies the contents of the current task's mm_struct into it.
- dup_mm actually does quite a lot of work: it copies the current task's VMAs and the corresponding page table entries. This is where copy-on-write comes into play.
/*
* Allocate a new mm structure and copy contents from the
* mm structure of the passed in task structure.
*/
static struct mm_struct *dup_mm(struct task_struct *tsk)
{
    struct mm_struct *mm, *oldmm = current->mm;
    int err;

    mm = allocate_mm();
    if (!mm)
        goto fail_nomem;

    memcpy(mm, oldmm, sizeof(*mm));

    if (!mm_init(mm, tsk, mm->user_ns))
        goto fail_nomem;

    err = dup_mmap(mm, oldmm);
    if (err)
        goto free_pt;

    mm->hiwater_rss = get_mm_rss(mm);
    mm->hiwater_vm = mm->total_vm;

    if (mm->binfmt && !try_module_get(mm->binfmt->module))
        goto free_pt;

    return mm;

free_pt:
    /* don't put binfmt in mmput, we haven't got module yet */
    mm->binfmt = NULL;
    mmput(mm);

fail_nomem:
    return NULL;
}
- As the comment says, a new mm_struct is allocated and then filled from the parent's.
- allocate_mm() allocates the new mm_struct.
- memcpy(mm, oldmm, sizeof(*mm)) copies the parent's mm_struct wholesale.
- mm_init(mm, tsk, mm->user_ns) initializes the freshly allocated mm_struct; among other things it allocates a pgd (page global directory) for the new process, which is later needed to translate virtual addresses to physical addresses.
- dup_mmap copies the parent's VMAs and the corresponding PTE page table entries into the child; it does not copy the actual contents the VMAs map.
- After the copy, parent and child share the same write-protected physical pages. Whichever of them writes to such a page first triggers a copy-on-write fault, and the writer is given a newly allocated physical page. The sketch below shows this from user space.
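The classic userspace demonstration of this copy-on-write behavior (a sketch relying only on ordinary fork semantics; the variable name is arbitrary):

    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static int shared_value = 42;   /* one physical page shared after fork, until someone writes */

    int main(void)
    {
        if (fork() == 0) {
            shared_value = 100;     /* write fault: the child gets its own copy of the page */
            printf("child sees %d\n", shared_value);
            _exit(0);
        }
        wait(NULL);

        /* the parent's page was never modified from its point of view; the value is unchanged */
        printf("parent still sees %d\n", shared_value);
        return 0;
    }

This covers the main copy path of _do_fork; the remaining details are left for the second part.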