do_fork Implementation (Part 1)

Posted by Loopers

The previous sections covered how a process or a thread is created with fork, vfork, and pthread_create, and showed that all three ultimately create the new task through the clone system call.

Today we focus on what happens after clone. Everything before this was groundwork for this section.

 

Since fork, vfork, and pthread_create all funnel into the clone system call, let's follow that thread and continue the analysis.

SYSCALL_DEFINE5(clone, unsigned long, clone_flags, unsigned long, newsp,
         int __user *, parent_tidptr,
         int __user *, child_tidptr,
         unsigned long, tls)
{
    return _do_fork(clone_flags, newsp, 0, parent_tidptr, child_tidptr, tls);
}

The clone system call is implemented in fork.c; as you can see, clone simply calls _do_fork.

/*
 *  Ok, this is the main fork-routine.
 *
 * It copies the process, and if successful kick-starts
 * it and waits for it to finish using the VM if required.
 */
long _do_fork(unsigned long clone_flags,
          unsigned long stack_start,
          unsigned long stack_size,
          int __user *parent_tidptr,
          int __user *child_tidptr,
          unsigned long tls)
{
    struct completion vfork;
    struct pid *pid;
    struct task_struct *p;
    int trace = 0;
    long nr;
 
    /*
     * Determine whether and which event to report to ptracer.  When
     * called from kernel_thread or CLONE_UNTRACED is explicitly
     * requested, no event is reported; otherwise, report if the event
     * for the type of forking is enabled.
     */
    if (!(clone_flags & CLONE_UNTRACED)) {
        if (clone_flags & CLONE_VFORK)
            trace = PTRACE_EVENT_VFORK;
        else if ((clone_flags & CSIGNAL) != SIGCHLD)
            trace = PTRACE_EVENT_CLONE;
        else
            trace = PTRACE_EVENT_FORK;
 
        if (likely(!ptrace_event_enabled(current, trace)))
            trace = 0;
    }
 
    p = copy_process(clone_flags, stack_start, stack_size,
             child_tidptr, NULL, trace, tls, NUMA_NO_NODE);
    add_latent_entropy();
 
    if (IS_ERR(p))
        return PTR_ERR(p);
 
    /*
     * Do this prior waking up the new thread - the thread pointer
     * might get invalid after that point, if the thread exits quickly.
     */
    trace_sched_process_fork(current, p);
 
    pid = get_task_pid(p, PIDTYPE_PID);
    nr = pid_vnr(pid);   /* nr is the fork return value: 0 in the child, the child's pid in the parent */
 
    if (clone_flags & CLONE_PARENT_SETTID)
        put_user(nr, parent_tidptr);
 
    if (clone_flags & CLONE_VFORK) {       /* CLONE_VFORK means the child was created with vfork, so a completion is set up */
        p->vfork_done = &vfork;
        init_completion(&vfork);
        get_task_struct(p);
    }
 
    wake_up_new_task(p);                   /* wake up the child */
 
    /* forking complete and child started to run, tell ptracer */
    if (unlikely(trace))
        ptrace_event_pid(trace, pid);
 
    if (clone_flags & CLONE_VFORK) {
        if (!wait_for_vfork_done(p, &vfork))
            ptrace_event_pid(PTRACE_EVENT_VFORK_DONE, pid);
    }
 
    put_pid(pid);
    return nr;
}

OK! As the comment says, _do_fork is the main fork routine, and it is the focus of today's analysis.

  • copy_process is the main function for creating the new process; its job is to copy the parent's resources. We will analyze it in detail in a moment.
  • get_task_pid(p, PIDTYPE_PID): once copy_process returns successfully the child exists; this takes a reference to its pid, and pid_vnr then yields the numeric pid that will be returned to the parent.
  • wake_up_new_task: puts the child onto a run queue; when it actually gets to run is up to the scheduler.
  • wait_for_vfork_done(p, &vfork): if the CLONE_VFORK flag is present, the child must run first and the parent waits here.
  • This is exactly why, when a process is created with vfork, the child runs first and the parent only resumes afterwards (see the small demo below).
 

copy_process analysis

The implementation of copy_process is fairly long, so we will go through it in segments.

static __latent_entropy struct task_struct *copy_process(
                    unsigned long clone_flags,
                    unsigned long stack_start,
                    unsigned long stack_size,
                    int __user *child_tidptr,
                    struct pid *pid,
                    int trace,
                    unsigned long tls,
                    int node)
{
    int retval;
    struct task_struct *p;
    struct multiprocess_signals delayed;
 
    /*
     * Don't allow sharing the root directory with processes in a different
     * namespace
     */
    if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
        return ERR_PTR(-EINVAL);
 
    if ((clone_flags & (CLONE_NEWUSER|CLONE_FS)) == (CLONE_NEWUSER|CLONE_FS))
        return ERR_PTR(-EINVAL);
 
    /*
     * Thread groups must share signals as well, and detached threads
     * can only be started up within the thread group.
     */
    if ((clone_flags & CLONE_THREAD) && !(clone_flags & CLONE_SIGHAND))
        return ERR_PTR(-EINVAL);
 
    /*
     * Shared signal handlers imply shared VM. By way of the above,
     * thread groups also imply shared VM. Blocking this case allows
     * for various simplifications in other code.
     */
    if ((clone_flags & CLONE_SIGHAND) && !(clone_flags & CLONE_VM))
        return ERR_PTR(-EINVAL);
 
    /*
     * Siblings of global init remain as zombies on exit since they are
     * not reaped by their parent (swapper). To solve this and to avoid
     * multi-rooted process trees, prevent global and container-inits
     * from creating siblings.
     */
    if ((clone_flags & CLONE_PARENT) &&
                current->signal->flags & SIGNAL_UNKILLABLE)
        return ERR_PTR(-EINVAL);
 
    /*
     * If the new process will be in a different pid or user namespace
     * do not allow it to share a thread group with the forking task.
     */
    if (clone_flags & CLONE_THREAD) {
        if ((clone_flags & (CLONE_NEWUSER | CLONE_NEWPID)) ||
            (task_active_pid_ns(current) !=
                current->nsproxy->pid_ns_for_children))
            return ERR_PTR(-EINVAL);
    }

The function starts by checking whether the flags passed in conflict with one another. Let's describe a few of them rather than all:

  • CLONE_NEWNS and CLONE_FS: as the comment says, the filesystem information (root directory and so on) must not be shared with processes in a different namespace; namespaces exist precisely to isolate resources between processes.
  • CLONE_THREAD and CLONE_SIGHAND must be used together: threads have to share the signal handlers (a small demo follows).
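As a quick illustration (a hedged user-space sketch of our own, not part of the post's kernel code), passing CLONE_THREAD without CLONE_SIGHAND makes these first checks in copy_process() fail, and clone(2) returns EINVAL:

#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static int child_fn(void *arg)
{
    return 0;                   /* never runs: clone() fails before the child exists */
}

int main(void)
{
    size_t stack_size = 64 * 1024;
    char *stack = malloc(stack_size);

    /* CLONE_THREAD requires CLONE_SIGHAND (which itself requires CLONE_VM),
     * so copy_process() rejects this combination with -EINVAL. */
    if (clone(child_fn, stack + stack_size, CLONE_THREAD, NULL) == -1)
        printf("clone failed as expected: %s\n", strerror(errno));

    free(stack);
    return 0;
}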
struct task_struct *p;
 
    if (signal_pending(current))
        goto fork_out;
 
    p = dup_task_struct(current, node);

  • current is the parent process. If the parent has a pending signal, say it just received a kill -9, it must bail out through fork_out, because the child cannot be created in that case.
  • dup_task_struct allocates a new task_struct for the child and copies the parent's task_struct into it.
static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
{
    struct task_struct *tsk;
    unsigned long *stack;
    struct vm_struct *stack_vm_area __maybe_unused;
    int err;
 
    if (node == NUMA_NO_NODE)
        node = tsk_fork_get_node(orig);
    tsk = alloc_task_struct_node(node);
 
    stack = alloc_thread_stack_node(tsk, node);
 
    stack_vm_area = task_stack_vm_area(tsk);
 
    err = arch_dup_task_struct(tsk, orig);
 
    tsk->stack = stack;
 
    setup_thread_stack(tsk, orig);
    clear_user_return_notifier(tsk);
    clear_tsk_need_resched(tsk);
    set_task_stack_end_magic(tsk);
 
    return tsk;
}

The dup_task_struct code above has been trimmed a bit; the error-handling checks were removed for clarity.

  • alloc_task_struct_node allocates a task_struct for the child.
  • alloc_thread_stack_node allocates a kernel stack for the child. Both allocations come from slab caches, which speeds up process creation.
  • arch_dup_task_struct copies the contents of the parent's task_struct into the child's.
  • tsk->stack = stack: point the child at its new kernel stack.
  • setup_thread_stack sets up the relationship between thread_info and the kernel stack. On ARM64, thread_info is stored inside task_struct, so this function is empty there.
  • clear_tsk_need_resched: clears the child's need-resched flag.
  • set_task_stack_end_magic: writes a magic number at the end of the child's kernel stack, used to detect kernel stack overruns for debugging (see the snippet below).
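For reference, set_task_stack_end_magic is tiny. This is roughly its body in kernel/fork.c of this kernel era (shown for illustration; details may differ slightly between versions):

#define STACK_END_MAGIC    0x57AC6E9D    /* from include/uapi/linux/magic.h */

void set_task_stack_end_magic(struct task_struct *tsk)
{
    unsigned long *stackend;

    stackend = end_of_stack(tsk);    /* lowest usable word of the kernel stack */
    *stackend = STACK_END_MAGIC;     /* for overflow detection */
}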
 retval = copy_creds(p, clone_flags);
    if (retval < 0)
        goto bad_fork_free;
 
    /*
     * If multiple threads are within copy_process(), then this check
     * triggers too late. This doesn't hurt, the check is only there
     * to stop root fork bombs.
     */
    retval = -EAGAIN;
    if (nr_threads >= max_threads)
        goto bad_fork_cleanup_count;
 
 
int nr_threads;            /* The idle threads do not count.. */
int max_threads;        /* tunable limit on nr_threads */
  • copy_creds copies the parent's credentials to the child.
  • nr_threads is the current number of threads in the system and max_threads is the tunable upper limit. If the limit is reached, child creation fails with -EAGAIN and the fork flow is abandoned (max_threads is visible from user space, see below).
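max_threads is exported to user space as /proc/sys/kernel/threads-max. A quick sketch of our own for reading it:

#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/sys/kernel/threads-max", "r");
    unsigned long threads_max;

    /* this is the limit that makes copy_process() fail with -EAGAIN */
    if (f && fscanf(f, "%lu", &threads_max) == 1)
        printf("max_threads: %lu\n", threads_max);
    if (f)
        fclose(f);
    return 0;
}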
INIT_LIST_HEAD(&p->children);
    INIT_LIST_HEAD(&p->sibling);
    rcu_copy_process(p);
    p->vfork_done = NULL;
    spin_lock_init(&p->alloc_lock);
 
    init_sigpending(&p->pending);
 
    p->utime = p->stime = p->gtime = 0;
#ifdef CONFIG_ARCH_HAS_SCALED_CPUTIME
    p->utimescaled = p->stimescaled = 0;
#endif
    prev_cputime_init(&p->prev_cputime);
 
#ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
    seqcount_init(&p->vtime.seqcount);
    p->vtime.starttime = 0;
    p->vtime.state = VTIME_INACTIVE;
#endif
   
   ......

The next long stretch initializes fields of the child's task_struct. The child's task_struct was just copied wholesale from the parent, so any field that must not simply be inherited is reset or re-initialized here.

/* Perform scheduler related setup. Assign this task to a CPU. */
retval = sched_fork(clone_flags, p);
if (retval)
    goto bad_fork_cleanup_policy;
 
retval = perf_event_init_task(p);
if (retval)
    goto bad_fork_cleanup_policy;
retval = audit_alloc(p);
if (retval)
    goto bad_fork_cleanup_perf;
/* copy all the process information */
shm_init_task(p);
retval = security_task_alloc(p, clone_flags);
if (retval)
    goto bad_fork_cleanup_audit;
retval = copy_semundo(clone_flags, p);
if (retval)
    goto bad_fork_cleanup_security;
retval = copy_files(clone_flags, p);
if (retval)
    goto bad_fork_cleanup_semundo;
retval = copy_fs(clone_flags, p);
if (retval)
    goto bad_fork_cleanup_files;
retval = copy_sighand(clone_flags, p);
if (retval)
    goto bad_fork_cleanup_fs;
retval = copy_signal(clone_flags, p);
if (retval)
    goto bad_fork_cleanup_sighand;
retval = copy_mm(clone_flags, p);
if (retval)
    goto bad_fork_cleanup_signal;
retval = copy_namespaces(clone_flags, p);
if (retval)
    goto bad_fork_cleanup_mm;
retval = copy_io(clone_flags, p);
if (retval)
    goto bad_fork_cleanup_namespaces;
retval = copy_thread_tls(clone_flags, stack_start, stack_size, p, tls);
if (retval)
    goto bad_fork_cleanup_io;

Next comes a long run of copy operations. When we discussed fork earlier we said that fork duplicates the parent's resources wholesale; this is where that copying actually happens.

There are a lot of copy steps, and almost all of them duplicate some resource of the parent. Let's pick a few that come up often and look at them in detail: sched_fork, copy_fs, copy_mm, and a few related ones.

 

The newly created process obviously has to take part in scheduling, so the scheduler-related fields of the child are initialized first.

int sched_fork(unsigned long clone_flags, struct task_struct *p)
{
    unsigned long flags;
 
    __sched_fork(clone_flags, p);
    /*
     * We mark the process as NEW here. This guarantees that
     * nobody will actually run it, and a signal or other external
     * event cannot wake it up and insert it on the runqueue either.
     */
    p->state = TASK_NEW;
 
    /*
     * Make sure we do not leak PI boosting priority to the child.
     */
    p->prio = current->normal_prio;
 
    /*
     * Revert to default priority/policy on fork if requested.
     */
    if (unlikely(p->sched_reset_on_fork)) {
        if (task_has_dl_policy(p) || task_has_rt_policy(p)) {
            p->policy = SCHED_NORMAL;
            p->static_prio = NICE_TO_PRIO(0);
            p->rt_priority = 0;
        } else if (PRIO_TO_NICE(p->static_prio) < 0)
            p->static_prio = NICE_TO_PRIO(0);
 
        p->prio = p->normal_prio = __normal_prio(p);
        set_load_weight(p, false);
 
        /*
         * We don't need the reset flag anymore after the fork. It has
         * fulfilled its duty:
         */
        p->sched_reset_on_fork = 0;
    }
 
    if (dl_prio(p->prio))
        return -EAGAIN;
    else if (rt_prio(p->prio))
        p->sched_class = &rt_sched_class;
    else
        p->sched_class = &fair_sched_class;
 
    init_entity_runnable_average(&p->se);
 
    /*
     * The child is not yet in the pid-hash so no cgroup attach races,
     * and the cgroup is pinned to this child due to cgroup_fork()
     * is ran before sched_fork().
     *
     * Silence PROVE_RCU.
     */
    raw_spin_lock_irqsave(&p->pi_lock, flags);
    /*
     * We're setting the CPU for the first time, we don't migrate,
     * so use __set_task_cpu().
     */
    __set_task_cpu(p, smp_processor_id());
    if (p->sched_class->task_fork)
        p->sched_class->task_fork(p);
    raw_spin_unlock_irqrestore(&p->pi_lock, flags);
    init_task_preempt_count(p);
    return 0;
}
  • __sched_fork: initializes the scheduling-related fields of the task_struct. For example, the snippet below initializes the scheduling entity (se is short for sched_entity). Scheduling itself is described in detail in the scheduling chapter.

p->on_rq            = 0;
 
p->se.on_rq            = 0;
p->se.exec_start        = 0;
p->se.sum_exec_runtime        = 0;
p->se.prev_sum_exec_runtime    = 0;
p->se.nr_migrations        = 0;
p->se.vruntime            = 0;
INIT_LIST_HEAD(&p->se.group_node);

  • p->state = TASK_NEW: marks this as a newly created task.
  • p->prio = current->normal_prio: the child's priority follows the current (parent) process.

  • sched_reset_on_fork: if set, it means the current process changed its priority and scheduling policy via sched_setscheduler with the reset-on-fork flag, and the child's priority and policy must be reverted to the defaults here (a user-space demo follows this list).

  • if (dl_prio(p->prio)): if the current task is a deadline (DL) task, return -EAGAIN, because deadline tasks are not allowed to fork children.

  • if (rt_prio(p->prio)): if the current task is a real-time (RT) task, set its scheduling class to p->sched_class = &rt_sched_class;.

  • Otherwise it is a normal task, and the scheduling class becomes p->sched_class = &fair_sched_class;. "fair" refers to CFS, the completely fair scheduling algorithm used for normal tasks in current kernels.

  • init_entity_runnable_average: initializes the load tracking of the new scheduling entity.

  • p->sched_class->task_fork(p): calls the task_fork method of the task's scheduling class. Every scheduling class implements this hook to do its own scheduling setup for the newly forked child.

  • init_task_preempt_count: initializes the child's preempt_count field, which tracks preemption disabling, interrupt nesting, and so on.
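The sched_reset_on_fork behaviour can be observed from user space with the SCHED_RESET_ON_FORK flag of sched_setscheduler. A hedged sketch of our own (it needs root or CAP_SYS_NICE to switch to SCHED_FIFO):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef SCHED_RESET_ON_FORK
#define SCHED_RESET_ON_FORK 0x40000000   /* in case the libc headers lack the define */
#endif

int main(void)
{
    struct sched_param sp = { .sched_priority = 10 };

    /* the parent becomes SCHED_FIFO but asks the kernel to reset the policy on fork */
    if (sched_setscheduler(0, SCHED_FIFO | SCHED_RESET_ON_FORK, &sp) == -1) {
        perror("sched_setscheduler");
        return 1;
    }

    if (fork() == 0) {
        /* the sched_reset_on_fork branch in sched_fork() put us back to SCHED_NORMAL */
        printf("child policy:  %d (SCHED_OTHER = %d)\n",
               sched_getscheduler(0), SCHED_OTHER);
        _exit(0);
    }
    wait(NULL);
    printf("parent policy: %d (SCHED_FIFO = %d)\n",
           sched_getscheduler(0), SCHED_FIFO);
    return 0;
}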

static int copy_files(unsigned long clone_flags, struct task_struct *tsk)
{
    struct files_struct *oldf, *newf;
    int error = 0;
 
    /*
     * A background process may not have any files ...
     */
    oldf = current->files;
    if (!oldf)
        goto out;
 
    if (clone_flags & CLONE_FILES) {
        atomic_inc(&oldf->count);
        goto out;
    }
 
    newf = dup_fd(oldf, &error);
    if (!newf)
        goto out;
 
    tsk->files = newf;
    error = 0;
out:
    return error;
}
  • oldf = current->files: fetch the current process's open-file information; if it is NULL this may be a background/kernel task with no files at all.
  • if (clone_flags & CLONE_FILES): when CLONE_FILES is set, the new task shares the file table with the current process, so only the reference count is incremented. This is the pthread_create path.
  • dup_fd(oldf, &error): not expanded here; it copies the current process's file descriptors, i.e. its open files, into the newly created process (see the small demo below).
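A small user-space sketch of our own for the dup_fd case (it assumes /etc/hostname exists): fork() does not pass CLONE_FILES, so the child gets its own copy of the descriptor table and closing a descriptor there does not affect the parent. pthread_create, which does pass CLONE_FILES, would share the single table instead.

#include <fcntl.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/etc/hostname", O_RDONLY);

    if (fork() == 0) {
        close(fd);              /* only the child's copy of the fd table is touched */
        _exit(0);
    }
    wait(NULL);

    char buf[64];
    ssize_t n = read(fd, buf, sizeof(buf));   /* still valid in the parent */
    printf("parent read %zd bytes after the child closed its copy\n", n);
    close(fd);
    return 0;
}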
static int copy_fs(unsigned long clone_flags, struct task_struct *tsk)
{
    struct fs_struct *fs = current->fs;
    if (clone_flags & CLONE_FS) {
        /* tsk->fs is already what we want */
        spin_lock(&fs->lock);
        if (fs->in_exec) {
            spin_unlock(&fs->lock);
            return -EAGAIN;
        }
        fs->users++;
        spin_unlock(&fs->lock);
        return 0;
    }
    tsk->fs = copy_fs_struct(fs);
    if (!tsk->fs)
        return -ENOMEM;
    return 0;
}
  • If CLONE_FS is set, only fs->users is incremented, meaning one more task is now sharing the current filesystem information.
  • Otherwise copy_fs_struct allocates a new fs_struct for the new task and copies the current one into it, including the root directory, the current working directory (pwd), and so on (demo below).
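The difference is easy to see from user space (a minimal sketch of our own): a forked child has its own fs_struct, so chdir() in the child does not move the parent's working directory, whereas threads sharing it via CLONE_FS would all move together.

#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    char cwd[256];

    if (fork() == 0) {
        chdir("/tmp");          /* updates only the child's fs_struct */
        _exit(0);
    }
    wait(NULL);

    getcwd(cwd, sizeof(cwd));
    printf("parent cwd is still: %s\n", cwd);
    return 0;
}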
static int copy_sighand(unsigned long clone_flags, struct task_struct *tsk)
{
    struct sighand_struct *sig;
 
    if (clone_flags & CLONE_SIGHAND) {
        atomic_inc(&current->sighand->count);
        return 0;
    }
    sig = kmem_cache_alloc(sighand_cachep, GFP_KERNEL);
    rcu_assign_pointer(tsk->sighand, sig);
    if (!sig)
        return -ENOMEM;
 
    atomic_set(&sig->count, 1);
    spin_lock_irq(&current->sighand->siglock);
    memcpy(sig->action, current->sighand->action, sizeof(sig->action));
    spin_unlock_irq(&current->sighand->siglock);
    return 0;
}
  • When CLONE_SIGHAND is set, the new task shares the signal handlers with the current process, so only the reference count current->sighand->count is incremented.
  • Otherwise a new sighand_struct is allocated and the current process's signal handler table (action) is copied into it (demo below).
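Again this can be demonstrated from user space (a sketch of our own): since fork() does not pass CLONE_SIGHAND, the child only modifies its own copy of the handler table, and the parent's handler stays installed.

#include <signal.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

static void handler(int sig) { (void)sig; }

int main(void)
{
    struct sigaction sa;

    signal(SIGUSR1, handler);        /* copied into the child's sighand_struct by fork */

    if (fork() == 0) {
        signal(SIGUSR1, SIG_IGN);    /* changes only the child's copy */
        _exit(0);
    }
    wait(NULL);

    sigaction(SIGUSR1, NULL, &sa);
    printf("parent handler still installed: %s\n",
           sa.sa_handler == handler ? "yes" : "no");
    return 0;
}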
static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
{
    struct signal_struct *sig;
 
    if (clone_flags & CLONE_THREAD)
        return 0;
 
    sig = kmem_cache_zalloc(signal_cachep, GFP_KERNEL);
    tsk->signal = sig;
    if (!sig)
        return -ENOMEM;
 
    sig->nr_threads = 1;
    atomic_set(&sig->live, 1);
    atomic_set(&sig->sigcnt, 1);
 
    /* list_add(thread_node, thread_head) without INIT_LIST_HEAD() */
    sig->thread_head = (struct list_head)LIST_HEAD_INIT(tsk->thread_node);
    tsk->thread_node = (struct list_head)LIST_HEAD_INIT(sig->thread_head);
 
    init_waitqueue_head(&sig->wait_chldexit);
    sig->curr_target = tsk;
    init_sigpending(&sig->shared_pending);
    INIT_HLIST_HEAD(&sig->multiprocess);
    seqlock_init(&sig->stats_lock);
    prev_cputime_init(&sig->prev_cputime);
 
#ifdef CONFIG_POSIX_TIMERS
    INIT_LIST_HEAD(&sig->posix_timers);
    hrtimer_init(&sig->real_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
    sig->real_timer.function = it_real_fn;
#endif
 
    task_lock(current->group_leader);
    memcpy(sig->rlim, current->signal->rlim, sizeof sig->rlim);
    task_unlock(current->group_leader);
 
    posix_cpu_timers_init_group(sig);
 
    tty_audit_fork(sig);
    sched_autogroup_fork(sig);
 
    sig->oom_score_adj = current->signal->oom_score_adj;
    sig->oom_score_adj_min = current->signal->oom_score_adj_min;
 
    mutex_init(&sig->cred_guard_mutex);
 
    return 0;
}
  • If CLONE_THREAD is set, i.e. we are creating a thread, return immediately: threads share the signal state of their process.
  • Otherwise a new signal_struct is allocated and initialized.
  • Some of its members are copied from the parent, for example the OOM adjustment values (see the small demo below).
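One of those copied fields is easy to observe (a sketch of our own): oom_score_adj lives in signal_struct, and right after fork() the child reports the same value in /proc/self/oom_score_adj as the parent.

#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

static long read_oom_score_adj(void)
{
    FILE *f = fopen("/proc/self/oom_score_adj", "r");
    long val = 0;

    if (f) {
        fscanf(f, "%ld", &val);
        fclose(f);
    }
    return val;
}

int main(void)
{
    printf("parent oom_score_adj: %ld\n", read_oom_score_adj());
    if (fork() == 0) {
        printf("child  oom_score_adj: %ld\n", read_oom_score_adj());  /* same value */
        _exit(0);
    }
    wait(NULL);
    return 0;
}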
static int copy_mm(unsigned long clone_flags, struct task_struct *tsk)
{
    struct mm_struct *mm, *oldmm;
    int retval;
 
    tsk->mm = NULL;
    tsk->active_mm = NULL;
 
    /*
     * Are we cloning a kernel thread?
     *
     * We need to steal a active VM for that..
     */
    oldmm = current->mm;
    if (!oldmm)
        return 0;
 
    /* initialize the new vmacache entries */
    vmacache_flush(tsk);
 
    if (clone_flags & CLONE_VM) {
        mmget(oldmm);
        mm = oldmm;
        goto good_mm;
    }
 
    retval = -ENOMEM;
    mm = dup_mm(tsk);
    if (!mm)
        goto fail_nomem;
 
good_mm:
    tsk->mm = mm;
    tsk->active_mm = mm;
    return 0;
 
fail_nomem:
    return retval;
}
  • If the current process's mm_struct is NULL, the caller is a kernel thread; a kernel thread normally borrows the mm of another process.
  • If CLONE_VM is set, the new task shares the current process's mm.
  • Otherwise dup_mm allocates a new mm_struct and copies the contents of the current one into it.
  • dup_mm actually does quite a lot: it copies the current process's VMAs and the page table entries as well, and this is where copy-on-write comes into play.
/*
 * Allocate a new mm structure and copy contents from the
 * mm structure of the passed in task structure.
 */
static struct mm_struct *dup_mm(struct task_struct *tsk)
{
    struct mm_struct *mm, *oldmm = current->mm;
    int err;
 
    mm = allocate_mm();
    if (!mm)
        goto fail_nomem;
 
    memcpy(mm, oldmm, sizeof(*mm));
 
    if (!mm_init(mm, tsk, mm->user_ns))
        goto fail_nomem;
 
    err = dup_mmap(mm, oldmm);
    if (err)
        goto free_pt;
 
    mm->hiwater_rss = get_mm_rss(mm);
    mm->hiwater_vm = mm->total_vm;
 
    if (mm->binfmt && !try_module_get(mm->binfmt->module))
        goto free_pt;
 
    return mm;
 
free_pt:
    /* don't put binfmt in mmput, we haven't got module yet */
    mm->binfmt = NULL;
    mmput(mm);
 
fail_nomem:
    return NULL;
}
  • As the comment says: allocate a new mm structure and copy the contents of the current one into it.
  • allocate_mm(): allocate the new mm_struct.
  • memcpy(mm, oldmm, sizeof(*mm)): perform the copy.
  • mm_init(mm, tsk, mm->user_ns): initialize the freshly allocated mm_struct; among other things it allocates a pgd (page global directory) for the process, which is used later when translating virtual addresses to physical addresses.
  • dup_mmap: copies the parent's VMAs and the corresponding PTE page table entries into the child; the actual contents behind the VMAs are not copied.
  • After the copy, parent and child share the same physical pages. As soon as either side writes to such a page the sharing is broken, and the writer gets a freshly allocated physical page. This is copy-on-write (a small demo follows).
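From user space, the effect of the copy-on-write mapping set up by dup_mmap looks like this (a minimal sketch of our own): parent and child start out sharing the same physical pages, and the first write gives the writer a private copy, so the other side never sees the change.

#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int global = 1;

int main(void)
{
    if (fork() == 0) {
        global = 100;           /* write fault: the child gets its own copy of the page */
        _exit(0);
    }
    wait(NULL);
    printf("parent still sees global = %d\n", global);   /* prints 1 */
    return 0;
}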

   [Figure: diagram (from the internet, as credited in the original post) illustrating copy-on-write after fork]

 

