CFS Scheduler: Main Code Analysis (Part 2)

Posted 2022-06-11 Loopers


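In the previous article we analyzed the main CFS code. The topics covered were:

How the scheduler initializes a process when it is created
How a process is added to the CFS run queue
How the next process to run is selected once processes are on the CFS run queue

This article continues to follow the life cycle of a process and analyzes how a process is preempted, how it goes to sleep, and how it is scheduled out.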

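scheduler_tick (periodic scheduling)

Periodic scheduling means that on every tick the Linux kernel updates the running time of the current process and checks whether the current process needs to be scheduled out.

The timer interrupt handler calls update_process_times, which eventually calls the scheduler's scheduler_tick function: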

void scheduler_tick(void)
{
	int cpu = smp_processor_id();
	struct rq *rq = cpu_rq(cpu);
	struct task_struct *curr = rq->curr;
	struct rq_flags rf;

	sched_clock_tick();

	rq_lock(rq, &rf);

	update_rq_clock(rq);
	curr->sched_class->task_tick(rq, curr, 0);
	cpu_load_update_active(rq);
	calc_global_load_tick(rq);
	psi_task_tick(rq);

	rq_unlock(rq, &rf);

	perf_event_task_tick();

#ifdef CONFIG_SMP
	rq->idle_balance = idle_cpu(cpu);
	trigger_load_balance(rq);
#endif
}

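It gets the run queue rq of the current CPU and then calls the task_tick function of the current process's scheduling class (sched_class). Only the CFS scheduling class is described here.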
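The dispatch through sched_class is an ordinary table of function pointers. The following is a minimal user-space sketch of that pattern, not kernel code: every toy_ name is hypothetical, and the real struct sched_class has many more callbacks and a different task_tick signature.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel types. */
struct toy_task;

struct toy_sched_class {
	const char *name;
	/* Called on every tick for the task that is currently running. */
	void (*task_tick)(struct toy_task *curr);
};

struct toy_task {
	const struct toy_sched_class *sched_class;
	unsigned long ticks_seen;
};

/* Stand-in for task_tick_fair(): here it only counts the tick. */
static void toy_task_tick_fair(struct toy_task *curr)
{
	curr->ticks_seen++;
}

static const struct toy_sched_class toy_fair_class = {
	.name = "fair",
	.task_tick = toy_task_tick_fair,
};

/* Mirrors the dispatch line curr->sched_class->task_tick(...) above. */
static void toy_scheduler_tick(struct toy_task *curr)
{
	curr->sched_class->task_tick(curr);
}
```

The caller never names the fair class directly. A task that points at a different class table would take a different tick path through the same call.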

static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
{
	struct cfs_rq *cfs_rq;
	struct sched_entity *se = &curr->se;

	for_each_sched_entity(se) {
		cfs_rq = cfs_rq_of(se);
		entity_tick(cfs_rq, se, queued);
	}

	if (static_branch_unlikely(&sched_numa_balancing))
		task_tick_numa(rq, curr);

	update_misfit_status(curr, rq);
	update_overutilized_status(task_rq(curr));
}

From the current task_struct it gets the scheduling entity se, from se it gets the CFS run queue, and then entity_tick does the rest of the work.
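When group scheduling is enabled, for_each_sched_entity walks from the task's entity up through its parent group entities, so entity_tick runs once per level. The sketch below imitates that walk in user space. The toy_ names are hypothetical and the structure is reduced to a parent pointer and a counter.

```c
#include <stddef.h>

/* Hypothetical, reduced scheduling entity: only the parent link matters. */
struct toy_se {
	struct toy_se *parent;	/* NULL at the top level */
	unsigned int ticks;	/* how often this level has been ticked */
};

/* Same shape as the group-scheduling variant of for_each_sched_entity. */
#define toy_for_each_sched_entity(se) \
	for (; (se); (se) = (se)->parent)

/* Ticks the entity and every ancestor; returns the number of levels visited. */
static unsigned int toy_task_tick(struct toy_se *se)
{
	unsigned int levels = 0;

	toy_for_each_sched_entity(se) {
		se->ticks++;	/* stand-in for entity_tick(cfs_rq_of(se), se, ...) */
		levels++;
	}
	return levels;
}
```

For a task inside one group under the root, the walk visits three entities. Without group scheduling the loop body runs exactly once.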

static void
entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
{
	/*
	 * Update run-time statistics of the 'current'.
	 */
	update_curr(cfs_rq);

	/*
	 * Ensure that runnable average is periodically updated.
	 */
	update_load_avg(cfs_rq, curr, UPDATE_TG);
	update_cfs_group(curr);

	if (cfs_rq->nr_running > 1)
		check_preempt_tick(cfs_rq, curr);
}

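update_curr was analyzed earlier. It updates the execution time and vruntime of the current process, as well as min_vruntime of the CFS run queue.
update_load_avg updates the load of the scheduling entity and of the CFS run queue. It is described in detail in the chapter on load.
If more than one entity is runnable on the current CFS run queue (nr_running > 1), check_preempt_tick checks whether the current process should be preempted.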

static void
check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
{
	unsigned long ideal_runtime, delta_exec;
	struct sched_entity *se;
	s64 delta;

	ideal_runtime = sched_slice(cfs_rq, curr);
	delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
	if (delta_exec > ideal_runtime) {
		resched_curr(rq_of(cfs_rq));
		/*
		 * The current task ran long enough, ensure it doesn't get
		 * re-elected due to buddy favours.
		 */
		clear_buddies(cfs_rq, curr);
		return;
	}

	/*
	 * Ensure that a task that missed wakeup preemption by a
	 * narrow margin doesn't have to wait for a full slice.
	 * This also mitigates buddy induced latencies under load.
	 */
	if (delta_exec < sysctl_sched_min_granularity)
		return;

	se = __pick_first_entity(cfs_rq);
	delta = curr->vruntime - se->vruntime;

	if (delta < 0)
		return;

	if (delta > ideal_runtime)
		resched_curr(rq_of(cfs_rq));
}

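sched_slice returns the ideal running time of the current process within one scheduling period.
sum_exec_runtime is the total execution time the entity has accumulated. It is updated on every call to update_curr.
prev_sum_exec_runtime is the value of sum_exec_runtime recorded when the entity was last picked to run. It is set on the pick_next path (in set_next_entity).
delta_exec is therefore the time the process has actually run since it was last scheduled in.
If the actual running time exceeds the ideal running time, the process has used up its share and must be scheduled out, so the need_resched flag is set.
If the actual running time is less than sysctl_sched_min_granularity, no rescheduling is needed. sysctl_sched_min_granularity guarantees a minimum running time for a process once it has been scheduled in.
Otherwise, the leftmost scheduling entity se is taken from the CFS red-black tree and the difference between the current process's vruntime and se's vruntime is computed.
If delta is less than 0, the vruntime of the current process is smaller than even the smallest vruntime in the tree, so the process is not rescheduled and keeps running.
If delta is greater than ideal_runtime, the current process has run too far ahead of the leftmost entity and must be rescheduled.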
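The decision sequence above can be replayed in user space. The sketch below is a simplified model under stated assumptions: the toy_ names and the 0.75 ms value of TOY_MIN_GRANULARITY_NS are hypothetical, toy_sched_slice reduces sched_slice to a plain weight-proportional share of a given period, and the function returns a flag instead of calling resched_curr.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical tunable in nanoseconds; the real value depends on the system. */
#define TOY_MIN_GRANULARITY_NS 750000ULL

struct toy_entity {
	uint64_t sum_exec_runtime;	/* total time executed so far */
	uint64_t prev_sum_exec_runtime;	/* snapshot taken when last picked */
	uint64_t vruntime;
};

/* Simplified slice: the entity's weight-proportional share of one period. */
static uint64_t toy_sched_slice(uint64_t period_ns, uint64_t weight,
				uint64_t total_weight)
{
	assert(total_weight > 0);
	return period_ns * weight / total_weight;
}

/* Returns 1 when the current entity should be rescheduled, 0 otherwise. */
static int toy_check_preempt_tick(const struct toy_entity *curr,
				  const struct toy_entity *leftmost,
				  uint64_t ideal_runtime)
{
	uint64_t delta_exec = curr->sum_exec_runtime -
			      curr->prev_sum_exec_runtime;
	int64_t delta;

	/* Ran longer than its ideal slice: reschedule. */
	if (delta_exec > ideal_runtime)
		return 1;

	/* Has not yet run for the guaranteed minimum: keep running. */
	if (delta_exec < TOY_MIN_GRANULARITY_NS)
		return 0;

	/* Compare against the entity with the smallest vruntime. */
	delta = (int64_t)(curr->vruntime - leftmost->vruntime);
	if (delta < 0)
		return 0;

	return (uint64_t)delta > ideal_runtime;
}
```

With a 6 ms period and three entities of equal weight 1024, toy_sched_slice gives each a 2 ms slice. An entity that has run 3 ms since it was picked is then rescheduled by the first check, regardless of vruntime.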

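Process sleep

When a process has to wait for a resource, it gives up the CPU by scheduling itself out. For example, a process waiting for data to arrive on a serial port has to yield the CPU so that other processes can use it, which keeps CPU utilization as high as possible. A process that needs to sleep usually calls schedule to give up the CPU: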

asmlinkage __visible void __sched schedule(void)
{
	struct task_struct *tsk = current;

	sched_submit_work(tsk);
	do {
		preempt_disable();
		__schedule(false);
		sched_preempt_enable_no_resched();
	} while (need_resched());
}

static void __sched notrace __schedule(bool preempt)
{
	cpu = smp_processor_id();
	rq = cpu_rq(cpu);
	prev = rq->curr;

	if (!preempt && prev->state) {
		if (signal_pending_state(prev->state, prev)) {
			prev->state = TASK_RUNNING;
		} else {
			deactivate_task(rq, prev, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK);
			prev->on_rq = 0;
		}
	}

	........
}

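When a process calls schedule, the argument passed to __schedule is false, which means that this is not a preemption. As described earlier in the basic concepts of a process, the state of a process is 0 when it is running (TASK_RUNNING) and non-zero otherwise. So if the state is non-zero and no signal is pending for the process, deactivate_task removes the current process from rq.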
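The branch in __schedule comes down to a three-way decision about the outgoing task. The following sketch captures only that decision. The enum and function names are hypothetical, and signal_pending_state is reduced to a boolean argument, whereas the real helper also looks at the task's state.

```c
/* Hypothetical outcome for the task that is about to give up the CPU. */
enum toy_action {
	TOY_KEEP_ON_RQ,		/* stays runnable on the run queue */
	TOY_SET_RUNNING,	/* pending signal: state reset to running */
	TOY_DEQUEUE		/* really goes to sleep: removed from rq */
};

/*
 * preempt:        non-zero if this call is a preemption
 * state:          0 means running, non-zero means some sleeping state
 * signal_pending: simplified stand-in for signal_pending_state()
 */
static enum toy_action toy_schedule_prev(int preempt, long state,
					 int signal_pending)
{
	if (!preempt && state != 0) {
		if (signal_pending)
			return TOY_SET_RUNNING;
		return TOY_DEQUEUE;
	}
	return TOY_KEEP_ON_RQ;
}
```

Note the last case: a task that is preempted keeps its place on the run queue even if its state is already non-zero, because the dequeue path is taken only for voluntary calls.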

void deactivate_task(struct rq *rq, struct task_struct *p, int flags)
{
	if (task_contributes_to_load(p))
		rq->nr_uninterruptible++;

	dequeue_task(rq, p, flags);
}

static inline void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
{
	p->sched_class->dequeue_task(rq, p, flags);
}

This ends up in the dequeue_task function of the process's scheduling class. CFS is again used as the example:

static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
{
	struct cfs_rq *cfs_rq;
	struct sched_entity *se = &p->se;
	int task_sleep = flags & DEQUEUE_SLEEP;

	for_each_sched_entity(se) {
		cfs_rq = cfs_rq_of(se);
		dequeue_entity(cfs_rq, se, flags);

		cfs_rq->h_nr_running--;
	}
}

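It gets the scheduling entity of the process, then the CFS run queue that the entity belongs to, and removes the entity from that CFS run queue with dequeue_entity.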

static void
dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
	/*
	 * Update run-time statistics of the 'current'.
	 */
	update_curr(cfs_rq);

	/*
	 * When dequeuing a sched_entity, we must:
	 *   - Update loads to have both entity and cfs_rq synced with now.
	 *   - Subtract its load from the cfs_rq->runnable_avg.
	 *   - Subtract its previous weight from cfs_rq->load.weight.
	 *   - For group entity, update its weight to reflect the new share
	 *     of its group cfs_rq.
	 */
	update_load_avg(cfs_rq, se, UPDATE_TG);
	dequeue_runnable_load_avg(cfs_rq, se);

	update_stats_dequeue(cfs_rq, se, flags);

	clear_buddies(cfs_rq, se);

	if (se != cfs_rq->curr)
		__dequeue_entity(cfs_rq, se);
	se->on_rq = 0;
}

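When a scheduling entity is removed from the CFS run queue, the following has to be done:
Update the load of the scheduling entity and of the CFS run queue
Subtract the load of the scheduling entity from cfs_rq->runnable_avg
Subtract the weight of the scheduling entity, and update the group scheduling weight
Call __dequeue_entity to remove the entity from the CFS red-black tree (skipped when the entity is cfs_rq->curr, because the running entity is not kept in the tree)
Set on_rq to 0, which means the scheduling entity is no longer on the CFS ready queue.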

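Waking up a process

After fork creates a new process, wake_up_new_task is called at the end to wake it up. The previous article described how this function adds a process to the CFS ready queue.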

void wake_up_new_task(struct task_struct *p)
{
	p->state = TASK_RUNNING;

	activate_task(rq, p, ENQUEUE_NOCLOCK);
	p->on_rq = TASK_ON_RQ_QUEUED;

	check_preempt_curr(rq, p, WF_FORK);
}

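activate_task adds the process to the ready queue, and check_preempt_curr then checks whether the woken process can preempt the current process. This can happen, for example, when the woken process is a higher-priority real-time process and the current process is a normal process.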

void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags)
{
	const struct sched_class *class;

	if (p->sched_class == rq->curr->sched_class) {
		rq->curr->sched_class->check_preempt_curr(rq, p, flags);
	} else {
		for_each_class(class) {
			if (class == rq->curr->sched_class)
				break;
			if (class == p->sched_class) {
				resched_curr(rq);
				break;
			}
		}
	}
}

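If the scheduling class of the woken process is the same as that of the currently running process, the check_preempt_curr callback of that scheduling class is called.
If the two scheduling classes differ, the classes are walked from highest to lowest priority. For example, if the current process is a normal process and the woken process is a real-time process, resched_curr is called directly to set the need_resched flag on the current process, which is then scheduled out at the next scheduling point.
In short:
If the scheduling class of the current process is lower than that of the woken process, the reschedule flag is set and the current process is scheduled out
If the two processes are in the same scheduling class, the class's check_preempt_curr callback decides whether to reschedule
If the scheduling class of the current process is higher than that of the woken process, nothing is done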
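The three cases follow from walking the classes in priority order and stopping at whichever of the two classes is seen first. Here is a small user-space sketch of that walk. The enum values and function names are hypothetical. The ordering stop, deadline, real-time, fair, idle is the kernel's class order, modelled here as enum order.

```c
/* Hypothetical class identifiers, listed from highest to lowest priority. */
enum toy_class {
	TOY_STOP,
	TOY_DL,
	TOY_RT,
	TOY_FAIR,
	TOY_IDLE,
	TOY_NR_CLASSES
};

enum toy_decision {
	TOY_NO_PREEMPT,	/* current class is higher: do nothing */
	TOY_RESCHED,	/* woken class is higher: set need_resched */
	TOY_ASK_CLASS	/* same class: defer to the class's own callback */
};

static enum toy_decision toy_check_preempt_curr(enum toy_class curr,
						enum toy_class woken)
{
	int c;

	if (curr == woken)
		return TOY_ASK_CLASS;

	/* Walk from the highest class down, like for_each_class. */
	for (c = 0; c < TOY_NR_CLASSES; c++) {
		if (c == (int)curr)
			return TOY_NO_PREEMPT;	/* reached curr's class first */
		if (c == (int)woken)
			return TOY_RESCHED;	/* reached woken's class first */
	}
	return TOY_NO_PREEMPT;
}
```

A real-time task woken while a fair task runs yields TOY_RESCHED. The reverse yields TOY_NO_PREEMPT, and two fair tasks yield TOY_ASK_CLASS.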

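static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
{
	/* se: entity of the running task, pse: entity of the woken task */
	struct task_struct *curr = rq->curr;
	struct sched_entity *se = &curr->se, *pse = &p->se;

	if (wakeup_preempt_entity(se, pse) == 1)
		resched_curr(rq);
}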

wakeup_preempt_entity decides whether the current process can be preempted. If it can, the need_resched flag is set.


/*
 * Should 'se' preempt 'curr'.
 *
 *             |s1
 *        |s2
 *   |s3
 *         g
 *      |<--->|c
 *
 *  w(c, s1) = -1
 *  w(c, s2) =  0
 *  w(c, s3) =  1
 *
 */

static int
wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
{
	s64 gran, vdiff = curr->vruntime - se->vruntime;

	if (vdiff <= 0)
		return -1;

	gran = wakeup_gran(se);
	if (vdiff > gran)
		return 1;

	return 0;
}

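The first parameter is the scheduling entity of the current process and the second is the scheduling entity of the woken process.
vdiff is the virtual runtime of the current entity minus the virtual runtime of the woken entity.
If vdiff is less than or equal to 0, the vruntime of the current process is not larger than that of the woken process, so there is no preemption.
wakeup_gran converts sysctl_sched_wakeup_granularity into virtual time (vruntime) for the woken scheduling entity.
Preemption is chosen only when the vruntime of the current entity exceeds the vruntime of the woken entity by more than gran.
If the vruntime difference between the current process and the woken process does not reach gran, there is no preemption. The comment above the function shows the three cases clearly.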
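The three cases s1, s2 and s3 from the kernel comment can be reproduced with concrete numbers. In this sketch the toy_ names are hypothetical, and gran is passed in as an already computed vruntime value instead of being derived by wakeup_gran.

```c
#include <stdint.h>

/* Hypothetical, reduced scheduling entity: only vruntime is needed here. */
struct toy_se {
	uint64_t vruntime;
};

/*
 * Returns  1 if the woken entity 'se' should preempt 'curr',
 *          0 if it is ahead of curr but by no more than gran,
 *         -1 if curr's vruntime is not larger than se's.
 */
static int toy_wakeup_preempt_entity(const struct toy_se *curr,
				     const struct toy_se *se, int64_t gran)
{
	int64_t vdiff = (int64_t)(curr->vruntime - se->vruntime);

	if (vdiff <= 0)
		return -1;

	if (vdiff > gran)
		return 1;

	return 0;
}
```

Take curr at a vruntime of 10 ms and a gran of 1 ms. A woken entity at 11 ms gives -1 (case s1), one at 9.5 ms gives 0 (case s2, inside the granularity window), and one at 8 ms gives 1 (case s3), which is the only case that triggers resched_curr.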

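Summary

When a process is created through fork, sched_fork sets its scheduling class and priority and updates its vruntime.
The process then has to be added to the ready queue. For CFS, this means inserting it into the CFS red-black tree, keyed by its vruntime. Because a new process has joined the ready queue, the load and weight of the whole queue change and must be recalculated.
Once processes are on the ready queue, the pick_next callback selects the next process to run. The policy is to pick the process with the smallest vruntime in the CFS red-black tree.
After the process has run for a while, scheduler_tick checks whether it has run longer than its ideal time. If so, it is scheduled out.
Alternatively, when the process has to wait for a system resource, it gives up the CPU through schedule and is removed from the CFS ready queue. Removing a process also changes the weight and load of the CFS run queue, which must be recalculated.
When the resource becomes ready, the process is woken up. At wakeup, a check is made whether the woken process should preempt the currently running one. If the woken process belongs to a higher-priority scheduling class, preemption happens. If both are in the same scheduling class, the vruntime difference is compared against a threshold, and the reschedule flag is set if the threshold is exceeded.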
