【原創(chuàng)】（五）Linux進(jìn)程調(diào)度-CFS調(diào)度器

怡紅公子0526 2022-07-14 發(fā)布于北京

展開全文

背景

Read the fucking source code! --By 魯迅
A picture is worth a thousand words. --By 高爾基

說明：

Kernel版本：4.14
ARM64處理器，Contex-A53，雙核
使用工具：Source Insight 3.5， Visio

1. 概述

Completely Fair Scheduler，完全公平調(diào)度器，用于Linux系統(tǒng)中普通進(jìn)程的調(diào)度。
CFS采用了紅黑樹算法來管理所有的調(diào)度實體sched_entity，算法效率為O(log(n))。CFS跟蹤調(diào)度實體sched_entity的虛擬運行時間vruntime，平等對待運行隊列中的調(diào)度實體sched_entity，將執(zhí)行時間少的調(diào)度實體sched_entity排列到紅黑樹的左邊。
調(diào)度實體sched_entity通過enqueue_entity()和dequeue_entity()來進(jìn)行紅黑樹的出隊入隊。

老規(guī)矩，先上張圖片來直觀了解一下原理：

每個sched_latency周期內(nèi)，根據(jù)各個任務(wù)的權(quán)重值，可以計算出運行時間runtime；
運行時間runtime可以轉(zhuǎn)換成虛擬運行時間vruntime；
根據(jù)虛擬運行時間的大小，插入到CFS紅黑樹中，虛擬運行時間少的調(diào)度實體放置到左邊；
在下一次任務(wù)調(diào)度的時候，選擇虛擬運行時間少的調(diào)度實體來運行；

在開始本文之前，建議先閱讀下（一）Linux進(jìn)程調(diào)度器-基礎(chǔ)。

開始探索之旅！

2. 數(shù)據(jù)結(jié)構(gòu)

2.1 調(diào)度類

Linux內(nèi)核抽象了一個調(diào)度類struct sched_class，這是一種典型的面向?qū)ο蟮脑O(shè)計思想，將共性的特征抽象出來封裝成類，在實例化各個調(diào)度器的時候，可以根據(jù)具體的調(diào)度算法來實現(xiàn)。這種方式做到了高內(nèi)聚低耦合，同時又很容易擴(kuò)展新的調(diào)度器。

在調(diào)度核心代碼kernel/sched/core.c中，使用的方式是task->sched_class->xxx_func，其中task表示的是描述任務(wù)的結(jié)構(gòu)體struct task_struck，在該結(jié)構(gòu)體中包含了任務(wù)所使用的調(diào)度器，進(jìn)而能找到對應(yīng)的函數(shù)指針來完成調(diào)用執(zhí)行，有點類似于C++中的多態(tài)機(jī)制。

2.2 rq/cfs_rq/task_struct/task_group/sched_entity

struct rq：每個CPU都有一個對應(yīng)的運行隊列；
struct cfs_rq：CFS運行隊列，該結(jié)構(gòu)中包含了struct rb_root_cached紅黑樹，用于鏈接調(diào)度實體struct sched_entity。rq運行隊列中對應(yīng)了一個CFS運行隊列，此外，在task_group結(jié)構(gòu)中也會為每個CPU再維護(hù)一個CFS運行隊列；
struct task_struct：任務(wù)的描述符，包含了進(jìn)程的所有信息，該結(jié)構(gòu)中的struct sched_entity，用于參與CFS的調(diào)度；
struct task_group：組調(diào)度（參考前文），Linux支持將任務(wù)分組來對CPU資源進(jìn)行分配管理，該結(jié)構(gòu)中為系統(tǒng)中的每個CPU都分配了struct sched_entity調(diào)度實體和struct cfs_rq運行隊列，其中struct sched_entity用于參與CFS的調(diào)度；
struct sched_entity：調(diào)度實體，這個也是CFS調(diào)度管理的對象了；

來一張圖看看它們之間的組織關(guān)系：

struct sched_entity結(jié)構(gòu)體字段注釋如下：

struct sched_entity {
	/* For load-balancing: */
	struct load_weight		load;                   //調(diào)度實體的負(fù)載權(quán)重值
	struct rb_node			run_node;             //用于連接到CFS運行隊列的紅黑樹中的節(jié)點
	struct list_head		group_node;          //用于連接到CFS運行隊列的cfs_tasks鏈表中的節(jié)點
	unsigned int			on_rq;              //用于表示是否在運行隊列中

	u64				exec_start;             //當(dāng)前調(diào)度實體的開始執(zhí)行時間
	u64				sum_exec_runtime;   //調(diào)度實體執(zhí)行的總時間
	u64				vruntime;           //虛擬運行時間，這個時間用于在CFS運行隊列中排隊
	u64				prev_sum_exec_runtime;  //上一個調(diào)度實體運行的總時間

	u64				nr_migrations;      //負(fù)載均衡

	struct sched_statistics		statistics;     //統(tǒng)計信息

#ifdef CONFIG_FAIR_GROUP_SCHED
	int				depth;      //任務(wù)組的深度，其中根任務(wù)組的深度為0，逐級往下增加
	struct sched_entity		*parent;        //指向調(diào)度實體的父對象
	/* rq on which this entity is (to be) queued: */
	struct cfs_rq			*cfs_rq;        //指向調(diào)度實體歸屬的CFS隊列，也就是需要入列的CFS隊列
	/* rq "owned" by this entity/group: */
	struct cfs_rq			*my_q;      //指向歸屬于當(dāng)前調(diào)度實體的CFS隊列，用于包含子任務(wù)或子的任務(wù)組
#endif

#ifdef CONFIG_SMP
	/*
	 * Per entity load average tracking.
	 *
	 * Put into separate cache line so it does not
	 * collide with read-mostly values above.
	 */
	struct sched_avg		avg ____cacheline_aligned_in_smp;   //用于調(diào)度實體的負(fù)載計算（`PELT`）
#endif
};

struct cfs_rq結(jié)構(gòu)體的關(guān)鍵字段注釋如下：

/* CFS-related fields in a runqueue */
struct cfs_rq {
	struct load_weight load;        //CFS運行隊列的負(fù)載權(quán)重值
	unsigned int nr_running, h_nr_running;  //nr_running：運行的調(diào)度實體數(shù)（參與時間片計算）

	u64 exec_clock;     //運行時間
	u64 min_vruntime;   //最少的虛擬運行時間，調(diào)度實體入隊出隊時需要進(jìn)行增減處理
#ifndef CONFIG_64BIT
	u64 min_vruntime_copy;
#endif

	struct rb_root_cached tasks_timeline;   //紅黑樹，用于存放調(diào)度實體

	/*
	 * 'curr' points to currently running entity on this cfs_rq.
	 * It is set to NULL otherwise (i.e when none are currently running).
	 */
	struct sched_entity *curr, *next, *last, *skip; //分別指向當(dāng)前運行的調(diào)度實體、下一個調(diào)度的調(diào)度實體、CFS運行隊列中排最后的調(diào)度實體、跳過運行的調(diào)度實體

#ifdef	CONFIG_SCHED_DEBUG
	unsigned int nr_spread_over;
#endif

#ifdef CONFIG_SMP
	/*
	 * CFS load tracking
	 */
	struct sched_avg avg;       //計算負(fù)載相關(guān)
	u64 runnable_load_sum;
	unsigned long runnable_load_avg;        //基于PELT的可運行平均負(fù)載
#ifdef CONFIG_FAIR_GROUP_SCHED
	unsigned long tg_load_avg_contrib;      //任務(wù)組的負(fù)載貢獻(xiàn)
	unsigned long propagate_avg;
#endif
	atomic_long_t removed_load_avg, removed_util_avg;
#ifndef CONFIG_64BIT
	u64 load_last_update_time_copy;
#endif

#ifdef CONFIG_FAIR_GROUP_SCHED
	/*
	 *   h_load = weight * f(tg)
	 *
	 * Where f(tg) is the recursive weight fraction assigned to
	 * this group.
	 */
	unsigned long h_load;
	u64 last_h_load_update;
	struct sched_entity *h_load_next;
#endif /* CONFIG_FAIR_GROUP_SCHED */
#endif /* CONFIG_SMP */

#ifdef CONFIG_FAIR_GROUP_SCHED
	struct rq *rq;	/* cpu runqueue to which this cfs_rq is attached */     //指向CFS運行隊列所屬的CPU RQ運行隊列

	/*
	 * leaf cfs_rqs are those that hold tasks (lowest schedulable entity in
	 * a hierarchy). Non-leaf lrqs hold other higher schedulable entities
	 * (like users, containers etc.)
	 *
	 * leaf_cfs_rq_list ties together list of leaf cfs_rq's in a cpu. This
	 * list is used during load balance.
	 */
	int on_list;
	struct list_head leaf_cfs_rq_list;
	struct task_group *tg;	/* group that "owns" this runqueue */       //CFS運行隊列所屬的任務(wù)組

#ifdef CONFIG_CFS_BANDWIDTH
	int runtime_enabled;    //CFS運行隊列中使用CFS帶寬控制
	u64 runtime_expires;    //到期的運行時間
	s64 runtime_remaining;      //剩余的運行時間

	u64 throttled_clock, throttled_clock_task;  //限流時間相關(guān)
	u64 throttled_clock_task_time;
	int throttled, throttle_count;      //throttled：限流，throttle_count：CFS運行隊列限流次數(shù)
	struct list_head throttled_list;    //運行隊列限流鏈表節(jié)點，用于添加到cfs_bandwidth結(jié)構(gòu)中的cfttle_cfs_rq鏈表中
#endif /* CONFIG_CFS_BANDWIDTH */
#endif /* CONFIG_FAIR_GROUP_SCHED */
};

3. 流程分析

整個流程分析，圍繞著CFS調(diào)度類實體：fair_sched_class中的關(guān)鍵函數(shù)來展開。

先來看看fair_sched_class都包含了哪些函數(shù)：

/*
 * All the scheduling class methods:
 */
const struct sched_class fair_sched_class = {
	.next			= &idle_sched_class,
	.enqueue_task		= enqueue_task_fair,
	.dequeue_task		= dequeue_task_fair,
	.yield_task		= yield_task_fair,
	.yield_to_task		= yield_to_task_fair,

	.check_preempt_curr	= check_preempt_wakeup,

	.pick_next_task		= pick_next_task_fair,
	.put_prev_task		= put_prev_task_fair,

#ifdef CONFIG_SMP
	.select_task_rq		= select_task_rq_fair,
	.migrate_task_rq	= migrate_task_rq_fair,

	.rq_online		= rq_online_fair,
	.rq_offline		= rq_offline_fair,

	.task_dead		= task_dead_fair,
	.set_cpus_allowed	= set_cpus_allowed_common,
#endif

	.set_curr_task          = set_curr_task_fair,
	.task_tick		= task_tick_fair,
	.task_fork		= task_fork_fair,

	.prio_changed		= prio_changed_fair,
	.switched_from		= switched_from_fair,
	.switched_to		= switched_to_fair,

	.get_rr_interval	= get_rr_interval_fair,

	.update_curr		= update_curr_fair,

#ifdef CONFIG_FAIR_GROUP_SCHED
	.task_change_group	= task_change_group_fair,
#endif
};

3.1 runtime與vruntime

CFS調(diào)度器沒有時間片的概念了，而是根據(jù)實際的運行時間和虛擬運行時間來對任務(wù)進(jìn)行排序，從而選擇調(diào)度。
那么，運行時間和虛擬運行時間是怎么計算的呢？看一下流程調(diào)用：

Linux內(nèi)核默認(rèn)的sysctl_sched_latency是6ms，這個值用戶態(tài)可設(shè)。sched_period用于保證可運行任務(wù)都能至少運行一次的時間間隔；
當(dāng)可運行任務(wù)大于8個的時候，sched_period的計算則需要根據(jù)任務(wù)個數(shù)乘以最小調(diào)度顆粒值，這個值系統(tǒng)默認(rèn)為0.75ms；
每個任務(wù)的運行時間計算，是用sched_period值，去乘以該任務(wù)在整個CFS運行隊列中的權(quán)重占比；
虛擬運行的時間 = 實際運行時間 * NICE_0_LOAD / 該任務(wù)的權(quán)重；

還是來看一個實例吧，以5個Task為例，其中每個Task的nice值不一樣（優(yōu)先級不同），對應(yīng)到的權(quán)重值在內(nèi)核中提供了一個轉(zhuǎn)換數(shù)組：

const int sched_prio_to_weight[40] = {
 /* -20 */     88761,     71755,     56483,     46273,     36291,
 /* -15 */     29154,     23254,     18705,     14949,     11916,
 /* -10 */      9548,      7620,      6100,      4904,      3906,
 /*  -5 */      3121,      2501,      1991,      1586,      1277,
 /*   0 */      1024,       820,       655,       526,       423,
 /*   5 */       335,       272,       215,       172,       137,
 /*  10 */       110,        87,        70,        56,        45,
 /*  15 */        36,        29,        23,        18,        15,
};

圖來了：

3.2 CFS調(diào)度tick

CFS調(diào)度器中的tick函數(shù)為task_tick_fair，系統(tǒng)中每個調(diào)度tick都會調(diào)用到，此外如果使用了hrtimer，也會調(diào)用到這個函數(shù)。
流程如下：

主要的工作包括：

更新運行時的各類統(tǒng)計信息，比如vruntime，運行時間、負(fù)載值、權(quán)重值等；
檢查是否需要搶占，主要是比較運行時間是否耗盡，以及vruntime的差值是否大于運行時間等；

來一張圖，感受一下update_curr函數(shù)的相關(guān)信息更新吧：

3.3 任務(wù)出隊入隊

當(dāng)任務(wù)進(jìn)入可運行狀態(tài)時，需要將調(diào)度實體放入到紅黑樹中，完成入隊操作；
當(dāng)任務(wù)退出可運行狀態(tài)時，需要將調(diào)度實體從紅黑樹中移除，完成出隊操作；
CFS調(diào)度器，使用enqueue_task_fair函數(shù)將任務(wù)入隊到CFS隊列，使用dequeue_task_fair函數(shù)將任務(wù)從CFS隊列中出隊操作。

出隊與入隊的操作中，核心的邏輯可以分成兩部分：1）更新運行時的數(shù)據(jù)，比如負(fù)載、權(quán)重、組調(diào)度的占比等等；2）將sched_entity插入紅黑樹，或者從紅黑樹移除；
由于dequeue_task_fair大體的邏輯類似，不再深入分析；
這個過程中，涉及到了CPU負(fù)載計算、task_group組調(diào)度、CFS Bandwidth帶寬控制等，這些都在前邊的文章中分析過，可以結(jié)合進(jìn)行理解；

3.3 任務(wù)創(chuàng)建

在父進(jìn)程通過fork創(chuàng)建子進(jìn)程的時候，task_fork_fair函數(shù)會被調(diào)用，這個函數(shù)的傳入?yún)?shù)是子進(jìn)程的task_struct。該函數(shù)的主要作用，就是確定子任務(wù)的vruntime，因此也能確定子任務(wù)的調(diào)度實體在紅黑樹RB中的位置。

task_fork_fair本身比較簡單，流程如下圖：

3.4 任務(wù)選擇

每當(dāng)進(jìn)程任務(wù)切換的時候，也就是schedule函數(shù)執(zhí)行時，調(diào)度器都需要選擇下一個將要執(zhí)行的任務(wù)。
在CFS調(diào)度器中，是通過pick_next_task_fair函數(shù)完成的，流程如下：

當(dāng)需要進(jìn)程任務(wù)切換的時候，pick_next_task_fair函數(shù)的傳入?yún)?shù)中包含了需要被切換出去的任務(wù)，也就是pre_task；
當(dāng)pre_task不是普通進(jìn)程時，也就是調(diào)度類不是CFS，那么它就不使用sched_entity的調(diào)度實體來參與調(diào)度，因此會執(zhí)行simple分支，通過put_pre_task函數(shù)來通知系統(tǒng)當(dāng)前的任務(wù)需要被切換，而不是通過put_prev_entity函數(shù)來完成；
當(dāng)pre_task是普通進(jìn)程時，調(diào)用pick_next_entity來選擇下一個執(zhí)行的任務(wù)，這個選擇過程實際是有兩種情況：當(dāng)調(diào)度實體對應(yīng)task時，do while()遍歷一次，當(dāng)調(diào)度實體對應(yīng)task_group是，則需要遍歷任務(wù)組來選擇下一個執(zhí)行的任務(wù)了。
put_prev_entity，用于切換任務(wù)前的準(zhǔn)備工作，更新運行時的統(tǒng)計數(shù)據(jù)，并不進(jìn)行dequeue的操作，其中需要將CFS隊列的curr指針置位成NULL；
set_next_entity，用于設(shè)置下一個要運行的調(diào)度實體，設(shè)置CFS隊列的curr指針；
如果使能了hrtimer，則將hrtimer的到期時間設(shè)置為調(diào)度實體的剩余運行時間；