linux进程的调度

进程状态

进程主要有7种状态：就绪状态、运行状态、轻度睡眠、中度睡眠、深度睡眠、僵尸状态、死亡状态。它们之间状态变迁如下：
在这里插入图片描述
进程描述符task_struct结构体中有个成员state专门用来描述进程的状态。

就绪状态：state为TASK_RUNNING，正在运行队列中等待调度器调度。
运行状态：state为TASK_RUNNING，被调度器命中，正在处理器中运行。
轻度睡眠：称为可打断的睡眠状态。state为TASK_INTERRUPTIBLE，可以被信号打断。
中度睡眠：state为TASK_KILLABLE，只能被致命的信号打断。
深度睡眠：不可打断的睡眠状态。state为TASK_UNINTERRUPTIBLE，不能被信号打断。
死亡状态：state为TASK_DEAD。在task_struct中还另外有一个字段exit_state为EXIT_DEAD。
僵尸状态：state为TASK_DEAD。在task_struct中还另外有一个字段exit_state为EXIT_ZOMBIE。如果父进程关注子进程退出事件，那么子进程在退出时
发送SIGCHILD信号通知父进程，变成僵尸进程。父进程在查询子进程终止原因后收回子进程的进程描述符。

调度算法原则

良好的调度算法应该考虑以下几个问题：

公平：保证每个进程得到合理的cpu时间。
高效：使cpu保持忙碌状态，即总是有进程在cpu上执行
响应时间：使交互用户的响应时间尽可能短
周转时间：使批处理用户等待输出的时间尽可能短
吞吐量：使单位时间内处理的进程数量尽可能多

举几个遵循以上原则的例子，比如unix系统采用的动态优先调度，windows的抢先多任务调度。

进程调度的依据

在include/sched.h的task_struct结构体中，有以下几个成员，注意这些成员的代码不是在一起的，这里只是把它们贴在一起了。

//在调度时机到来时检测该值，如果为1则调用schedule()
volatile long need_resched;
//进程处于运行状态时所剩的时钟嘀嗒数。每次时钟中断到来时，这个值减1，直到减为0
//如果该值为0，就把need_resched置为1。这个值也称为动态优先级
long counter;
//进程的静态优先级。这个值决定了counter的初值
long nice;
//实时进程的优先级
unsigned long rt_priority;
//从整体上区分实时进程和普通进程，因为实时进程和普通进程的调度不同，实时进程应当优于普通进程执行。可以通过系统调用sched_setscheduler()来改变调度的策略
unsigned long policy;
1
2
3
4
5
6
7
8
9
10
11

进程调度策略

进程优先级

限期进程的优先级比实时进程高，实时进程的优先级比普通进程高。
限制进程的优先级是-1。
实时进程的实时优先级是1-99，优先级数值越大，表示优先级越高。
普通进程的静态优先级是100-139，优先级值越小，表示优先级越高，可通过修改nice值改变普通进程的优先级，优先级等于120加上nice值。
在task_struct结构体中，4个成员和优先级有关如下：

int				prio;
int				static_prio;
int				normal_prio;
unsigned int			rt_priority;
1
2
3
4

具体来说这4个成员的关系如下：
在这里插入图片描述
有一点值得注意，如果优先级低的进程占有实时互斥锁，此时又有优先级高的进程等待实时互斥锁，这时会把占有实时互斥锁的进程的优先级临时提高到等待实时互斥锁的进程的优先级，这个现象叫优先级继承。

限期调度策略

有三个参数：运行时间runtime、截止期限deadline和周期period。每个周期运行一次，一次运行的时长为runtime。
在这里插入图片描述
具体的调度策略：选取一个绝对截止期限最小的进程，执行了runtime时长后，如果还未完成就让出cpu，并在运行队列中删除，下一个周期开始时再加入。

实时进程调度策略

支持两种调度策略。

先进先出调度策略：没有时间片，如果没有更高优先级的实时进程，并且该实时进程不睡眠，那么它将一直占有处理器。
轮流调度策略：有时间片，进程用完时间片以后会加入优先级对应运行队列的队尾，把处理器给优先级相同的其他进程。

普通进程调度策略

标准轮流分时调度策略：使用完全公平调度算法，把处理器时间公平分给每个进程。
空闲调度策略：该策略用来执行优先级非常低的进程，优先级比使用标准轮流分时调度策略中相对优先级（nice）为19的普通进程还低。进程的相对优先级对空闲调度策略没有影响。

调度类

注意是调度器里面包含进程，而不是进程里包含调度器，特别是不要因为看见task_struct中有个sched_class*成员，就记错了。不同的进程对应不同的调度器，但是一个进程只能属于一个调度器。

停机调度类

停机调度类是优先级最高的调度类，支持限期调度类。停机进程优先级最高，可抢占其他所有进程。没有时间片，如果不主动让出cpu就会一直占有。目前只有停机线程属于停机调度类。设计停机调度类的目的，是为了让迁移线程的优先级比限期进程高，才能快速处理调度器发出的迁移指令，把进程从当前处理器迁移到其他处理器。停机进程对外伪装成最高优先级的实时进程。

限期调度类

使用优先算法（使用红黑树）把进程按绝对截止期限从小到大排序，每次调度时选择绝对截止期限最小的进程。

实时调度类

为每个优先级维护一个源码。

/*
 * This is the priority-queue data structure of the RT scheduling class:
 */
struct rt_prio_array {
	DECLARE_BITMAP(bitmap, MAX_RT_PRIO+1); /* include 1 bit for delimiter */
	struct list_head queue[MAX_RT_PRIO];
};
1
2
3
4
5
6
7

位图bitmap用来快速查找第一个非空队列，数组queue的下标是实时进程调度优先级，下标越小，优先级越高。

公平调度类

使用完全公平调度算法。引入了虚拟运行时间。
虚拟运行时间=实际运行时间*nice0对应的权重（1024）/进程的权重
nice值n-1是n对应的权重约1.25倍。

/*
 * Nice levels are multiplicative, with a gentle 10% change for every
 * nice level changed. I.e. when a CPU-bound task goes from nice 0 to
 * nice 1, it will get ~10% less CPU time than another CPU-bound task
 * that remained on nice 0.
 *
 * The "10% effect" is relative and cumulative: from _any_ nice level,
 * if you go up 1 level, it's -10% CPU usage, if you go down 1 level
 * it's +10% CPU usage. (to achieve that we use a multiplier of 1.25.
 * If a task goes up by ~10% and another task goes down by ~10% then
 * the relative distance between them is ~25%.)
 */
const int sched_prio_to_weight[40] = {
 /* -20 */     88761,     71755,     56483,     46273,     36291,
 /* -15 */     29154,     23254,     18705,     14949,     11916,
 /* -10 */      9548,      7620,      6100,      4904,      3906,
 /*  -5 */      3121,      2501,      1991,      1586,      1277,
 /*   0 */      1024,       820,       655,       526,       423,
 /*   5 */       335,       272,       215,       172,       137,
 /*  10 */       110,        87,        70,        56,        45,
 /*  15 */        36,        29,        23,        18,        15,
};
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22

使用空闲调度的普通进程，权重为3。

空闲调度类

每个处理器上有一个空闲线程，即0号线程。空闲调度类的优先级最低，仅当没有其他进程可以调度时，才会执行空闲线程。

总结一下
在这里插入图片描述

运行队列

每个处理器有一个运行队列，结构体是rq，定义的全局变量如下：

DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
1

rq描述就绪队列，其设计是为每个CPU一个就绪队列，本地进程在本地队列上排序。

调度器

调度器函数

调度器的实现基于两个函数：周期性调度器函数和主调度器函数。这些函数根据现有进程的优先级分配CPU时间。这也是为什么整个方法称之为优先调度的原因。

周期性调度器函数

周期性调度器在scheduler_tick中实现，如果系统正在活动中，内核会按照频率HZ自动调用该函数。该函数主要有两个任务：

更新相关统计量：管理内核中与整个系统和各个进程的调度相关的统计量。其间执行的主要操作是对各种计数器加1。
激活负责当前进程的调度类的周期性调度方法。

/*
 * This function gets called by the timer code, with HZ frequency.
 * We call it with interrupts disabled.
 */
void scheduler_tick(void)
{
    //获取当前cpu上的全局就绪队列rq和当前运行进程curr
    //在SMP的情况下，获取当前cpu的id，如果不是则返回0
	int cpu = smp_processor_id();
    //获得cpu的全局就绪队列rq，每个cpu都有一个就绪队列rq
	struct rq *rq = cpu_rq(cpu);
    //获取就绪队列上正在运行的进程curr
	struct task_struct *curr = rq->curr;
	struct rq_flags rf;

	sched_clock_tick();

	rq_lock(rq, &rf);

	//更新rq当前的时间戳，即rq->clock变为当前时间戳
    //处理就绪队列时钟的更新，本质上是增加struct rq当前实例的时钟时间戳
    update_rq_clock(rq);

    //由于调度器的模块化结构，主要工作可以完全由特定调度器类方法实现，
    //task_tick实现模式取决于底层的调度器类
    //执行当前运行进程所在调度类的task_tick()函数，进行周期性调度
	curr->sched_class->task_tick(rq, curr, 0);
    //将当前负荷加入到数组的第一个位置
	cpu_load_update_active(rq);
    //更新全局cpu的就绪队列calc_load_update
    //更新cpu的活动计数，主要是更新全局cpu就绪队列calc_load_update
	calc_global_load_tick(rq);

	rq_unlock(rq, &rf);
    //与perf计数事件相关
	perf_event_task_tick();

#ifdef CONFIG_SMP  //配置相关的SMP
    //判断cpu是否为空闲状态
	rq->idle_balance = idle_cpu(cpu);
    //如果进行周期性负载均衡，则触发SCHED_SOFTIRQ
	trigger_load_balance(rq);
#endif
	rq_last_tick_reset(rq);
}
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45

再看下里面主要的两个任务

void update_rq_clock(struct rq *rq)
{
	s64 delta;

	lockdep_assert_held(&rq->lock);

	if (rq->clock_update_flags & RQCF_ACT_SKIP)
		return;

#ifdef CONFIG_SCHED_DEBUG
	if (sched_feat(WARN_DOUBLE_CLOCK))
		SCHED_WARN_ON(rq->clock_update_flags & RQCF_UPDATED);
	rq->clock_update_flags |= RQCF_UPDATED;
#endif

	delta = sched_clock_cpu(cpu_of(rq)) - rq->clock;
	if (delta < 0)
		return;
	rq->clock += delta;
	update_rq_clock_task(rq, delta);
}
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21

/*
 * Called from scheduler_tick() to periodically update this CPU's
 * active count.
 */
void calc_global_load_tick(struct rq *this_rq)
{
	long delta;

	if (time_before(jiffies, this_rq->calc_load_update))
		return;

	delta  = calc_load_fold_active(this_rq, 0);
	if (delta)
		atomic_long_add(delta, &calc_load_tasks);

	this_rq->calc_load_update += LOAD_FREQ;
}
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17

里面还有具体的函数调用，就不再继续跟下去了。

主调度器函数

在内核中的许多地方，如果要将CPU分配给与当前活动进程不同的另一个进程，都会直接调用主调度器函数（schedule）。

asmlinkage __visible void __sched schedule(void)
{
	struct task_struct *tsk = current;

	sched_submit_work(tsk);
	do {
		preempt_disable();
		__schedule(false);
		sched_preempt_enable_no_resched();
	} while (need_resched());
}
EXPORT_SYMBOL(schedule);
1
2
3
4
5
6
7
8
9
10
11
12

调度进程

主动调度进程的函数是schedule() ，即主调度器函数，该函数会把主要工作委托给__schedule()去处理。

/*
 * __schedule() is the main scheduler function.
 *
 * The main means of driving the scheduler and thus entering this function are:
 *
 *   1. Explicit blocking: mutex, semaphore, waitqueue, etc.
 *
 *   2. TIF_NEED_RESCHED flag is checked on interrupt and userspace return
 *      paths. For example, see arch/x86/entry_64.S.
 *
 *      To drive preemption between tasks, the scheduler sets the flag in timer
 *      interrupt handler scheduler_tick().
 *
 *   3. Wakeups don't really cause entry into schedule(). They add a
 *      task to the run-queue and that's it.
 *
 *      Now, if the new task added to the run-queue preempts the current
 *      task, then the wakeup sets TIF_NEED_RESCHED and schedule() gets
 *      called on the nearest possible occasion:
 *
 *       - If the kernel is preemptible (CONFIG_PREEMPT=y):
 *
 *         - in syscall or exception context, at the next outmost
 *           preempt_enable(). (this might be as soon as the wake_up()'s
 *           spin_unlock()!)
 *
 *         - in IRQ context, return from interrupt-handler to
 *           preemptible context
 *
 *       - If the kernel is not preemptible (CONFIG_PREEMPT is not set)
 *         then at the next:
 *
 *          - cond_resched() call
 *          - explicit schedule() call
 *          - return from syscall or exception to user-space
 *          - return from interrupt-handler to user-space
 *
 * WARNING: must be called with preemption disabled!
 */
//参数preempt表示是否抢占调度
static void __sched notrace __schedule(bool preempt)
{
	struct task_struct *prev, *next;
	unsigned long *switch_count;
	struct rq_flags rf;
	struct rq *rq;
	int cpu;

	cpu = smp_processor_id();
	rq = cpu_rq(cpu);
	prev = rq->curr;

	schedule_debug(prev);

	if (sched_feat(HRTICK))
		hrtick_clear(rq);

	local_irq_disable();
	rcu_note_context_switch(preempt);

	/*
	 * Make sure that signal_pending_state()->signal_pending() below
	 * can't be reordered with __set_current_state(TASK_INTERRUPTIBLE)
	 * done by the caller to avoid the race with signal_wake_up().
	 */
	smp_mb__before_spinlock();
	rq_lock(rq, &rf);

	/* Promote REQ to ACT */
	rq->clock_update_flags <<= 1;
	update_rq_clock(rq);

	switch_count = &prev->nivcsw;
	if (!preempt && prev->state) {
		if (unlikely(signal_pending_state(prev->state, prev))) {
			prev->state = TASK_RUNNING;
		} else {
			deactivate_task(rq, prev, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK);
			prev->on_rq = 0;

			if (prev->in_iowait) {
				atomic_inc(&rq->nr_iowait);
				delayacct_blkio_start();
			}

			/*
			 * If a worker went to sleep, notify and ask workqueue
			 * whether it wants to wake up a task to maintain
			 * concurrency.
			 */
			if (prev->flags & PF_WQ_WORKER) {
				struct task_struct *to_wakeup;

				to_wakeup = wq_worker_sleeping(prev);
				if (to_wakeup)
					try_to_wake_up_local(to_wakeup, &rf);
			}
		}
		switch_count = &prev->nvcsw;
	}

    //此函数针对公平调度类做了优化
    //如果当前进程属于公平调度类或空闲调度类，并且所有运行队列中的进程属于公平调度类，就直接调用；
    //如果公平调度类没有选中下一个进程。则从空闲调度类中选择下一个进程
	next = pick_next_task(rq, prev, &rf);
	clear_tsk_need_resched(prev);
	clear_preempt_need_resched();

	if (likely(prev != next)) {
		rq->nr_switches++;
		rq->curr = next;
		++*switch_count;

		trace_sched_switch(preempt, prev, next);

		/* Also unlocks the rq: */
		rq = context_switch(rq, prev, next, &rf);
	} else {
		rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP);
		rq_unlock_irq(rq, &rf);
	}

	balance_callback(rq);
}
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124

函数__shcedule比较长，这里简要总结一下，其主要处理过程如下：

调用pick_next_task()以选择下一个进程。注意每个调度类都有自己的pick_next_task函数，比如停机调度类pick_next_task_stop()、限期调度类pick_next_task_dl()等，一共5个。
调用context_switch()以切换进程。

介绍一下切换进程。

/*
 * context_switch - switch to the new MM and the new thread's register state.
 */
static __always_inline struct rq *
context_switch(struct rq *rq, struct task_struct *prev,
	       struct task_struct *next, struct rq_flags *rf)
{
	struct mm_struct *mm, *oldmm;
    //执行进程切换的准备工作
	prepare_task_switch(rq, prev, next);

	mm = next->mm;
	oldmm = prev->active_mm;
	/*
	 * For paravirt, this is coupled with an exit in switch_to to
	 * combine the page table reload and the switch backend into
	 * one hypercall.
	 */
    //开始上下文切换，是每种处理器架构必须定义的函数
	arch_start_context_switch(prev);

    //如果下一个进程是内核线程（内核线程没有用户虚拟地址空间）
	if (!mm) {
        //借用上一个进程的用户虚拟地址空间，把借来的空间保存在成员active_mm中
		next->active_mm = oldmm;
		mmgrab(oldmm);
        //此函数通知处理器架构不需要切换用户虚拟地址空间，这种加速进程切换的技术叫TLB
		enter_lazy_tlb(oldmm, next);
	} else
        //如果下一个进程是用户进程，那么就调用此函数切换进程的用户虚拟地址空间
		switch_mm_irqs_off(oldmm, mm, next);

    //如果上一个进程是内核线程，就把成员active_mm置空，断开它与借用的用户虚拟地址空间的联系
    //把它借用的用户虚拟地址空间保存在运行队列的成员prev_mm中
	if (!prev->mm) {
		prev->active_mm = NULL;
		rq->prev_mm = oldmm;
	}

	rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP);

	/*
	 * Since the runqueue lock will be released by the next
	 * task (which is an invalid locking op but in the case
	 * of the scheduler it's an obvious special-case), so we
	 * do an early lockdep release here:
	 */
	rq_unpin_lock(rq, rf);
	spin_release(&rq->lock.dep_map, 1, _THIS_IP_);

	/* Here we just switch the register state and the stack. */
	switch_to(prev, next, prev);
    //编译优化屏障，防止内核编译时调整运行顺序
	barrier();

	return finish_task_switch(prev);
}
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57

该函数有两个操作：切换用户虚拟地址空间和切换寄存器。

切换用户虚拟地址空间

ARM64架构使用默认的switch_mm_irqs_off

#ifndef switch_mm_irqs_off
# define switch_mm_irqs_off switch_mm
#endif
1
2
3

看到arm架构中代码

static inline void __switch_mm(struct mm_struct *next)
{
	unsigned int cpu = smp_processor_id();

	/*
	 * init_mm.pgd does not contain any user mappings and it is always
	 * active for kernel addresses in TTBR1. Just set the reserved TTBR0.
	 */
	if (next == &init_mm) {
		cpu_set_reserved_ttbr0();
		return;
	}

	check_and_switch_context(next, cpu);
}

static inline void
switch_mm(struct mm_struct *prev, struct mm_struct *next,
	  struct task_struct *tsk)
{
	if (prev != next)
		__switch_mm(next);

	/*
	 * Update the saved TTBR0_EL1 of the scheduled-in task as the previous
	 * value may have not been initialised yet (activate_mm caller) or the
	 * ASID has changed since the last run (following the context switch
	 * of another thread of the same process). Avoid setting the reserved
	 * TTBR0_EL1 to swapper_pg_dir (init_mm; e.g. via idle_task_exit).
	 */
	if (next != &init_mm)
		update_saved_ttbr0(tsk, next);
}
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33

切换寄存器

宏switch_to把这项工作委托给函数__switch_to

#define switch_to(prev, next, last)					\
	do {								\
		((last) = __switch_to((prev), (next)));			\
	} while (0)
1
2
3
4

一点一点往下跟，arm64架构中

/*
 * Thread switching.
 */
__notrace_funcgraph struct task_struct *__switch_to(struct task_struct *prev,
				struct task_struct *next)
{
	struct task_struct *last;

    //切换浮点寄存器
    //为什么要切换？因为不同处理器架构的浮点寄存器可能不同，而且有的处理器架构不支持浮点运算，所以各种处理器架构需要自己实现浮点运算
    //arm64支持浮点运算和单指令多数据（SIMD）功能，共用32个128位寄存器
	fpsimd_thread_switch(next);
	//切换线程本地存储相关的寄存器
    tls_thread_switch(next);
    //切换调度寄存器
	hw_breakpoint_thread_switch(next);
    //切换上下文标识符寄存器
	contextidr_thread_switch(next);
    //使用当前处理器每处理器变量，记录下一个进程的进程描述符地址
	entry_task_switch(next);
	uao_thread_switch(next);

	/*
	 * Complete any pending TLB or cache maintenance on this CPU in case
	 * the thread migrates to a different CPU.
	 */
    //在这个处理器上执行完前面的所有页表缓存或缓存维护操作，防止线程迁移到其他处理器
	dsb(ish);

	/* the actual thread switch */
    //实际的线程切换
	last = cpu_switch_to(prev, next);

	return last;
}
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35

调度时机

调度进程的时机如下：

进程主动调用schedule()函数。
周期性地调度，抢占当前进程，强迫当前进程让出处理器。
唤醒进程的时候，被唤醒的进程可能抢占当前进程。
创建新进程的时候，新进程可能抢占当前进程。

如果我们编译内核时开启对内核抢占的支持，那么内核会增加一些抢占点。

主动调度

进程在用户模式下运行的时候，无法直接调用schedule()函数，只能通过系统调用进入内核模式，如果系统调用需要等待某个资源，如互斥锁或信号量，就会把进程的状态设置为睡眠状态，然后调用schedule()函数来调度进程。
进程也可以通过系统调用shced_yield()让出处理器，这种情况下进程不会睡眠。
在内核中有3种主动调度方式：

直接调用schedule()函数来调用进程。
调用有条件重调度函数cond_resched()。
如果需要等待某个资源。

周期调度

有些“地痞流氓”进程不主动让出处理器，内核只能依靠周期性的时钟中断夺回处理器的控制权，时钟中断是调度器的脉博。时钟中断处理程序检查当前进程的执行时间有没有超过限额，如果超过限额，设置需要重新调度的标志。当时钟中断处理程序准备返点处理器还给被打断的进程时，如果被打断的进程在用户模式下运行，就检查有没有设置需要重新调度的标志，如果设置了，调用schedule函数以调度进程。如果需要重新调度，就为当前进程的thread_info结构体的成员flags设置需要重新调度的标志。

SMP调度

SMP是对称多处理器系统。在SMP系统中，进程调度器必须支持如下：

需要使用每个处理器的负载尽可能均衡。
可以设置进程的处理器亲和性，即允许进程在哪些处理器上执行。
可以把进程从一个处理器迁移到另一个处理器。

进程的处理器亲和性

设置进程的处理器亲和性，通俗就是把进程绑定到某些处理器，只允许进程在某些处理器上执行，默认情况是进程可以在所有处理器上执行。进程的task_struct中有一个成员cpus_allowed,表示本进程可以在哪些CPU上运行。
应用编程接口，只有2个系统调用，一个是设置另一个是获取。

//设置进程的处理器亲和性掩码
long sched_setaffinity(pid_t pid, const struct cpumask *in_mask)
{
	cpumask_var_t cpus_allowed, new_mask;
	struct task_struct *p;
	int retval;

	rcu_read_lock();

	p = find_process_by_pid(pid);
	if (!p) {
		rcu_read_unlock();
		return -ESRCH;
	}

	/* Prevent p going away */
	get_task_struct(p);
	rcu_read_unlock();

	if (p->flags & PF_NO_SETAFFINITY) {
		retval = -EINVAL;
		goto out_put_task;
	}
	if (!alloc_cpumask_var(&cpus_allowed, GFP_KERNEL)) {
		retval = -ENOMEM;
		goto out_put_task;
	}
	if (!alloc_cpumask_var(&new_mask, GFP_KERNEL)) {
		retval = -ENOMEM;
		goto out_free_cpus_allowed;
	}
	retval = -EPERM;
	if (!check_same_owner(p)) {
		rcu_read_lock();
		if (!ns_capable(__task_cred(p)->user_ns, CAP_SYS_NICE)) {
			rcu_read_unlock();
			goto out_free_new_mask;
		}
		rcu_read_unlock();
	}

	retval = security_task_setscheduler(p);
	if (retval)
		goto out_free_new_mask;


	cpuset_cpus_allowed(p, cpus_allowed);
	cpumask_and(new_mask, in_mask, cpus_allowed);

	/*
	 * Since bandwidth control happens on root_domain basis,
	 * if admission test is enabled, we only admit -deadline
	 * tasks allowed to run on all the CPUs in the task's
	 * root_domain.
	 */
#ifdef CONFIG_SMP
	if (task_has_dl_policy(p) && dl_bandwidth_enabled()) {
		rcu_read_lock();
		if (!cpumask_subset(task_rq(p)->rd->span, new_mask)) {
			retval = -EBUSY;
			rcu_read_unlock();
			goto out_free_new_mask;
		}
		rcu_read_unlock();
	}
#endif
again:
	retval = __set_cpus_allowed_ptr(p, new_mask, true);

	if (!retval) {
		cpuset_cpus_allowed(p, cpus_allowed);
		if (!cpumask_subset(new_mask, cpus_allowed)) {
			/*
			 * We must have raced with a concurrent cpuset
			 * update. Just reset the cpus_allowed to the
			 * cpuset's cpus_allowed
			 */
			cpumask_copy(new_mask, cpus_allowed);
			goto again;
		}
	}
out_free_new_mask:
	free_cpumask_var(new_mask);
out_free_cpus_allowed:
	free_cpumask_var(cpus_allowed);
out_put_task:
	put_task_struct(p);
	return retval;
}
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89

//获取进程的处理器亲和性掩码
long sched_getaffinity(pid_t pid, struct cpumask *mask)
{
	struct task_struct *p;
	unsigned long flags;
	int retval;

	rcu_read_lock();

	retval = -ESRCH;
	p = find_process_by_pid(pid);
	if (!p)
		goto out_unlock;

	retval = security_task_getscheduler(p);
	if (retval)
		goto out_unlock;

	raw_spin_lock_irqsave(&p->pi_lock, flags);
	cpumask_and(mask, &p->cpus_allowed, cpu_active_mask);
	raw_spin_unlock_irqrestore(&p->pi_lock, flags);

out_unlock:
	rcu_read_unlock();

	return retval;
}
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27

内核中有两个函数用以设置处理器亲和性掩码

//把一个刚创建的进程绑定到一个处理器
static void __kthread_bind(struct task_struct *p, unsigned int cpu, long state)
{
	__kthread_bind_mask(p, cpumask_of(cpu), state);
}
1
2
3
4
5

具体流程就不再往下跟了。

//设置内核线程的处理器亲和性掩码
void (*set_cpus_allowed)(struct task_struct *p, const struct cpumask *newmask);
1
2

具体定义就不看了。

限期调度类的处理器负载均衡

限期调度类的处理器负载均衡简单，调度选择下一个限期进程的时候，如果当前正在执行的进程是限期进程，将会试图从限期进程超载的处理器把限期进程搞过来。
限期进程超载定义：

限期运行队列至少有两个限期进程。
至少有一个限期进程绑定到多个处理器。

static void pull_dl_task(struct rq *this_rq)
{
	int this_cpu = this_rq->cpu, cpu;
	struct task_struct *p;
	bool resched = false;
	struct rq *src_rq;
	u64 dmin = LONG_MAX;

	if (likely(!dl_overloaded(this_rq)))
		return;

	/*
	 * Match the barrier from dl_set_overloaded; this guarantees that if we
	 * see overloaded we must also see the dlo_mask bit.
	 */
	smp_rmb();

	for_each_cpu(cpu, this_rq->rd->dlo_mask) {
		if (this_cpu == cpu)
			continue;

		src_rq = cpu_rq(cpu);

		/*
		 * It looks racy, abd it is! However, as in sched_rt.c,
		 * we are fine with this.
		 */
		if (this_rq->dl.dl_nr_running &&
		    dl_time_before(this_rq->dl.earliest_dl.curr,
				   src_rq->dl.earliest_dl.next))
			continue;

		/* Might drop this_rq->lock */
		double_lock_balance(this_rq, src_rq);

		/*
		 * If there are no more pullable tasks on the
		 * rq, we're done with it.
		 */
		if (src_rq->dl.dl_nr_running <= 1)
			goto skip;

		p = pick_earliest_pushable_dl_task(src_rq, this_cpu);

		/*
		 * We found a task to be pulled if:
		 *  - it preempts our current (if there's one),
		 *  - it will preempt the last one we pulled (if any).
		 */
		if (p && dl_time_before(p->dl.deadline, dmin) &&
		    (!this_rq->dl.dl_nr_running ||
		     dl_time_before(p->dl.deadline,
				    this_rq->dl.earliest_dl.curr))) {
			WARN_ON(p == src_rq->curr);
			WARN_ON(!task_on_rq_queued(p));

			/*
			 * Then we pull iff p has actually an earlier
			 * deadline than the current task of its runqueue.
			 */
			if (dl_time_before(p->dl.deadline,
					   src_rq->curr->dl.deadline))
				goto skip;

			resched = true;

			deactivate_task(src_rq, p, 0);
			set_task_cpu(p, this_cpu);
			activate_task(this_rq, p, 0);
			dmin = p->dl.deadline;

			/* Is there any other task even earlier? */
		}
skip:
		double_unlock_balance(this_rq, src_rq);
	}

	if (resched)
		resched_curr(this_rq);
}
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80

看下调度

/*
 * Update the current task's runtime statistics (provided it is still
 * a -deadline task and has not been removed from the dl_rq).
 */
static void update_curr_dl(struct rq *rq)
{
	struct task_struct *curr = rq->curr;
	struct sched_dl_entity *dl_se = &curr->dl;
	u64 delta_exec;

	if (!dl_task(curr) || !on_dl_rq(dl_se))
		return;

	/*
	 * Consumed budget is computed considering the time as
	 * observed by schedulable tasks (excluding time spent
	 * in hardirq context, etc.). Deadlines are instead
	 * computed using hard walltime. This seems to be the more
	 * natural solution, but the full ramifications of this
	 * approach need further study.
	 */
	delta_exec = rq_clock_task(rq) - curr->se.exec_start;
	if (unlikely((s64)delta_exec <= 0)) {
		if (unlikely(dl_se->dl_yielded))
			goto throttle;
		return;
	}

	/* kick cpufreq (see the comment in kernel/sched/sched.h). */
	cpufreq_update_this_cpu(rq, SCHED_CPUFREQ_DL);

	schedstat_set(curr->se.statistics.exec_max,
		      max(curr->se.statistics.exec_max, delta_exec));

	curr->se.sum_exec_runtime += delta_exec;
	account_group_exec_runtime(curr, delta_exec);

	curr->se.exec_start = rq_clock_task(rq);
	cpuacct_charge(curr, delta_exec);

	sched_rt_avg_update(rq, delta_exec);

	dl_se->runtime -= delta_exec;

throttle:
    //如果限期进程用完运行时间或主动让出处理器
	if (dl_runtime_exceeded(dl_se) || dl_se->dl_yielded) {
		dl_se->dl_throttled = 1;  //节流标志
        //把当前进程从限期运行队列中删除
		__dequeue_task_dl(rq, curr, 0);
        //如果当前进程被临时提升为限期进程，即该进程占用了某个限期进程的实时互斥锁
        //或绝对截止期限过期
		if (unlikely(dl_se->dl_boosted || !start_dl_timer(curr)))
			enqueue_task_dl(rq, curr, ENQUEUE_REPLENISH);

        //如果当前进程不在截止期限运行队列中或截止期限不是最小
		if (!is_leftmost(curr, &rq->dl))
			resched_curr(rq);
	}

	/*
	 * Because -- for now -- we share the rt bandwidth, we need to
	 * account our runtime there too, otherwise actual rt tasks
	 * would be able to exceed the shared quota.
	 *
	 * Account to the root rt group for now.
	 *
	 * The solution we're working towards is having the RT groups scheduled
	 * using deadline servers -- however there's a few nasties to figure
	 * out before that can happen.
	 */
	if (rt_bandwidth_enabled()) {
		struct rt_rq *rt_rq = &rq->rt;

		raw_spin_lock(&rt_rq->rt_runtime_lock);
		/*
		 * We'll let actual RT tasks worry about the overflow here, we
		 * have our own CBS to keep us inline; only account when RT
		 * bandwidth is relevant.
		 */
		if (sched_rt_bandwidth_account(rt_rq))
			rt_rq->rt_time += delta_exec;
		raw_spin_unlock(&rt_rq->rt_runtime_lock);
	}
}
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85

实时调度类的处理器负载均衡

实时调度类的处理器负载均衡和限期调度类相似。调度器选择下一个实时进程时，如果当前处理器的实时运行队列中的进程的最高调度优先级比当前正在执行的进程的调度优先级低，将会试图从实时进程超载的处理器把可推送实时进程拉过来。
实时进程超载的定义：

实时运行队列至少有两个实时进程。
至少有一个可推送实时进程。

static void pull_rt_task(struct rq *this_rq)
{
	int this_cpu = this_rq->cpu, cpu;
	bool resched = false;
	struct task_struct *p;
	struct rq *src_rq;

	if (likely(!rt_overloaded(this_rq)))
		return;

	/*
	 * Match the barrier from rt_set_overloaded; this guarantees that if we
	 * see overloaded we must also see the rto_mask bit.
	 */
	smp_rmb();

#ifdef HAVE_RT_PUSH_IPI
	if (sched_feat(RT_PUSH_IPI)) {
		tell_cpu_to_push(this_rq);
		return;
	}
#endif

	for_each_cpu(cpu, this_rq->rd->rto_mask) {
		if (this_cpu == cpu)
			continue;

		src_rq = cpu_rq(cpu);

		/*
		 * Don't bother taking the src_rq->lock if the next highest
		 * task is known to be lower-priority than our current task.
		 * This may look racy, but if this value is about to go
		 * logically higher, the src_rq will push this task away.
		 * And if its going logically lower, we do not care
		 */
		if (src_rq->rt.highest_prio.next >=
		    this_rq->rt.highest_prio.curr)
			continue;

		/*
		 * We can potentially drop this_rq's lock in
		 * double_lock_balance, and another CPU could
		 * alter this_rq
		 */
		double_lock_balance(this_rq, src_rq);

		/*
		 * We can pull only a task, which is pushable
		 * on its rq, and no others.
		 */
		p = pick_highest_pushable_task(src_rq, this_cpu);

		/*
		 * Do we have an RT task that preempts
		 * the to-be-scheduled task?
		 */
		if (p && (p->prio < this_rq->rt.highest_prio.curr)) {
			WARN_ON(p == src_rq->curr);
			WARN_ON(!task_on_rq_queued(p));

			/*
			 * There's a chance that p is higher in priority
			 * than what's currently running on its cpu.
			 * This is just that p is wakeing up and hasn't
			 * had a chance to schedule. We only pull
			 * p if it is lower in priority than the
			 * current task on the run queue
			 */
			if (p->prio < src_rq->curr->prio)
				goto skip;

			resched = true;

			deactivate_task(src_rq, p, 0);
			set_task_cpu(p, this_cpu);
			activate_task(this_rq, p, 0);
			/*
			 * We continue with the search, just in
			 * case there's an even higher prio task
			 * in another runqueue. (low likelihood
			 * but possible)
			 */
		}
skip:
		double_unlock_balance(this_rq, src_rq);
	}

	if (resched)
		resched_curr(this_rq);
}
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91

公平调度类的处理器负载均衡

目前多处理器系统有两种体系结构：NUMA和SMP。
处理器内部的拓扑如下：

核(core)：一个处理器包含多个核，每个核独立的一级缓存，所有核共享二级缓存。
硬件线程：也称为逻辑处理器或者虚拟处理器，一个处理器或者核包含多个硬件线程，硬件线程共享一级缓存和二级缓存。MIPS处理器的叫法是同步多线程(Simultaneous Multi-Threading,SMT)，英特尔对它的叫法是超线程。

写在最后

不知不觉写了不少，特别是贴了不少代码，阅读可能需要耐心。
最后还要再提几个概念

运行队列

没有必要的代码就不贴了。

struct rq {
...
    struct cfs_rq cfs;
    struct rt_rq rt;
    struct dl_rq dl;
...
    struct task_struct *curr, *idle, *stop;
...
};
1
2
3
4
5
6
7
8
9

struct rq中嵌入公平队列cfs、实时运行队列rt、限期运行队列dl，但是停机调度和空闲调度没有队列。这是因为停机调度类和空闲调度类在每个处理器上只有一个内核线程，不需要运行队列，直接定义成员stop和idle指向迁移线程和空闲线程即可。

内核态和用户态

由于需要限制不同程序的访问能力，所以有了内核态和用户态。

内核态：cpu可以访问内存中所有的数据，包括外围设备（网卡、硬盘等），cpu也可以将自己从一个程序切换到另一个程序。
用户态：只能受限访问内存中的用户空间，并且不能访问外围设备。占用cpu的能力被剥夺，cpu资源可以被其他程序获取。

用户态切换到内核态，一般是发生了系统调用或是发生了异常（如缺页异常、外设产生中断）。

相关阅读:
【已解决】Operation timed out 问题
 【Echarts】自定义提示框tooltip样式，实现点击路由跳转
 【自然语言实战】机器学习之基于评论内容的主题分类模型
 JUC-ReentrantLock锁基础篇
 如何做好电子内窥镜的网络安全管理？
释放C盘空间：WinSXS文件夹真实性大小判断及释放占用空间
 竞赛深度学习YOLO抽烟行为检测 - python opencv
前端从零到一开发vscode插件并发布到插件市场
 李超线段树
 测试经验总结
原文地址：https://blog.csdn.net/m0_65931372/article/details/126081908