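smp_processor_id is used all over the kernel. It returns the logical CPU on which the calling code is currently running.
This article looks at one question: how is that interface implemented?
When I first came across this interface, I assumed it read a dedicated register inside the CPU or used a special instruction. To check, I took a quick look at the code and found that on ARM32 the value comes from the cpu field of the current kernel thread's thread_info structure, as shown below: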
- ./include/linux/smp.h:
-
- # define smp_processor_id() raw_smp_processor_id()
-
- // smp_processor_id is an alias for raw_smp_processor_id,
-
- // and raw_smp_processor_id is in turn implemented as follows:
-
- ./arch/arm/include/asm/smp.h:
-
- #define raw_smp_processor_id() (current_thread_info()->cpu)
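This result puzzled me: how can a single field tell the code which logical CPU it is on? The answer has to lie in where that field gets assigned, and tracking down that assignment turned out to be a long and painful exploration.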
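Before chasing the assignment, it helps to see how current_thread_info() can find the structure at all. On ARM32 kernels of this generation, thread_info sits at the base of each task's 8 KiB kernel stack, and the lookup is just the stack pointer masked down to the stack's alignment. The following is a user-space sketch of that idea, not kernel code: every toy_ name is hypothetical, and an address inside an aligned heap block stands in for the real sp register.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define TOY_THREAD_SIZE 8192u   /* assumed stack size: 8 KiB, aligned to its own size */

struct toy_thread_info {
    unsigned int cpu;           /* logical CPU this context is running on */
};

/* Allocate one aligned "kernel stack"; thread_info lives at its base. */
static struct toy_thread_info *toy_alloc_stack(unsigned int cpu)
{
    struct toy_thread_info *ti = aligned_alloc(TOY_THREAD_SIZE, TOY_THREAD_SIZE);

    assert(ti != NULL);
    ti->cpu = cpu;
    return ti;
}

/* Mask any address inside the stack down to the stack base. */
static struct toy_thread_info *toy_thread_info_of(const void *sp)
{
    uintptr_t base = (uintptr_t)sp & ~(uintptr_t)(TOY_THREAD_SIZE - 1);

    return (struct toy_thread_info *)base;
}

/* Stand-in for raw_smp_processor_id(): read the cpu field of the owning context. */
static unsigned int toy_processor_id(const void *sp)
{
    return toy_thread_info_of(sp)->cpu;
}
```

Calling toy_processor_id() with any address inside a stack created by toy_alloc_stack(1) yields 1, which is why no hardware register needs to be read on this path.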
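In fact, to trace the assignment we first need to be clear about the execution contexts a CPU can be in.
For any CPU core, the code currently executing may be:
1 User code, belonging to some user process or thread; whether it is a process or a thread makes little difference here.
2 A transition from user space into the kernel, such as a system call.
3 Kernel logic, running either on behalf of a user process or in a standalone kernel thread.
4 An interrupt, taken from either kernel mode or user mode.
5 An exception, raised in either kernel mode or user mode.
6 Signal handling. When returning from the kernel to user space with signals pending, the kernel has to run the registered signal handler before the user code resumes.
As we can see, in every one of these cases the executing code has an owner, that is, a context: a user thread, a kernel thread, or some kind of exception. It always comes from somewhere. This is the key piece of information.
If a CPU core knows which core it is when it starts running, then starting from the very first idle task the core can pass that number along.
First, the idle task's number is set by hand; this is the origin of everything.
After that, starting from idle, if the number is passed along at every switch into any of the six scenarios above, it is handed down like an inheritance from generation to generation.
Then, whenever code needs to know which CPU it is executing on, it only has to find out which context it is in and read that context's CPU number to determine which core it is running on.
The whole flow is: the current code calls the interface to get the CPU number -> the interface looks up the number in the current context -> the current context's number was passed on by whoever switched to it -> that switcher got it from an earlier switcher -> the first idle task had it set by hand.
As long as every link in this chain is correct, the lookup result is correct.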
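The chain above can be sketched as a toy model in C. This is only an illustration under simplifying assumptions: all toy_ names are invented, a task is stamped with a CPU number when it is placed on that CPU, and the cpu argument of toy_smp_processor_id() merely stands for "the hardware core that happens to be executing the call".

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NR_CPUS 2

struct toy_task {
    const char *name;
    unsigned int cpu;          /* stamped whenever the task is placed on a CPU */
};

struct toy_cpu {
    struct toy_task *curr;     /* context this CPU is executing right now */
};

static struct toy_cpu toy_cpus[TOY_NR_CPUS];

/* Placing a task on a CPU records that CPU's number in the task. */
static void toy_set_task_cpu(struct toy_task *t, unsigned int cpu)
{
    assert(cpu < TOY_NR_CPUS);
    t->cpu = cpu;
}

/* A CPU only switches to tasks that were placed on it, so the stamp stays true. */
static void toy_switch_to(unsigned int cpu, struct toy_task *next)
{
    assert(cpu < TOY_NR_CPUS);
    assert(next->cpu == cpu);
    toy_cpus[cpu].curr = next;
}

/* The code running on `cpu` asks its own context, never the hardware. */
static unsigned int toy_smp_processor_id(unsigned int cpu)
{
    return toy_cpus[cpu].curr->cpu;
}
```

If the idle tasks are stamped 0 and 1 by hand and every later task is stamped before a CPU switches to it, the lookup always returns the number of the core doing the asking. The only step that cannot be derived from an earlier one is the very first stamp.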
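So the question becomes: where does the number originally come from? In other words, where is the number first written into the stack of the first kernel-mode idle thread?
First, the core that starts running at power-on is normally core 0, which is decided by the CPU design, so almost all kernel initialization is done by core 0.
Once core 0 has finished the necessary initialization, the other cores are woken up and start taking tasks, sharing the system load.
So before the other cores start, the core running the code is number 0, which is easy to establish. But once the other cores have started, how does each one learn its number?
Let's first look at the kernel boot log: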
- Booting Linux on physical CPU 0x0
- Initializing cgroup subsys cpu
- Linux version 3.18.20 (root@ubuntu) (gcc version 4.9.4 20150629 (prerelease) (Hisilicon_v600_20180525) ) #19 SMP Wed May 29 19:23:24 PDT 2019
- CPU: ARMv7 Processor [410fc075] revision 5 (ARMv7), cr=10c5387d
- CPU: PIPT / VIPT nonaliasing data cache, VIPT aliasing instruction cache
- Machine model: Hisilicon HI3536C DEMO Board
- Memory policy: Data cache writealloc
- On node 0 totalpages: 65536
- free_area_init_node: node 0, pgdat c074ee00, node_mem_map cfdf7000
- Normal zone: 512 pages used for memmap
- Normal zone: 0 pages reserved
- Normal zone: 65536 pages, LIFO batch:15
- PERCPU: Embedded 9 pages/cpu @cfddc000 s6144 r8192 d22528 u36864
- pcpu-alloc: s6144 r8192 d22528 u36864 alloc=9*4096
- pcpu-alloc: [0] 0 [0] 1
- Built 1 zonelists in Zone order, mobility grouping on. Total pages: 65024
- Kernel command line: mem=256M console=ttyAMA0,115200 root=/dev/mtdblock2 rootfstype=jffs2 rw mtdparts=hi_sfc:1M(boot),4M(kernel),27M(rootfs)
- PID hash table entries: 1024 (order: 0, 4096 bytes)
- Dentry cache hash table entries: 32768 (order: 5, 131072 bytes)
- Inode-cache hash table entries: 16384 (order: 4, 65536 bytes)
- Memory: 251976K/262144K available (5483K kernel code, 230K rwdata, 1496K rodata, 240K init, 297K bss, 10168K reserved, 0K highmem)
- Virtual kernel memory layout:
- vector : 0xffff0000 - 0xffff1000 ( 4 kB)
- fixmap : 0xffc00000 - 0xffe00000 (2048 kB)
- vmalloc : 0xd0800000 - 0xff000000 ( 744 MB)
- lowmem : 0xc0000000 - 0xd0000000 ( 256 MB)
- pkmap : 0xbfe00000 - 0xc0000000 ( 2 MB)
- modules : 0xbf000000 - 0xbfe00000 ( 14 MB)
- .text : 0xc0008000 - 0xc06d9070 (6981 kB)
- .init : 0xc06da000 - 0xc0716000 ( 240 kB)
- .data : 0xc0716000 - 0xc074fb40 ( 231 kB)
- .bss : 0xc074fb40 - 0xc079a1fc ( 298 kB)
- SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=2, Nodes=1
- Hierarchical RCU implementation.
- RCU restricting CPUs from NR_CPUS=4 to nr_cpu_ids=2.
- RCU: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=2
- NR_IRQS:16 nr_irqs:16 16
- sched_clock: 32 bits at 3000kHz, resolution 333ns, wraps every 1431655765682ns
- Console: colour dummy device 80x30
- Calibrating delay loop... 2580.48 BogoMIPS (lpj=1290240)
- pid_max: default: 32768 minimum: 301
- Mount-cache hash table entries: 1024 (order: 0, 4096 bytes)
- Mountpoint-cache hash table entries: 1024 (order: 0, 4096 bytes)
- CPU: Testing write buffer coherency: ok
- CPU0: thread -1, cpu 0, socket 0, mpidr 80000000
- Setting up static identity map for 0x8053c180 - 0x8053c1d8
- CPU1: Booted secondary processor
- CPU1: thread -1, cpu 1, socket 0, mpidr 80000001
- Brought up 2 CPUs
- SMP: Total of 2 processors activated (5173.24 BogoMIPS).
- CPU: All CPU(s) started in SVC mode.
- devtmpfs: initialized
- VFP support v0.3: implementor 41 architecture 2 part 30 variant 7 rev 5
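As the log shows, the other SMP cores are brought up before the drivers are initialized, after core 0 has finished the necessary boot-time initialization.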
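The mpidr value printed in the CPU0 and CPU1 lines of the log is the content of the ARM MPIDR register, which is the register-based route considered at the start of this article. As a rough sketch, the decoding behind "thread -1, cpu 1, socket 0" could look like the code below. The toy_ names are made up for illustration, and the bit layout (8-bit affinity fields, bit 24 as the multithreading flag, bit 31 marking the multiprocessing-extensions format) is an assumption based on the ARMv7 register layout rather than code copied from the kernel.

```c
#include <assert.h>
#include <stdint.h>

struct toy_topology {
    int thread;   /* -1 when the cores are not hardware-multithreaded */
    int core;
    int socket;
};

#define TOY_MPIDR_MT_BIT        (1u << 24)  /* affinity level 0 names a hardware thread */
#define TOY_MPIDR_AFF(m, level) (((m) >> (8 * (level))) & 0xffu)

static struct toy_topology toy_decode_mpidr(uint32_t mpidr)
{
    struct toy_topology t;

    /* Bit 31 set: the register uses the multiprocessing-extensions layout. */
    assert((mpidr >> 31) == 1u);

    if (mpidr & TOY_MPIDR_MT_BIT) {
        t.thread = (int)TOY_MPIDR_AFF(mpidr, 0);
        t.core   = (int)TOY_MPIDR_AFF(mpidr, 1);
        t.socket = (int)TOY_MPIDR_AFF(mpidr, 2);
    } else {
        t.thread = -1;
        t.core   = (int)TOY_MPIDR_AFF(mpidr, 0);
        t.socket = (int)TOY_MPIDR_AFF(mpidr, 1);
    }
    return t;
}
```

Feeding in 0x80000001 from the log gives thread -1, core 1, socket 0. So the hardware can report a core's identity, but smp_processor_id on ARM32 does not read this register on every call; it reads the cpu field discussed above.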
Inside smp_init the CPU numbers are generated by the loop itself (the loop variable cpu); code that runs later obtains the number through smp_processor_id.
- /* Called by boot processor to activate the rest. */
- void __init smp_init(void)
- {
- unsigned int cpu;
-
- idle_threads_init();
-
- /* FIXME: This should be done in userspace --RR */
- for_each_present_cpu(cpu) {
- if (num_online_cpus() >= setup_max_cpus)
- break;
- if (!cpu_online(cpu))
- cpu_up(cpu);
- }
-
- /* Any cleanup work */
- smp_announce();
- smp_cpus_done(setup_max_cpus);
- }
By this point the number can already be read:
- /**
- * idle_threads_init - Initialize idle threads for all cpus
- */
- void __init idle_threads_init(void)
- {
- unsigned int cpu, boot_cpu;
-
- boot_cpu = smp_processor_id();
-
- for_each_possible_cpu(cpu) {
- if (cpu != boot_cpu)
- idle_init(cpu);
- }
- }
But this runs on the boot CPU, so the value is zero. For the other CPUs we need to look further, into idle_init:
- /**
- * idle_init - Initialize the idle thread for a cpu
- * @cpu: The cpu for which the idle thread should be initialized
- *
- * Creates the thread if it does not exist.
- */
- static inline void idle_init(unsigned int cpu)
- {
- struct task_struct *tsk = per_cpu(idle_threads, cpu);
-
- if (!tsk) {
- tsk = fork_idle(cpu);
- if (IS_ERR(tsk))
- pr_err("SMP: fork_idle() failed for CPU %u\n", cpu);
- else
- per_cpu(idle_threads, cpu) = tsk;
- }
- }
-
-
- struct task_struct *fork_idle(int cpu)
- {
- struct task_struct *task;
- task = copy_process(CLONE_VM, 0, 0, NULL, &init_struct_pid, 0);
- if (!IS_ERR(task)) {
- init_idle_pids(task->pids);
- init_idle(task, cpu);
- }
-
- return task;
- }
-
- /**
- * init_idle - set up an idle thread for a given CPU
- * @idle: task in question
- * @cpu: cpu the idle task belongs to
- *
- * NOTE: this function does not set the idle thread's NEED_RESCHED
- * flag, to make booting more robust.
- */
- void init_idle(struct task_struct *idle, int cpu)
- {
- struct rq *rq = cpu_rq(cpu);
- unsigned long flags;
-
- raw_spin_lock_irqsave(&rq->lock, flags);
-
- __sched_fork(0, idle);
- idle->state = TASK_RUNNING;
- idle->se.exec_start = sched_clock();
-
- do_set_cpus_allowed(idle, cpumask_of(cpu));
- /*
- * We're having a chicken and egg problem, even though we are
- * holding rq->lock, the cpu isn't yet set to this cpu so the
- * lockdep check in task_group() will fail.
- *
- * Similar case to sched_fork(). / Alternatively we could
- * use task_rq_lock() here and obtain the other rq->lock.
- *
- * Silence PROVE_RCU
- */
- rcu_read_lock();
- __set_task_cpu(idle, cpu);
- rcu_read_unlock();
-
- rq->curr = rq->idle = idle;
- idle->on_rq = TASK_ON_RQ_QUEUED;
- #if defined(CONFIG_SMP)
- idle->on_cpu = 1;
- #endif
- raw_spin_unlock_irqrestore(&rq->lock, flags);
-
- /* Set the preempt count _outside_ the spinlocks! */
- init_idle_preempt_count(idle, cpu);
-
- /*
- * The idle tasks have their own, simple scheduling class:
- */
- idle->sched_class = &idle_sched_class;
- ftrace_graph_init_idle_task(idle, cpu);
- vtime_init_idle(idle, cpu);
- #if defined(CONFIG_SMP)
- sprintf(idle->comm, "%s/%d", INIT_TASK_COMM, cpu);
- #endif
- }
-
-
- static inline void __set_task_cpu(struct task_struct *p, unsigned int cpu)
- {
- set_task_rq(p, cpu);
- #ifdef CONFIG_SMP
- /*
- * After ->cpu is set up to a new value, task_rq_lock(p, ...) can be
- * successfuly executed on another CPU. We must ensure that updates of
- * per-task data have been completed by this moment.
- */
- smp_wmb();
- task_thread_info(p)->cpu = cpu;
- p->wake_cpu = cpu;
- #endif
- }
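Note the assignment near the end: task_thread_info(p)->cpu = cpu;
Every function along this path passes the cpu argument down. In other words, for each of the other logical CPUs in the SMP system an idle task is created and its CPU field is set to the number passed in. Note that the other CPUs are not running yet at this point, so these numbers are preset: the idle task whose number is 1 must later be executed by the core whose number is 1. How is that guaranteed?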
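The presetting step can be condensed into a small user-space sketch. It mirrors only the shape of idle_threads_init, idle_init and init_idle quoted above; the toy_ names and the fixed TOY_MAX_CPUS are assumptions for illustration, and nothing here touches a real scheduler.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_CPUS 4

struct toy_idle {
    unsigned int cpu;      /* number of the core this idle task is reserved for */
    int initialized;
};

static struct toy_idle toy_idle_threads[TOY_MAX_CPUS];

/* Analogue of init_idle(): stamp the idle task with its future CPU. */
static void toy_init_idle(struct toy_idle *idle, unsigned int cpu)
{
    idle->cpu = cpu;
    idle->initialized = 1;
}

/*
 * Analogue of idle_threads_init(): runs on the boot CPU, skips the boot
 * CPU itself and prepares an idle task for every other CPU.
 * Returns how many idle tasks were newly created.
 */
static unsigned int toy_idle_threads_init(unsigned int boot_cpu)
{
    unsigned int cpu, created = 0;

    assert(boot_cpu < TOY_MAX_CPUS);
    for (cpu = 0; cpu < TOY_MAX_CPUS; cpu++) {
        if (cpu == boot_cpu)
            continue;
        if (!toy_idle_threads[cpu].initialized) {
            toy_init_idle(&toy_idle_threads[cpu], cpu);
            created++;
        }
    }
    return created;
}
```

The point to notice is that the number is written by the boot CPU while the target core is still parked; the target core has no say in it.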
Let's first fill in the earlier part of the flow. The creation of the idle tasks goes back to a fork:
- static noinline void __init_refok rest_init(void)
- {
- int pid;
-
- rcu_scheduler_starting();
- /*
- * We need to spawn init first so that it obtains pid 1, however
- * the init task will end up wanting to create kthreads, which, if
- * we schedule it before we create kthreadd, will OOPS.
- */
- kernel_thread(kernel_init, NULL, CLONE_FS);
- numa_default_policy();
- pid = kernel_thread(kthreadd, NULL, CLONE_FS | CLONE_FILES);
- rcu_read_lock();
- kthreadd_task = find_task_by_pid_ns(pid, &init_pid_ns);
- rcu_read_unlock();
- complete(&kthreadd_done);
-
- /*
- * The boot idle thread must execute schedule()
- * at least once to get things moving:
- */
- init_idle_bootup_task(current);
- schedule_preempt_disabled();
- /* Call into cpu_idle with preempt disabled */
- cpu_startup_entry(CPUHP_ONLINE);
- }
Note that kernel_init runs in a kernel thread. At this point that thread is executed by the boot CPU, and it is forked from the idle thread:
- /*
- * Create a kernel thread.
- */
- pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags)
- {
- return do_fork(flags|CLONE_VM|CLONE_UNTRACED, (unsigned long)fn,
- (unsigned long)arg, NULL, NULL);
- }
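Putting it together, the whole call chain is as follows (rest_init spawns kernel_init through kernel_thread rather than calling it directly):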
rest_init
kernel_init
kernel_init_freeable
smp_init
idle_threads_init
idle_init
fork_idle
init_idle
__set_task_cpu
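At this point we have the complete path along which the CPU number is set.
Now back to the earlier question: how do we make sure that core number i executes idle thread number i?
1 After power-on, the boot CPU (CPU0) runs first while the other CPUs wait, for example parked in a wfe or wfi loop on ARM.
2 When CPU0 has finished its preparation it tells the other CPUs to start working. On ARM this is typically done with the sev instruction (which wakes cores waiting in wfe), with an inter-processor interrupt (which wakes cores waiting in wfi), or with a firmware call such as PSCI CPU_ON.
3 Once another CPU receives the event, it starts running code from the configured address. This is the key point.
4 The code each of those cores runs eventually ties it to that core's own idle task, whether by writing a memory address directly or by jumping.
5 In this way the primary CPU can pair the CPU number of a secondary CPU's idle task (recorded in the stack as shown above) with that secondary CPU's start address; that is, the start address for CPU i is at offset i.
In the end each secondary CPU goes off to execute its own idle task. What each secondary CPU runs is different and is configured by the primary CPU, so the primary CPU naturally knows which secondary CPU it is configuring, and therefore which number to set.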
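One way to picture the handoff in steps 4 and 5 is a single mailbox that the primary CPU fills in before it wakes a secondary. As far as I can tell, ARM32 does something along these lines by publishing the idle task's stack for the secondary to adopt, but the code below is only a sequential user-space sketch: the toy_ names are hypothetical, cores are assumed to be started one at a time, and no real second core is involved.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define TOY_STACK_SIZE 8192u    /* assumed size and alignment of an idle stack */

struct toy_idle_ti {
    unsigned int cpu;           /* preset by the primary CPU */
};

/* Single mailbox shared by the primary CPU and the one secondary being started. */
static struct {
    void *stack;
} toy_secondary_data;

/* Primary side: build the idle stack for `cpu`, preset its number, publish the stack top. */
static struct toy_idle_ti *toy_cpu_up(unsigned int cpu)
{
    struct toy_idle_ti *ti = aligned_alloc(TOY_STACK_SIZE, TOY_STACK_SIZE);

    assert(ti != NULL);
    ti->cpu = cpu;
    toy_secondary_data.stack = (char *)ti + TOY_STACK_SIZE - 8;
    return ti;
}

/* Secondary side: adopt the published stack pointer, then derive the CPU number from it. */
static unsigned int toy_secondary_start(void)
{
    uintptr_t sp = (uintptr_t)toy_secondary_data.stack;
    struct toy_idle_ti *ti =
        (struct toy_idle_ti *)(sp & ~(uintptr_t)(TOY_STACK_SIZE - 1));

    return ti->cpu;
}
```

Because the primary decides which core it wakes and which stack it publishes at that moment, the secondary ends up on the idle stack that already carries its own number, without ever computing the number itself.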
- Thread 1 hit Breakpoint 3, psci_boot_secondary (cpu=1, idle=0xdb09c800) at arch/arm/kernel/psci_smp.c:54
- 54 if (psci_ops.cpu_on)
- (gdb) bt
- #0 psci_boot_secondary (cpu=1, idle=0xdb09c800) at arch/arm/kernel/psci_smp.c:54
- #1 0xc030fde4 in __cpu_up (cpu=1, idle=0xdb09c800) at arch/arm/kernel/smp.c:163
- #2 0xc0349478 in bringup_cpu (cpu=1) at kernel/cpu.c:530
- #3 0xc0349ae8 in cpuhp_invoke_callback (cpu=1, state=CPUHP_BRINGUP_CPU, bringup=<optimized out>, node=<optimized out>, lastp=0x0) at kernel/cpu.c:170
- #4 0xc034ae8c in cpuhp_up_callbacks (target=<optimized out>, st=<optimized out>, cpu=<optimized out>) at kernel/cpu.c:584
- #5 _cpu_up (cpu=1, tasks_frozen=<optimized out>, target=<optimized out>) at kernel/cpu.c:1192
- #6 0xc034afa4 in do_cpu_up (cpu=1, target=CPUHP_ONLINE) at kernel/cpu.c:1228
- #7 0xc034afbc in cpu_up (cpu=<optimized out>) at kernel/cpu.c:1236
- #8 0xc14252ac in smp_init () at kernel/smp.c:578
- #9 0xc14011a0 in kernel_init_freeable () at init/main.c:1140
- #10 0xc0e7be2c in kernel_init (unused=<optimized out>) at init/main.c:1064
- #11 0xc03010e8 in ret_from_fork () at arch/arm/kernel/entry-common.S:158
- Backtrace stopped: previous frame identical to this frame (corrupt stack?)
-
- (gdb) bt
- #0 psci_cpu_on (cpuid=2, entry_point=1076896960) at drivers/firmware/psci.c:194
- #1 0xc0316430 in psci_boot_secondary (cpu=<optimized out>, idle=<optimized out>) at ./arch/arm/include/asm/memory.h:323
- #2 0xc030fde4 in __cpu_up (cpu=2, idle=0xdb09ce00) at arch/arm/kernel/smp.c:163
- #3 0xc0349478 in bringup_cpu (cpu=2) at kernel/cpu.c:530
- #4 0xc0349ae8 in cpuhp_invoke_callback (cpu=2, state=CPUHP_BRINGUP_CPU, bringup=<optimized out>, node=<optimized out>, lastp=0x0) at kernel/cpu.c:170
- #5 0xc034ae8c in cpuhp_up_callbacks (target=<optimized out>, st=<optimized out>, cpu=<optimized out>) at kernel/cpu.c:584
- #6 _cpu_up (cpu=2, tasks_frozen=<optimized out>, target=<optimized out>) at kernel/cpu.c:1192
- #7 0xc034afa4 in do_cpu_up (cpu=2, target=CPUHP_ONLINE) at kernel/cpu.c:1228
- #8 0xc034afbc in cpu_up (cpu=<optimized out>) at kernel/cpu.c:1236
- #9 0xc14252ac in smp_init () at kernel/smp.c:578
- #10 0xc14011a0 in kernel_init_freeable () at init/main.c:1140
- #11 0xc0e7be2c in kernel_init (unused=<optimized out>) at init/main.c:1064
- #12 0xc03010e8 in ret_from_fork () at arch/arm/kernel/entry-common.S:158
- Backtrace stopped: previous frame identical to this frame (corrupt stack?)
-
-
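The above are partial stack traces of secondary CPUs being started under QEMU emulation. Note that they were captured on a newer kernel than the 3.18 sources quoted earlier, so the frames between cpu_up and __cpu_up differ from that code. (Image from ARM's official materials.)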