• Linux CPU: mpstat


    Preface

    NAME
           mpstat - Report processors related statistics.
    

    vmstat observes the performance of the system as a whole and pidstat observes individual processes, while mpstat observes the performance of individual CPUs.

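    1. Introduction to mpstat

    The mpstat command writes the activity of each available processor to standard output, with processor 0 being the first one. It also reports the global average activity across all processors. mpstat can be used on both SMP and UP machines, but on UP machines only the global average activity is printed. If no activity is selected, the default report is the CPU utilization report.

    Notes:
    UP (Uni-Processor): the system has a single processor unit, i.e. a single-core CPU system.
    SMP (Symmetric Multi-Processors): the system has multiple processor units, which share the bus, memory, and so on.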

    [root@localhost ~]# mpstat
    Linux 3.10.0-957.el7.x86_64 (localhost.localdomain)     11/28/2022      _x86_64_        (4 CPU)
    
    02:48:09 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
    02:48:09 PM  all    0.31    0.00    0.28    0.01    0.00    0.00    0.00    0.00    0.00   99.40
    
     mpstat ...... [ interval [ count ] ]
    

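    The interval parameter specifies the amount of time, in seconds, between reports. A value of 0 (or no parameters at all) means that processor statistics are reported for the time since system startup (boot). If interval is not zero, the count parameter can be given together with it; count determines how many reports are generated, interval seconds apart. If interval is given without count, mpstat generates reports continuously.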

    [root@localhost ~]# mpstat 2 5
    Linux 3.10.0-957.el7.x86_64 (localhost.localdomain)     11/28/2022      _x86_64_        (4 CPU)
    
    02:55:10 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
    02:55:12 PM  all    0.50    0.00    0.50    0.00    0.00    0.00    0.00    0.00    0.00   99.00
    02:55:14 PM  all    0.25    0.00    0.63    0.00    0.00    0.00    0.00    0.00    0.00   99.12
    02:55:16 PM  all    0.38    0.00    0.50    0.00    0.00    0.00    0.00    0.00    0.00   99.12
    02:55:18 PM  all    0.50    0.00    0.62    0.00    0.00    0.00    0.00    0.00    0.00   98.88
    02:55:20 PM  all    0.38    0.00    0.50    0.00    0.00    0.00    0.00    0.00    0.00   99.12
    Average:     all    0.40    0.00    0.55    0.00    0.00    0.00    0.00    0.00    0.00   99.05
    

    Display five reports of global statistics across all processors at two-second intervals.

    The data is read from the /proc/stat file:

    open("/proc/stat", O_RDONLY)            = 3
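A minimal Python sketch of how the per-CPU counters in /proc/stat could be parsed. The column order follows proc(5); the function name `parse_proc_stat` and the sample text are made up for illustration, and this is not mpstat's own code (mpstat is written in C).

```python
# Counter names in the order they appear on a "cpu" line of /proc/stat.
FIELDS = ("user", "nice", "system", "idle", "iowait",
          "irq", "softirq", "steal", "guest", "guest_nice")

# Hypothetical sample; real counters are cumulative ticks since boot.
SAMPLE = """\
cpu  1000 0 900 320000 30 0 10 0 0 0
cpu0 250 0 220 80000 10 0 4 0 0 0
cpu1 250 0 230 80000 5 0 2 0 0 0
intr 123456 55 4
ctxt 987654
"""


def parse_proc_stat(text):
    """Return {"all": {...}, "0": {...}, ...} from the text of /proc/stat."""
    cpus = {}
    for line in text.splitlines():
        parts = line.split()
        if not parts or not parts[0].startswith("cpu"):
            continue  # skip intr, ctxt, btime and other non-CPU lines
        name = parts[0][3:] or "all"  # "cpu" -> "all", "cpu0" -> "0"
        values = [int(v) for v in parts[1:1 + len(FIELDS)]]
        # Older kernels expose fewer columns; pad the missing ones with 0.
        values += [0] * (len(FIELDS) - len(values))
        cpus[name] = dict(zip(FIELDS, values))
    return cpus


if __name__ == "__main__":
    stats = parse_proc_stat(SAMPLE)
    print(sorted(stats))
    print(stats["0"]["idle"])
```

On a live system the same function could be fed `open("/proc/stat").read()`; taking two such snapshots an interval apart gives the deltas that an interval report is based on.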
    

    2. mpstat -P

     -P { cpu [,...] | ON | ALL }
    

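    Specifies the processors for which statistics are reported. cpu is the processor number, with processor 0 being the first processor. The ON keyword reports statistics for every online processor, whereas the ALL keyword reports statistics for all processors.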
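The selection rules for -P can be modelled roughly as follows. This is only a sketch of the semantics described above: `select_cpus`, `online`, and `possible` are hypothetical names, and raising an error for an unknown CPU number is a choice made here, not documented mpstat behaviour.

```python
def select_cpus(spec, online, possible):
    """Resolve a -P style argument to a sorted list of CPU numbers.

    spec     -- "ALL", "ON", or a comma-separated list such as "0,3"
    online   -- set of CPU numbers that are currently online
    possible -- set of all CPU numbers known to the system
    """
    if spec == "ALL":
        return sorted(possible)
    if spec == "ON":
        return sorted(online)
    chosen = {int(token) for token in spec.split(",")}
    unknown = chosen - set(possible)
    if unknown:
        raise ValueError("no such CPU: %s" % sorted(unknown))
    return sorted(chosen)


if __name__ == "__main__":
    every_cpu = {0, 1, 2, 3}
    print(select_cpus("1", every_cpu, every_cpu))
    print(select_cpus("ON", {0, 2}, every_cpu))
```

On the 4-CPU machine used in this article all processors are online, which is why the ON and ALL reports list the same rows.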

    Report statistics for processor 1 (the second processor):

    [root@localhost ~]# mpstat -P 1
    Linux 3.10.0-957.el7.x86_64 (localhost.localdomain)     11/28/2022      _x86_64_        (4 CPU)
    
    03:08:34 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
    03:08:34 PM    1    0.28    0.00    0.29    0.00    0.00    0.00    0.00    0.00    0.00   99.43
    

    Report statistics for every online processor:

    [root@localhost ~]# mpstat -P ON
    Linux 3.10.0-957.el7.x86_64 (localhost.localdomain)     11/28/2022      _x86_64_        (4 CPU)
    
    03:09:59 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
    03:09:59 PM  all    0.31    0.00    0.28    0.01    0.00    0.00    0.00    0.00    0.00   99.40
    03:09:59 PM    0    0.31    0.00    0.27    0.00    0.00    0.00    0.00    0.00    0.00   99.41
    03:09:59 PM    1    0.28    0.00    0.29    0.00    0.00    0.00    0.00    0.00    0.00   99.43
    03:09:59 PM    2    0.33    0.00    0.29    0.00    0.00    0.00    0.00    0.00    0.00   99.38
    03:09:59 PM    3    0.34    0.00    0.28    0.01    0.00    0.00    0.00    0.00    0.00   99.37
    

    Report statistics for all processors:

    [root@localhost ~]# mpstat -P ALL
    Linux 3.10.0-957.el7.x86_64 (localhost.localdomain)     11/28/2022      _x86_64_        (4 CPU)
    
    03:11:24 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
    03:11:24 PM  all    0.31    0.00    0.28    0.01    0.00    0.00    0.00    0.00    0.00   99.40
    03:11:24 PM    0    0.31    0.00    0.27    0.00    0.00    0.00    0.00    0.00    0.00   99.41
    03:11:24 PM    1    0.28    0.00    0.29    0.00    0.00    0.00    0.00    0.00    0.00   99.43
    03:11:24 PM    2    0.33    0.00    0.29    0.00    0.00    0.00    0.00    0.00    0.00   99.38
    03:11:24 PM    3    0.34    0.00    0.28    0.01    0.00    0.00    0.00    0.00    0.00   99.37
    

    Meaning of each field:

    CPU
           Processor number. The keyword all indicates that statistics are calculated as averages among all processors.
    
    %usr
           Show the percentage of CPU utilization that occurred while executing at the user level (application).
    
    %nice
           Show the percentage of CPU utilization that occurred while executing at the user level with nice priority.
    
    %sys
           Show the percentage of CPU utilization that occurred while executing at the system level (kernel). Note that this does not include time spent servicing hardware and software interrupts.
    
    %iowait
           Show the percentage of time that the CPU or CPUs were idle during which the system had an outstanding disk I/O request.
    
    %irq
           Show the percentage of time spent by the CPU or CPUs to service hardware interrupts.
    
    %soft
           Show the percentage of time spent by the CPU or CPUs to service software interrupts.
    
    %steal
           Show the percentage of time spent in involuntary wait by the virtual CPU or CPUs while the hypervisor was servicing another virtual processor.
    
    %guest
           Show the percentage of time spent by the CPU or CPUs to run a virtual processor.
    
    %gnice
           Show the percentage of time spent by the CPU or CPUs to run a niced guest.
    
    %idle
           Show the percentage of time that the CPU or CPUs were idle and the system did not have an outstanding disk I/O request.
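The field definitions above can be turned into arithmetic on two snapshots of the /proc/stat counters. The sketch below assumes the usual Linux accounting in which guest time is also counted in user time (and guest_nice in nice), so it subtracts them before computing %usr and %nice; `cpu_percentages` and the sample numbers are illustrative, not taken from the mpstat source.

```python
# Counter names in the order they appear on a "cpu" line of /proc/stat.
FIELDS = ("user", "nice", "system", "idle", "iowait",
          "irq", "softirq", "steal", "guest", "guest_nice")


def cpu_percentages(prev, curr):
    """Return mpstat-style percentages from two counter snapshots (dicts)."""
    delta = {f: curr[f] - prev[f] for f in FIELDS}
    # guest and guest_nice are already part of user and nice, so only the
    # first eight counters contribute to the total number of elapsed ticks.
    total = sum(delta[f] for f in FIELDS[:8])
    if total <= 0:
        return None  # no ticks elapsed between the two snapshots

    def pct(ticks):
        return round(100.0 * ticks / total, 2)

    return {
        "%usr": pct(delta["user"] - delta["guest"]),
        "%nice": pct(delta["nice"] - delta["guest_nice"]),
        "%sys": pct(delta["system"]),
        "%iowait": pct(delta["iowait"]),
        "%irq": pct(delta["irq"]),
        "%soft": pct(delta["softirq"]),
        "%steal": pct(delta["steal"]),
        "%guest": pct(delta["guest"]),
        "%gnice": pct(delta["guest_nice"]),
        "%idle": pct(delta["idle"]),
    }


if __name__ == "__main__":
    before = dict.fromkeys(FIELDS, 0)
    after = dict(before, user=50, system=50, idle=9880, iowait=10, softirq=10)
    print(cpu_percentages(before, after))
```

With a zero-valued first snapshot the result corresponds to a since-boot report; with two snapshots taken an interval apart it corresponds to one interval report.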
    

    For details on each field, see the article "CPU usage and memory usage in the Linux top command".

    3. mpstat -I

    -I { SUM | CPU | SCPU | ALL }
    

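    Report interrupt statistics.

    3.1 mpstat -I SUM

    With the SUM keyword, mpstat reports the total number of interrupts per processor. The following values are displayed:
    CPU: processor number. The keyword all indicates that statistics are calculated as averages among all processors.
    intr/s: the total number of interrupts received per second by the CPU or CPUs.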

    [root@localhost ~]# mpstat -I SUM
    Linux 3.10.0-957.el7.x86_64 (localhost.localdomain)     11/28/2022      _x86_64_        (4 CPU)
    
    03:21:23 PM  CPU    intr/s
    03:21:23 PM  all     93.37
    

    3.2 mpstat -I CPU

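    3.2.1 Data source

    With the CPU keyword, the number of each individual (hard) interrupt received per second by the CPU or CPUs is displayed.
    A hard interrupt is serviced by a hardware interrupt handler, known in Linux as the top half. It has the highest priority, and other interrupts are masked while the handler runs.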

    [root@localhost ~]# mpstat -I CPU
    Linux 3.10.0-957.el7.x86_64 (localhost.localdomain)     11/28/2022      _x86_64_        (4 CPU)
    
    03:23:41 PM  CPU        0/s        1/s        8/s        9/s       12/s       16/s       20/s      120/s      121/s      122/s      123/s      124/s      125/s      126/s      127/s      NMI/s      LOC/s      SPU/s      PMI/s      IWI/s      RTR/s      RES/s      CAL/s      TLB/s      TRM/s      THR/s      DFR/s      MCE/s      MCP/s      ERR/s      MIS/s      PIN/s      NPI/s      PIW/s
    03:23:41 PM    0       0.00       0.00       0.00       0.00       0.00       0.00       0.00       0.00       0.00       0.00       0.00       0.02       4.03       0.00       0.00       0.00      20.01       0.00       0.00       0.19       0.00       0.21       0.00       0.11       0.00       0.00       0.00       0.00       0.00       0.00       0.00       0.00       0.00       0.00
    03:23:41 PM    1       0.00       0.00       0.00       0.00       0.00       0.00       0.00       0.00       0.00       0.00       0.00       0.00       0.00       0.00       0.00       0.00      21.62       0.00       0.00       0.21       0.00       0.15       0.00       0.14       0.00       0.00       0.00       0.00       0.00       0.00       0.00       0.00       0.00       0.00
    03:23:41 PM    2       0.00       0.00       0.00       0.00       0.00       0.00       0.00       0.00       0.00       0.00       0.00       0.00       0.00       0.00       0.00       0.00      22.75       0.00       0.00       0.17       0.00       0.15       0.00       0.07       0.00       0.00       0.00       0.00       0.00       0.00       0.00       0.00       0.00       0.00
    03:23:41 PM    3       0.00       0.00       0.00       0.00       0.00       0.00       0.00       0.00       0.00       0.00       0.00       0.20       0.00       0.00       0.00       0.00      22.86       0.00       0.00       0.25       0.00       0.15       0.00       0.08       0.00       0.00       0.00       0.00       0.00       0.00       0.00       0.00       0.00       0.00
    

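    The data comes from the /proc/interrupts file, which shows hard interrupt activity:
    Note: an interrupt is essentially a special electrical signal sent from a hardware device to the processor. On receiving it, the processor immediately notifies the operating system, which then handles the newly arrived data. Devices raise interrupts without synchronizing with the processor clock, so an interrupt can occur at any time.
    Interrupts are in effect an asynchronous event-handling mechanism that improves the system's ability to handle work concurrently.
    Because an interrupt handler preempts whatever process is running, it needs to finish as quickly as possible to minimize the impact on normal process scheduling.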

    open("/proc/interrupts", O_RDONLY)      = 3
    
    [root@localhost ~]# cat /proc/interrupts
                CPU0       CPU1       CPU2       CPU3
       0:         55          0          0          0  IR-IO-APIC-edge      timer
       1:          4          0          0          0  IR-IO-APIC-edge      i8042
       8:          1          0          0          0  IR-IO-APIC-edge      rtc0
       9:          4          0          0          0  IR-IO-APIC-fasteoi   acpi
      12:          3          3          0          0  IR-IO-APIC-edge      i8042
      16:          0          0          0          0  IR-IO-APIC-fasteoi   i801_smbus
      20:          0          0          0          0  IR-IO-APIC-fasteoi   idma64.0
     120:          0          0          0          0  DMAR_MSI-edge      dmar0
     121:          0          0          0          0  DMAR_MSI-edge      dmar1
     122:          0          0          0          0  IR-PCI-MSI-edge      aerdrv, PCIe PME
     123:        148         16         10          2  IR-PCI-MSI-edge      xhci_hcd
     124:       4738        519        421      54943  IR-PCI-MSI-edge      0000:00:17.0
     125:    1111752          0          0          0  IR-PCI-MSI-edge      enp1s0
     126:         38          1        109          9  IR-PCI-MSI-edge      i915
     127:        541        136        191         85  IR-PCI-MSI-edge      snd_hda_intel:card0
     NMI:         56         52         55         56   Non-maskable interrupts
     LOC:    5504316    5950291    6263079    6292876   Local timer interrupts
     SPU:          0          0          0          0   Spurious interrupts
     PMI:         56         52         55         56   Performance monitoring interrupts
     IWI:      52427      58892      47990      68017   IRQ work interrupts
     RTR:          0          0          0          0   APIC ICR read retries
     RES:      56937      40801      42634      41527   Rescheduling interrupts
     CAL:       1150       1147       1194       1149   Function call interrupts
     TLB:      28982      39043      19609      20986   TLB shootdowns
     TRM:          0          0          0          0   Thermal event interrupts
     THR:          0          0          0          0   Threshold APIC interrupts
     DFR:          0          0          0          0   Deferred Error APIC interrupts
     MCE:          0          0          0          0   Machine check exceptions
     MCP:        918        918        918        918   Machine check polls
     ERR:          0
     MIS:          0
     PIN:          0          0          0          0   Posted-interrupt notification event
     NPI:          0          0          0          0   Nested posted-interrupt event
     PIW:          0          0          0          0   Posted-interrupt wakeup event
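A rough Python sketch of how per-second rates such as those printed by mpstat -I CPU could be derived from two reads of /proc/interrupts. The helper names (`parse_interrupts`, `rates`) and the shortened two-CPU samples are assumptions for illustration; the real tool is written in C and handles more cases.

```python
def parse_interrupts(text):
    """Return {label: [count per CPU]} from the text of /proc/interrupts."""
    lines = text.splitlines()
    ncpu = len(lines[0].split())  # header row: CPU0 CPU1 ...
    table = {}
    for line in lines[1:]:
        label, sep, rest = line.partition(":")
        if not sep:
            continue
        counts = []
        for token in rest.split()[:ncpu]:
            if not token.isdigit():
                break  # reached the controller/device description
            counts.append(int(token))
        table[label.strip()] = counts  # ERR and MIS carry a single counter
    return table


def rates(prev, curr, seconds):
    """Interrupts per second, per CPU, for labels present in both samples."""
    return {
        label: [round((c - p) / seconds, 2)
                for p, c in zip(prev[label], curr[label])]
        for label in curr if label in prev
    }


# Hypothetical samples taken two seconds apart on a two-CPU machine.
FIRST = """\
            CPU0       CPU1
 125:    1111752          0  IR-PCI-MSI-edge      enp1s0
 LOC:    5504316    5950291   Local timer interrupts
 ERR:          0
"""
SECOND = """\
            CPU0       CPU1
 125:    1111760          0  IR-PCI-MSI-edge      enp1s0
 LOC:    5504356    5950335   Local timer interrupts
 ERR:          0
"""

if __name__ == "__main__":
    print(rates(parse_interrupts(FIRST), parse_interrupts(SECOND), 2))
```

Each key of the result corresponds to one column of the mpstat -I CPU report (125/s, LOC/s, and so on), and each list position to one CPU row.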
    

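    Some of the fields:
    NMI (Non-maskable interrupts): NMI is incremented in this case because every timer interrupt generates an NMI (non-maskable interrupt), which the NMI watchdog uses to detect lockups.
    LOC: the local interrupt counter of each CPU's internal APIC.
    SPU: a spurious interrupt is one that was raised and then lowered by some I/O device before the APIC could fully process it. The APIC therefore sees the interrupt but does not know which device it came from. In this case the APIC generates an interrupt with IRQ vector 0xff. It can also be caused by chipset bugs.
    RES (Rescheduling interrupts), CAL (Function call interrupts), TLB (TLB shootdowns): rescheduling, function call, and TLB flush interrupts sent from one CPU to another as the OS requires. Their statistics are typically used by kernel developers and interested users to determine how often interrupts of a given type occur.
    TRM (Thermal event interrupts): raised when the CPU temperature threshold is exceeded. The interrupt may also be generated when the temperature drops back to normal.
    THR (Threshold APIC interrupts): raised when the machine check threshold counter (typically counting ECC-corrected memory or cache errors) exceeds a configurable threshold. Only available on some systems.

    3.2.2 Kernel source analysis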

    // linux-3.10/fs/proc/interrupts.c
    
    /*
     * /proc/interrupts
     */
    static void *int_seq_start(struct seq_file *f, loff_t *pos)
    {
    	return (*pos <= nr_irqs) ? pos : NULL;
    }
    
    static void *int_seq_next(struct seq_file *f, void *v, loff_t *pos)
    {
    	(*pos)++;
    	if (*pos > nr_irqs)
    		return NULL;
    	return pos;
    }
    
    static void int_seq_stop(struct seq_file *f, void *v)
    {
    	/* Nothing to do */
    }
    
    static const struct seq_operations int_seq_ops = {
    	.start = int_seq_start,
    	.next  = int_seq_next,
    	.stop  = int_seq_stop,
    	.show  = show_interrupts
    };
    
    static int interrupts_open(struct inode *inode, struct file *filp)
    {
    	return seq_open(filp, &int_seq_ops);
    }
    
    static const struct file_operations proc_interrupts_operations = {
    	.open		= interrupts_open,
    	.read		= seq_read,
    	.llseek		= seq_lseek,
    	.release	= seq_release,
    };
    
    static int __init proc_interrupts_init(void)
    {
    	proc_create("interrupts", 0, NULL, &proc_interrupts_operations);
    	return 0;
    }
    module_init(proc_interrupts_init);
    

    The show_interrupts function:

    // linux-3.10/kernel/irq/proc.c
    
    int show_interrupts(struct seq_file *p, void *v)
    {
    	......
    	arch_show_interrupts(p, prec);
    	......
    }
    

    arch_show_interrupts is architecture-specific. For x86:

    // linux-3.10/arch/x86/kernel/irq.c
    
    #define irq_stats(x)		(&per_cpu(irq_stat, x))
    /*
     * /proc/interrupts printing for arch specific interrupts
     */
    int arch_show_interrupts(struct seq_file *p, int prec)
    {
    	int j;
    
    	seq_printf(p, "%*s: ", prec, "NMI");
    	for_each_online_cpu(j)
    		seq_printf(p, "%10u ", irq_stats(j)->__nmi_count);
    	seq_printf(p, "  Non-maskable interrupts\n");
    #ifdef CONFIG_X86_LOCAL_APIC
    	seq_printf(p, "%*s: ", prec, "LOC");
    	for_each_online_cpu(j)
    		seq_printf(p, "%10u ", irq_stats(j)->apic_timer_irqs);
    	seq_printf(p, "  Local timer interrupts\n");
    
    	seq_printf(p, "%*s: ", prec, "SPU");
    	for_each_online_cpu(j)
    		seq_printf(p, "%10u ", irq_stats(j)->irq_spurious_count);
    	seq_printf(p, "  Spurious interrupts\n");
    	seq_printf(p, "%*s: ", prec, "PMI");
    	for_each_online_cpu(j)
    		seq_printf(p, "%10u ", irq_stats(j)->apic_perf_irqs);
    	seq_printf(p, "  Performance monitoring interrupts\n");
    	seq_printf(p, "%*s: ", prec, "IWI");
    	for_each_online_cpu(j)
    		seq_printf(p, "%10u ", irq_stats(j)->apic_irq_work_irqs);
    	seq_printf(p, "  IRQ work interrupts\n");
    	seq_printf(p, "%*s: ", prec, "RTR");
    	for_each_online_cpu(j)
    		seq_printf(p, "%10u ", irq_stats(j)->icr_read_retry_count);
    	seq_printf(p, "  APIC ICR read retries\n");
    #endif
    	if (x86_platform_ipi_callback) {
    		seq_printf(p, "%*s: ", prec, "PLT");
    		for_each_online_cpu(j)
    			seq_printf(p, "%10u ", irq_stats(j)->x86_platform_ipis);
    		seq_printf(p, "  Platform interrupts\n");
    	}
    #ifdef CONFIG_SMP
    	seq_printf(p, "%*s: ", prec, "RES");
    	for_each_online_cpu(j)
    		seq_printf(p, "%10u ", irq_stats(j)->irq_resched_count);
    	seq_printf(p, "  Rescheduling interrupts\n");
    	seq_printf(p, "%*s: ", prec, "CAL");
    	for_each_online_cpu(j)
    		seq_printf(p, "%10u ", irq_stats(j)->irq_call_count -
    					irq_stats(j)->irq_tlb_count);
    	seq_printf(p, "  Function call interrupts\n");
    	seq_printf(p, "%*s: ", prec, "TLB");
    	for_each_online_cpu(j)
    		seq_printf(p, "%10u ", irq_stats(j)->irq_tlb_count);
    	seq_printf(p, "  TLB shootdowns\n");
    #endif
    #ifdef CONFIG_X86_THERMAL_VECTOR
    	seq_printf(p, "%*s: ", prec, "TRM");
    	for_each_online_cpu(j)
    		seq_printf(p, "%10u ", irq_stats(j)->irq_thermal_count);
    	seq_printf(p, "  Thermal event interrupts\n");
    #endif
    #ifdef CONFIG_X86_MCE_THRESHOLD
    	seq_printf(p, "%*s: ", prec, "THR");
    	for_each_online_cpu(j)
    		seq_printf(p, "%10u ", irq_stats(j)->irq_threshold_count);
    	seq_printf(p, "  Threshold APIC interrupts\n");
    #endif
    #ifdef CONFIG_X86_MCE
    	seq_printf(p, "%*s: ", prec, "MCE");
    	for_each_online_cpu(j)
    		seq_printf(p, "%10u ", per_cpu(mce_exception_count, j));
    	seq_printf(p, "  Machine check exceptions\n");
    	seq_printf(p, "%*s: ", prec, "MCP");
    	for_each_online_cpu(j)
    		seq_printf(p, "%10u ", per_cpu(mce_poll_count, j));
    	seq_printf(p, "  Machine check polls\n");
    #endif
    	seq_printf(p, "%*s: %10u\n", prec, "ERR", atomic_read(&irq_err_count));
    #if defined(CONFIG_X86_IO_APIC)
    	seq_printf(p, "%*s: %10u\n", prec, "MIS", atomic_read(&irq_mis_count));
    #endif
    	return 0;
    }
    

    As shown, the data is mostly read from the per-CPU memory area. For background on x86_64 per-CPU variables, see: Linux per-cpu

    // linux-3.10/arch/x86/include/asm/hardirq.h
    
    typedef struct {
    	unsigned int __softirq_pending;
    	unsigned int __nmi_count;	/* arch dependent */
    #ifdef CONFIG_X86_LOCAL_APIC
    	unsigned int apic_timer_irqs;	/* arch dependent */
    	unsigned int irq_spurious_count;
    	unsigned int icr_read_retry_count;
    #endif
    #ifdef CONFIG_HAVE_KVM
    	unsigned int kvm_posted_intr_ipis;
    #endif
    	unsigned int x86_platform_ipis;	/* arch dependent */
    	unsigned int apic_perf_irqs;
    	unsigned int apic_irq_work_irqs;
    #ifdef CONFIG_SMP
    	unsigned int irq_resched_count;
    	unsigned int irq_call_count;
    	/*
    	 * irq_tlb_count is double-counted in irq_call_count, so it must be
    	 * subtracted from irq_call_count when displaying irq_call_count
    	 */
    	unsigned int irq_tlb_count;
    #endif
    #ifdef CONFIG_X86_THERMAL_VECTOR
    	unsigned int irq_thermal_count;
    #endif
    #ifdef CONFIG_X86_MCE_THRESHOLD
    	unsigned int irq_threshold_count;
    #endif
    } ____cacheline_aligned irq_cpustat_t;
    
    DECLARE_PER_CPU_SHARED_ALIGNED(irq_cpustat_t, irq_stat);
    
    // linux-3.10/include/linux/irq_cpustat.h
    
    /*
     * Simple wrappers reducing source bloat.  Define all irq_stat fields
     * here, even ones that are arch dependent.  That way we get common
     * definitions instead of differing sets for each arch.
     */
    
    #ifndef __ARCH_IRQ_STAT
    extern irq_cpustat_t irq_stat[];		/* defined in asm/hardirq.h */
    #define __IRQ_STAT(cpu, member)	(irq_stat[cpu].member)
    #endif
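The comment on irq_tlb_count in irq_cpustat_t explains why arch_show_interrupts prints CAL as irq_call_count minus irq_tlb_count. A toy Python model of that bookkeeping, assuming (as the comment implies) that each TLB shootdown is delivered as a function-call interrupt and therefore bumps both counters; the class and function names are invented for this sketch.

```python
from dataclasses import dataclass


@dataclass
class IrqStat:
    """Toy stand-in for two fields of the kernel's per-CPU irq_cpustat_t."""
    irq_call_count: int = 0
    irq_tlb_count: int = 0


def on_function_call(stat):
    stat.irq_call_count += 1


def on_tlb_shootdown(stat):
    # Assumption of this model: the shootdown arrives as a function-call
    # interrupt, so it is counted twice.
    stat.irq_call_count += 1
    stat.irq_tlb_count += 1


def displayed_cal(stat):
    """Value shown in the CAL row: call interrupts excluding shootdowns."""
    return stat.irq_call_count - stat.irq_tlb_count


if __name__ == "__main__":
    cpu0 = IrqStat()
    for _ in range(3):
        on_function_call(cpu0)
    for _ in range(2):
        on_tlb_shootdown(cpu0)
    print(displayed_cal(cpu0), cpu0.irq_tlb_count)
```

Without the subtraction, the CAL and TLB rows would both include every shootdown.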
    

    3.3 mpstat -I SCPU

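    3.3.1 Data source

    With the SCPU keyword, the number of each individual software interrupt received per second by the CPU or CPUs is displayed.
    Softirqs are reserved for the most timing-critical and important bottom-half processing (the top half is the hardware interrupt handler, which has the highest priority). Other interrupts can still be serviced while a softirq is running.
    Among drivers, only the block device and networking subsystems use softirqs directly.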

    [root@localhost ~]# mpstat -I SCPU
    Linux 3.10.0-957.el7.x86_64 (localhost.localdomain)     11/28/2022      _x86_64_        (4 CPU)
    
    04:48:54 PM  CPU       HI/s    TIMER/s   NET_TX/s   NET_RX/s    BLOCK/s BLOCK_IOPOLL/s  TASKLET/s    SCHED/s  HRTIMER/s      RCU/s
    04:48:54 PM    0       0.00      10.62       0.17       4.26       0.02       0.00       0.02       6.63       0.00       3.92
    04:48:54 PM    1       0.00      12.78       0.00       0.03       0.00       0.00       0.00       7.19       0.00       4.89
    04:48:54 PM    2       0.00      12.41       0.00       0.03       0.00       0.00       0.00       7.28       0.00       4.48
    04:48:54 PM    3       0.00      12.97       0.00       0.02       0.19       0.00       0.00       7.13       0.00       4.90
    

    The data is read from the /proc/softirqs file, which shows softirq activity:

    open("/proc/softirqs", O_RDONLY)        = 3
    
    [root@localhost ~]# cat /proc/softirqs
                        CPU0       CPU1       CPU2       CPU3
              HI:         29         12         81          4
           TIMER:    2978642    3586869    3484711    3642228
          NET_TX:      46707          2          2          1
          NET_RX:    1195259       8070       7563       6755
           BLOCK:       5432        776        578      53783
    BLOCK_IOPOLL:          0          0          0          0
         TASKLET:       5769          0          0          0
           SCHED:    1860352    2017852    2042455    1999179
         HRTIMER:          0          0          0          0
             RCU:    1100586    1372825    1258876    1377782
    

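    There are 10 softirq types, each corresponding to a different kind of work. For example, NET_RX is the network receive softirq and NET_TX is the network transmit softirq.

    Details of each type:

    Softirq        Priority   Description
    HI             0          Highest-priority softirq
    TIMER          1          Bottom half of timers
    NET_TX         2          Network transmit softirq
    NET_RX         3          Network receive softirq
    BLOCK          4          Softirq for block devices
    BLOCK_IOPOLL   5          Softirq for block devices (I/O polling)
    TASKLET        6          Softirq backing the tasklet mechanism
    SCHED          7          Process scheduling and load balancing
    HRTIMER        8          High-resolution timers
    RCU            9          Softirq serving RCU

    Softirqs with a higher priority (a lower number, e.g. 0) run before those with a lower priority (e.g. 9).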
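The priority order falls out of how pending softirqs are tracked: one bit per type, scanned from bit 0 upward. The following Python sketch imitates that scan in a simplified form; it is a model of the idea, not the kernel's code, and `raise_softirq` and `run_order` here are illustrative helpers.

```python
# Softirq names indexed by their number (priority), as in softirq_to_name.
SOFTIRQ_NAMES = ["HI", "TIMER", "NET_TX", "NET_RX", "BLOCK",
                 "BLOCK_IOPOLL", "TASKLET", "SCHED", "HRTIMER", "RCU"]


def raise_softirq(pending, name):
    """Return the pending bitmask with the bit for `name` set."""
    return pending | (1 << SOFTIRQ_NAMES.index(name))


def run_order(pending):
    """Return softirq names in the order a low-bit-first scan visits them."""
    order = []
    nr = 0
    while pending:
        if pending & 1:
            order.append(SOFTIRQ_NAMES[nr])
        pending >>= 1
        nr += 1
    return order


if __name__ == "__main__":
    mask = 0
    for name in ("RCU", "NET_RX", "HI"):  # raised in arbitrary order
        mask = raise_softirq(mask, name)
    print(run_order(mask))
```

Regardless of the order in which they were raised, HI is visited first and RCU last.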

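    When the kernel sees a large volume of softirqs (softirqs take precedence over ordinary processes, so ordinary processes could be starved of processor time), softirqs are run in kernel threads. Each CPU has its own softirq kernel thread, named ksoftirqd/<CPU number>.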

    [root@localhost ~]# top -n 1 | grep ksoftirqd
        3 root      20   0       0      0      0 S   0.0  0.0   0:00.89 ksoftirqd/0
       14 root      20   0       0      0      0 S   0.0  0.0   0:00.10 ksoftirqd/1
       19 root      20   0       0      0      0 S   0.0  0.0   0:00.13 ksoftirqd/2
       24 root      20   0       0      0      0 S   0.0  0.0   0:00.18 ksoftirqd/3
    

    Or:

    [root@localhost ~]# ps aux | grep ksoftirq
    root         3  0.0  0.0      0     0 ?        S    Nov25   0:00 [ksoftirqd/0]
    root        14  0.0  0.0      0     0 ?        S    Nov25   0:00 [ksoftirqd/1]
    root        19  0.0  0.0      0     0 ?        S    Nov25   0:00 [ksoftirqd/2]
    root        24  0.0  0.0      0     0 ?        S    Nov25   0:00 [ksoftirqd/3]
    

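    The thread names are enclosed in square brackets, which means ps cannot obtain their command-line arguments (cmdline). In ps output, names in square brackets generally denote kernel threads.

    Note: softirqs take precedence over ordinary processes, and a running softirq can re-raise itself so that it runs again (the networking subsystem does this). If softirqs occur frequently and also keep re-marking themselves as runnable, user-space processes may not get enough run time.
    This is why the ksoftirqd kernel threads exist: when there are many softirqs, the kernel thread takes over processing them. The kernel thread has a low priority (as shown above, its nice value is 0, the same as an ordinary process), which prevents ordinary processes from being starved of run time.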
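The hand-off to ksoftirqd can be pictured with a small simulation. In this sketch, softirqs are processed in passes; a handler may re-raise softirqs, and once a pass budget is used up whatever is still pending is left for the kernel thread. The budget of 10 passes and the names `do_softirq` and `handler` are assumptions for illustration; the real kernel also takes elapsed time and rescheduling needs into account.

```python
def do_softirq(pending, handler, max_restart=10):
    """Process pending softirq numbers in passes.

    handler(nr) runs one softirq and returns the set of softirq numbers it
    re-raised. Returns (passes_run, deferred), where `deferred` is what would
    be handed to ksoftirqd in this model.
    """
    pending = set(pending)
    passes = 0
    while pending and passes < max_restart:
        batch = sorted(pending)  # lowest number (highest priority) first
        pending = set()
        for nr in batch:
            pending |= set(handler(nr))
        passes += 1
    return passes, pending


if __name__ == "__main__":
    NET_RX = 3
    # A handler that never re-raises: everything finishes in one pass.
    print(do_softirq({1, NET_RX}, lambda nr: set()))
    # A handler that always re-raises NET_RX: the budget runs out and the
    # remaining work is deferred instead of monopolising the CPU.
    print(do_softirq({NET_RX}, lambda nr: {NET_RX}))
```

Because the deferred work then runs in an ordinary schedulable thread, the scheduler can interleave it with user processes.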

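    Kernel thread performance issues:
    On Linux, each CPU has a softirq kernel thread named ksoftirqd/<CPU number>. When softirq events arrive too frequently, the kernel thread's own CPU usage climbs and softirqs are no longer handled in time, which leads to performance problems such as network latency and slow scheduling.
    Rising softirq CPU usage is a very common performance problem. Although there are many softirq types, most bottlenecks seen in production involve the network softirqs, especially network receive.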
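One way to spot which softirq type is responsible is to compare two reads of /proc/softirqs and rank the types by growth rate. A minimal sketch, with made-up two-CPU samples and hypothetical helper names (`parse_softirqs`, `busiest`):

```python
def parse_softirqs(text):
    """Return {name: [count per CPU]} from the text of /proc/softirqs."""
    table = {}
    for line in text.splitlines()[1:]:  # skip the CPU0 CPU1 ... header
        name, sep, rest = line.partition(":")
        if sep:
            table[name.strip()] = [int(v) for v in rest.split()]
    return table


def busiest(prev, curr, seconds):
    """Return (name, events per second summed over CPUs), highest first."""
    totals = {
        name: sum(c - p for p, c in zip(prev[name], curr[name])) / seconds
        for name in curr if name in prev
    }
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical samples taken two seconds apart.
FIRST = """\
                    CPU0       CPU1
       TIMER:    2978642    3586869
      NET_RX:    1195259       8070
       SCHED:    1860352    2017852
"""
SECOND = """\
                    CPU0       CPU1
       TIMER:    2978662    3586895
      NET_RX:    1215259       8074
       SCHED:    1860366    2017866
"""

if __name__ == "__main__":
    for name, rate in busiest(parse_softirqs(FIRST), parse_softirqs(SECOND), 2):
        print(name, rate)
```

In these samples NET_RX dominates and almost all of it lands on CPU0, the kind of pattern that `mpstat -I SCPU` makes visible per CPU.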

    3.3.2 Kernel source analysis

    (1) softirqs

    // linux-3.10/fs/proc/softirqs.c
    
    /*
     * /proc/softirqs  ... display the number of softirqs
     */
    static int show_softirqs(struct seq_file *p, void *v)
    {
    	int i, j;
    
    	seq_puts(p, "                    ");
    	for_each_possible_cpu(i)
    		seq_printf(p, "CPU%-8d", i);
    	seq_putc(p, '\n');
    
    	for (i = 0; i < NR_SOFTIRQS; i++) {
    		seq_printf(p, "%12s:", softirq_to_name[i]);
    		for_each_possible_cpu(j)
    			seq_printf(p, " %10u", kstat_softirqs_cpu(i, j));
    		seq_putc(p, '\n');
    	}
    	return 0;
    }
    
    static int softirqs_open(struct inode *inode, struct file *file)
    {
    	return single_open(file, show_softirqs, NULL);
    }
    
    static const struct file_operations proc_softirqs_operations = {
    	.open		= softirqs_open,
    	.read		= seq_read,
    	.llseek		= seq_lseek,
    	.release	= single_release,
    };
    
    static int __init proc_softirqs_init(void)
    {
    	proc_create("softirqs", 0, NULL, &proc_softirqs_operations);
    	return 0;
    }
    module_init(proc_softirqs_init);
    
    // linux-3.10/include/linux/interrupt.h
    
    /* PLEASE, avoid to allocate new softirqs, if you need not _really_ high
       frequency threaded job scheduling. For almost all the purposes
       tasklets are more than enough. F.e. all serial device BHs et
       al. should be converted to tasklets, not to softirqs.
     */
    
    enum
    {
    	HI_SOFTIRQ=0,
    	TIMER_SOFTIRQ,
    	NET_TX_SOFTIRQ,
    	NET_RX_SOFTIRQ,
    	BLOCK_SOFTIRQ,
    	BLOCK_IOPOLL_SOFTIRQ,
    	TASKLET_SOFTIRQ,
    	SCHED_SOFTIRQ,
    	HRTIMER_SOFTIRQ,
    	RCU_SOFTIRQ,    /* Preferable RCU should always be the last softirq */
    
    	NR_SOFTIRQS
    };
    
    /* map softirq index to softirq name. update 'softirq_to_name' in
     * kernel/softirq.c when adding a new softirq.
     */
    extern char *softirq_to_name[NR_SOFTIRQS];
    
    // linux-3.10/kernel/softirq.c
    
    static struct softirq_action softirq_vec[NR_SOFTIRQS] __cacheline_aligned_in_smp;
    
    char *softirq_to_name[NR_SOFTIRQS] = {
    	"HI", "TIMER", "NET_TX", "NET_RX", "BLOCK", "BLOCK_IOPOLL",
    	"TASKLET", "SCHED", "HRTIMER", "RCU"
    };
    

    Define the per-CPU variable struct kernel_stat kstat and export the kstat symbol:

    // linux-3.10/kernel/sched/core.c
    
    DEFINE_PER_CPU(struct kernel_stat, kstat);
    EXPORT_PER_CPU_SYMBOL(kstat);
    
    [root@localhost ~]# cat /proc/kallsyms | grep '\<kstat\>'
    0000000000015b60 A kstat
    
    [root@localhost ~]# cat /proc/kallsyms | grep '\<__per_cpu_start\>'
    0000000000000000 A __per_cpu_start
    [root@localhost ~]# cat /proc/kallsyms | grep '\<__per_cpu_end\>'
    000000000001d000 A __per_cpu_end
    

    kstat lies between __per_cpu_start and __per_cpu_end, so it is a per-CPU variable in the kernel.
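The range check above can be automated with a few lines of Python. This sketch parses kallsyms-style text and tests whether a symbol's address falls between __per_cpu_start and __per_cpu_end; the helper names are invented here, and the sample reuses the addresses from the output above. Note that on many systems /proc/kallsyms shows zeroed addresses to unprivileged users, in which case the check is meaningless.

```python
def parse_kallsyms(text):
    """Return {symbol: address} from 'address type name' lines."""
    symbols = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 3:
            # If a name appears more than once, the last occurrence wins.
            symbols[parts[2]] = int(parts[0], 16)
    return symbols


def is_percpu(symbols, name):
    """True if `name` lies inside the per-CPU section boundaries."""
    start = symbols["__per_cpu_start"]
    end = symbols["__per_cpu_end"]
    return start <= symbols[name] < end


SAMPLE = """\
0000000000000000 A __per_cpu_start
0000000000015b60 A kstat
000000000001d000 A __per_cpu_end
"""

if __name__ == "__main__":
    print(is_percpu(parse_kallsyms(SAMPLE), "kstat"))
```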

    Reading the softirqs data:

    // linux-3.10/include/linux/kernel_stat.h
    
    struct kernel_stat {
    #ifndef CONFIG_GENERIC_HARDIRQS
           unsigned int irqs[NR_IRQS];
    #endif
    	unsigned long irqs_sum;
    	unsigned int softirqs[NR_SOFTIRQS];
    };
    
    DECLARE_PER_CPU(struct kernel_stat, kstat);
    
    #define kstat_cpu(cpu) per_cpu(kstat, cpu)
    
    static inline unsigned int kstat_softirqs_cpu(unsigned int irq, int cpu)
    {
           return kstat_cpu(cpu).softirqs[irq];
    }
    

    (2) ksoftirqd

    ksoftirqd is defined as a pointer to struct task_struct stored in per-CPU memory:

    // linux-3.10/kernel/softirq.c
    
    DEFINE_PER_CPU(struct task_struct *, ksoftirqd);
    
    /*
     * we cannot loop indefinitely here to avoid userspace starvation,
     * but we also don't want to introduce a worst case 1/HZ latency
     * to the pending events, so lets the scheduler to balance
     * the softirq load for us.
     */
    static void wakeup_softirqd(void)
    {
    	/* Interrupts are disabled: no need to stop preemption */
    	struct task_struct *tsk = __this_cpu_read(ksoftirqd);
    
    	if (tsk && tsk->state != TASK_RUNNING)
    		wake_up_process(tsk);
    }
    

    Declaration of ksoftirqd:

    // linux-3.10/include/linux/interrupt.h
    
    DECLARE_PER_CPU(struct task_struct *, ksoftirqd);
    
    static inline struct task_struct *this_cpu_ksoftirqd(void)
    {
    	return this_cpu_read(ksoftirqd);
    }
    

    References

    Linux kernel 3.10.0

    Linux Kernel Development (Linux内核设计与实现)
    Geek Time (极客时间): Linux Performance Optimization in Practice (Linux性能优化实战)

  • Original article: https://blog.csdn.net/weixin_45030965/article/details/128078936