• Linux 系统中提供CPU性能分析工具整理


    Linux 系统中提供CPU性能分析工具整理

    汇总

    在这里插入图片描述

    查看CPU信息

    在linux操作系统中,CPU的信息在启动的过程中被装载到虚拟目录/proc下的cpuinfo文件中,我们可以通过 cat /proc/cpuinfo 查看一下:

    cat /proc/cpuinfo
    
    • 1

    显示如下:

    root@thead-910:~# cat /proc/cpuinfo
    processor       : 0
    hart            : 0
    isa             : rv64imafdcsu
    mmu             : sv39
    model name      : T-HEAD C910
    freq            : 1.2GHz
    icache          : 64kB
    dcache          : 64kB
    l2cache         : 2MB
    tlb             : 1024 4-ways
    cache line      : 64Bytes
    address sizes   : 40 bits physical, 39 bits virtual
    vector version  : 0.7.1
    
    processor       : 1
    hart            : 1
    isa             : rv64imafdcsu
    mmu             : sv39
    model name      : T-HEAD C910
    freq            : 1.2GHz
    icache          : 64kB
    dcache          : 64kB
    l2cache         : 2MB
    tlb             : 1024 4-ways
    cache line      : 64Bytes
    address sizes   : 40 bits physical, 39 bits virtual
    vector version  : 0.7.1
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18
    • 19
    • 20
    • 21
    • 22
    • 23
    • 24
    • 25
    • 26
    • 27
    • 28

    通过 cpuinfo 可以看到主要信息:

    两颗 RISC-V 64位处理器,型号 T-HEAD C910
    CPU 主频:1.2GHz
    支持的 ISA 特性:
    i: 基础整形指令集
    m: 整型乘除法指令扩展
    a: 原子操作指令扩展
    f: 单精度指令扩展
    d: 双精度指令扩展
    c: 16位压缩指令扩展
    s: 特权态支持
    u: 用户态支持
    I/D CACHE: 64KB
    二级 CACHE: 2MB
    MMU TLB:4路1024个表项
    物理地址位宽:40位
    虚拟地址位宽: 39位
    矢量加速指令集版本: 0.7.1
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17

    测量处理器运算能力 dhrystone

    Dhrystone是测量处理器运算能力的最常见基准程序之一,常用于处理器的整型运算性能的测量。Dhrystone的计量单位为每秒计算多少次Dhrystone,后来把在VAX-11/780机器上的测试结果1757 Dhrystones/s定义为1 Dhrystone MIPS(百万条指令每秒)。但其也有许多不足,Dhrystone不仅不适合于作为嵌入式系统的测试向量,甚至在其大多数场合下都不适合进行应用。

    安装 dhrystone:

    apt install -y dhrystone
    
    • 1
    ICE-EVB 开发板测试:
    
    dhrystone
    Copy
    在 ICE-EVB 开发板上运行结果如下:
    
    Dhrystone Benchmark, Version 2.1 (Language: C)
    
    Execution starts, 600000000 runs through Dhrystone
    Execution ends
    
    Final values of the variables used in the benchmark:
    
    Int_Glob:            5
            should be:   5
    Bool_Glob:           1
            should be:   1
    Ch_1_Glob:           A
            should be:   A
    Ch_2_Glob:           B
            should be:   B
    Arr_1_Glob[8]:       7
            should be:   7
    Arr_2_Glob[8][7]:    600000010
            should be:   Number_Of_Runs + 10
    Ptr_Glob->
      Ptr_Comp:          481216
            should be:   (implementation-dependent)
      Discr:             0
            should be:   0
      Enum_Comp:         2
            should be:   2
      Int_Comp:          17
            should be:   17
      Str_Comp:          DHRYSTONE PROGRAM, SOME STRING
            should be:   DHRYSTONE PROGRAM, SOME STRING
    Next_Ptr_Glob->
      Ptr_Comp:          481216
            should be:   (implementation-dependent), same as above
      Discr:             0
            should be:   0
      Enum_Comp:         1
            should be:   1
      Int_Comp:          18
            should be:   18
      Str_Comp:          DHRYSTONE PROGRAM, SOME STRING
            should be:   DHRYSTONE PROGRAM, SOME STRING
    Int_1_Loc:           5
            should be:   5
    Int_2_Loc:           13
            should be:   13
    Int_3_Loc:           7
            should be:   7
    Enum_Loc:            1
            should be:   1
    Str_1_Loc:           DHRYSTONE PROGRAM, 1'ST STRING
            should be:   DHRYSTONE PROGRAM, 1'ST STRING
    Str_2_Loc:           DHRYSTONE PROGRAM, 2'ND STRING
            should be:   DHRYSTONE PROGRAM, 2'ND STRING
    
    Microseconds for one run through Dhrystone:    0.1 
    Dhrystones per Second:                      12329495.0
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18
    • 19
    • 20
    • 21
    • 22
    • 23
    • 24
    • 25
    • 26
    • 27
    • 28
    • 29
    • 30
    • 31
    • 32
    • 33
    • 34
    • 35
    • 36
    • 37
    • 38
    • 39
    • 40
    • 41
    • 42
    • 43
    • 44
    • 45
    • 46
    • 47
    • 48
    • 49
    • 50
    • 51
    • 52
    • 53
    • 54
    • 55
    • 56
    • 57
    • 58
    • 59
    • 60
    • 61
    • 62

    换算:

    12329495/1757/1200=5.8 DMIPS/MHz
    
    • 1

    衡量CPU性能 coremark

    CoreMark是用来衡量CPU性能的标准。包含如下的运算法则:列举(寻找并排序),数学矩阵操作(普通矩阵运算)和状态机(用来确定输入流中是否包含有效数字),最后还包括CRC(循环冗余校验)。

    安装 coremark:

    apt install -y coremark
    
    • 1

    ICE-EVB 开发板测试命令:

    coremark
    
    • 1

    在 ICE-EVB 开发板上运行结果如下:

    2K performance run parameters for coremark.
    CoreMark Size    : 666
    Total ticks      : 13193
    Total time (secs): 13.193000
    Iterations/Sec   : 8337.754870
    Iterations       : 110000
    Compiler version : GCC8.4.0
    Compiler flags   : -O2 -O3 -static -funroll-all-loops -finline-limit=500 -fgcse-sm -fno-schedule-insns --param max-rtl-if-conversion-unpredictable-cost=100 -msignedness-cmpiv -fno-code-hoisting -mno-thread-jumps1 -mno-iv-adjust-addr-cost -mno-expand-split-imm  -lrt
    Memory location  : Please put data memory location here
                (e.g. code in flash, data on heap etc)
    seedcrc          : 0xe9f5
    [0]crclist       : 0xe714
    [0]crcmatrix     : 0x1fd7
    [0]crcstate      : 0x8e3a
    [0]crcfinal      : 0x33ff
    Correct operation validated. See readme.txt for run and reporting rules.
    CoreMark 1.0 : 8337.754870 / GCC8.4.0 -O2 -O3 -static -funroll-all-loops -finline-limit=500 -fgcse-sm -fno-schedule-insns --param max-rtl-if-conversion-unpredictable-cost=100 -msignedness-cmpiv -fno-code-hoisting -mno-thread-jumps1 -mno-iv-adjust-addr-cost -mno-expand-split-imm  -lrt / Heap
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17

    换算:

    8337.754870 / 1200 = 6.948
    
    • 1

    其中 1200表示 1200MHz,也就是 1.2GHz,也即 ICE 开发板的 CPU 频率。

    whetstone

    whetstone是一个用于测量CPU浮点计算性能的工具。

    安装 whetstone:

    apt install -y whetstone
    
    • 1

    ICE-EVB 开发板测试:

    root@thead-910:~# whetstone 1000000
    Loops: 1000000, Iterations: 1, Duration: 19.778 sec.
    C Converted Double Precision Whetstones: 5056.029 MIPS
    
    • 1
    • 2
    • 3

    换算:

    5056.029/1200=4.2 MIPS/MHz
    
    • 1

    其中 1200表示 1200MHz,也就是 1.2GHz,也即 ICE 开发板的 CPU 频率。

    linpack

    通过利用高性能计算机,用高斯消元法求解N元一次稠密线性代数方程组的测试,评价高性能计算机的浮点性能。

    安装 linpack:

    apt install -y linpack
    
    • 1

    ICE-EVB 开发板测试:

    root@thead-910:~# linpack
    Unrolled Double Precision Linpack
    
    Unrolled Double Precision Linpack
    
         norm. resid      resid           machep         x[0]-1        x[n-1]-1
           2.7        2.40728912e-13  2.22044605e-16  4.08562073e-14 -2.94209102e-14
        times are reported for matrices of order   200
          dgefa      dgesl      total       kflops     unit      ratio
     times for array with leading dimension of  201
           0.01       0.00       0.01     644521       0.00       0.15
           0.01       0.00       0.01     646445       0.00       0.15
           0.01       0.00       0.01     647683       0.00       0.15
           0.01       0.00       0.01     651032       0.00       0.15
     times for array with leading dimension of 200
           0.01       0.00       0.01     691623       0.00       0.14
           0.01       0.00       0.01     693661       0.00       0.14
           0.01       0.00       0.01     696428       0.00       0.14
           0.01       0.00       0.01     696383       0.00       0.14
    Unrolled Double  Precision 651032 Kflops ; 10 Reps
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18
    • 19
    • 20

    换算:

    651032/1200=542.5 KFLOPS/MHz
    
    • 1

    其中 1200表示 1200MHz,也就是 1.2GHz,也即 ICE 开发板的 CPU 频率。

    stream

    Stream测试是内存测试中业界公认的内存带宽性能测试基准工具。

    安装 stream:

    apt install -y stream
    
    • 1

    ICE-EVB 开发板测试:

    root@thead-910:~# stream
    -------------------------------------------------------------
    STREAM version $Revision: 5.10 $
    -------------------------------------------------------------
    This system uses 8 bytes per array element.
    -------------------------------------------------------------
    Array size = 10000000 (elements), Offset = 0 (elements)
    Memory per array = 76.3 MiB (= 0.1 GiB).
    Total memory required = 228.9 MiB (= 0.2 GiB).
    Each kernel will be executed 10 times.
     The *best* time for each kernel (excluding the first iteration)
     will be used to compute the reported bandwidth.
    -------------------------------------------------------------
    Your clock granularity/precision appears to be 1 microseconds.
    Each test below will take on the order of 30088 microseconds.
       (= 30088 clock ticks)
    Increase the size of the arrays if this shows that
    you are not getting at least 20 clock ticks per test.
    -------------------------------------------------------------
    WARNING -- The above is only a rough guideline.
    For best results, please be sure you know the
    precision of your system timer.
    -------------------------------------------------------------
    Function    Best Rate MB/s  Avg time     Min time     Max time
    Copy:            6965.0     0.022986     0.022972     0.023003
    Scale:           7020.0     0.022831     0.022792     0.022890
    Add:             6095.7     0.039484     0.039372     0.039601
    Triad:           6105.9     0.040019     0.039306     0.042522
    -------------------------------------------------------------
    Solution Validates: avg error less than 1.000000e-13 on all three arrays
    -------------------------------------------------------------
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18
    • 19
    • 20
    • 21
    • 22
    • 23
    • 24
    • 25
    • 26
    • 27
    • 28
    • 29
    • 30
    • 31

    sysbench

    sysbench是一个开源的、模块化的、跨平台的多线程性能测试工具,可以用来进行CPU、内存、磁盘I/O、线程、数据库的性能测试。

    安装 sysbench:

    apt install -y sysbench
    
    • 1

    ICE-EVB 开发板测试:

    测试线程:

    bash root@thead-910:/# ./sysbench --test=threads --num-threads=32 --thread-yields=100 --thread-lock=8 run sysbench 0.5: multi-threaded system evaluation benchmark
    
    Running the test with following options: Number of threads: 32 Random number generator seed is 0 and will be ignored
    
    Threads started!
    
    General statistics: total time: 2.9138s total number of events: 10000 total time taken by event execution: 93.1165s response time: min: 0.14ms avg: 9.31ms max: 82.38ms approx. 95 percentile: 32.83ms
    
    Threads fairness: events (avg/stddev): 312.5000/14.83 execution time (avg/stddev): 2.9099/0.00
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9

    换算:

      14.83/2.9099=5.1
    
    • 1

    其他测试项目:

    • 测试CPU:

      root@thead-910:~# ./sysbench --test=cpu --cpu-max-prime=20000 --num-threads=2 run
      sysbench 0.5:  multi-threaded system evaluation benchmark
      
      Running the test with following options:
      Number of threads: 2
      Random number generator seed is 0 and will be ignored
      
      
      Primer numbers limit: 20000
      
      Threads started!
      
      
      General statistics:
          total time:                          15.0576s
          total number of events:              10000
          total time taken by event execution: 30.1038s
          response time:
               min:                                  3.00ms
               avg:                                  3.01ms
               max:                                  3.12ms
               approx.  95 percentile:               3.01ms
      
      Threads fairness:
          events (avg/stddev):           5000.0000/0.00
          execution time (avg/stddev):   15.0519/0.00
      
      
      • 1
      • 2
      • 3
      • 4
      • 5
      • 6
      • 7
      • 8
      • 9
      • 10
      • 11
      • 12
      • 13
      • 14
      • 15
      • 16
      • 17
      • 18
      • 19
      • 20
      • 21
      • 22
      • 23
      • 24
      • 25
      • 26
      • 27
    • 测试IO:

    root@thead-910:~# .sysbench --test=fileio --num-threads=16 --file-total-size=2G --file-test-mode=rndrw prepare
    sysbench 0.5:  multi-threaded system evaluation benchmark
    
    128 files, 16384Kb each, 2048Mb total
    Creating files for the test...
    Extra file open flags: 0
    Creating file test_file.0
    Creating file test_file.1
    ...
    Creating file test_file.125
    Creating file test_file.126
    Creating file test_file.127
    2147483648 bytes written in 110.50 seconds (18.53 MB/sec).
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    root@thead-910:~# sysbench --test=fileio --num-threads=20 --file-total-size=2G --file-test-mode=rndrw run sysbench 0.5: multi-threaded system evaluation benchmark
    
    Running the test with following options: Number of threads: 20 Random number generator seed is 0 and will be ignored
    
    Extra file open flags: 0 128 files, 16Mb each 2Gb total file size Block size 16Kb Number of IO requests: 10000 Read/Write ratio for combined random IO test: 1.50 Periodic FSYNC enabled, calling fsync() each 100 requests. Calling fsync() at the end of test, Enabled. Using synchronous I/O mode Doing random r/w test Threads started!
    
    Operations performed: 6000 reads, 4000 writes, 12800 Other = 22800 Total Read 93.75Mb Written 62.5Mb Total transferred 156.25Mb (34.834Mb/sec) 2229.38 Requests/sec executed
    
    General statistics: total time: 4.4856s total number of events: 10000 total time taken by event execution: 0.8171s response time: min: 0.01ms avg: 0.08ms max: 20.83ms approx. 95 percentile: 0.08ms
    
    Threads fairness: events (avg/stddev): 500.0000/179.75 execution time (avg/stddev): 0.0409/0.02
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
      root@thead-910:~# ./sysbench --test=fileio --num-threads=20 --file-total-size=2G --file-test-mode=rndrw cleanup
      sysbench 0.5:  multi-threaded system evaluation benchmark
    
      Removing test files...
    
    • 1
    • 2
    • 3
    • 4
    • 测试内存:
    root@thead-910:~# sysbench --test=memory --memory-block-size=8k --memory-total-size=1G run sysbench 0.5: multi-threaded system evaluation benchmark
    
    Running the test with following options: Number of threads: 1 Random number generator seed is 0 and will be ignored
    
    Threads started!
    
    Operations performed: 131072 (374615.49 ops/sec)
    
    1024.00 MB transferred (2926.68 MB/sec)
    
    General statistics: total time: 0.3499s total number of events: 131072 total time taken by event execution: 0.2768s response time: min: 0.00ms avg: 0.00ms max: 0.06ms approx. 95 percentile: 0.00ms
    
    Threads fairness: events (avg/stddev): 131072.0000/0.00 execution time (avg/stddev): 0.2768/0.00
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 测试mutex:
      root@thead-910:~# sysbench --test=mutex --num-threads=64 run
      sysbench 0.5:  multi-threaded system evaluation benchmark
    
      Running the test with following options:
      Number of threads: 64
      Random number generator seed is 0 and will be ignored
    
    
      Threads started!
    
    
      General statistics:
          total time:                          1.2616s
          total number of events:              64
          total time taken by event execution: 76.6870s
          response time:
               min:                               1066.46ms
               avg:                               1198.23ms
               max:                               1254.10ms
               approx.  95 percentile:            1251.30ms
    
      Threads fairness:
          events (avg/stddev):           1.0000/0.00
          execution time (avg/stddev):   1.1982/0.05
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18
    • 19
    • 20
    • 21
    • 22
    • 23
    • 24

    nbench

    nbench是一个简单的用于测试处理器、存储器性能的基准测试程序。

    安装 nbench:

    apt install -y nbench
    
    • 1

    ICE-EVB 开发板测试:

    root@thead-910:~# nbench
    
    BYTEmark* Native Mode Benchmark ver. 2 (10/95)
    Index-split by Andrew D. Balsa (11/97)
    Linux/Unix* port by Uwe F. Mayer (12/96,11/97)
    
    TEST                : Iterations/sec.  : Old Index   : New Index
                        :                  : Pentium 90* : AMD K6/233*
    --------------------:------------------:-------------:------------
    NUMERIC SORT        :          676.01  :      17.34  :       5.69
    STRING SORT         :          148.55  :      66.38  :      10.27
    BITFIELD            :      1.6208e+08  :      27.80  :       5.81
    FP EMULATION        :          154.38  :      74.08  :      17.09
    FOURIER             :           21743  :      24.73  :      13.89
    ASSIGNMENT          :           17.95  :      68.30  :      17.72
    IDEA                :          3738.9  :      57.19  :      16.98
    HUFFMAN             :          1657.3  :      45.96  :      14.68
    NEURAL NET          :           14.84  :      23.84  :      10.03
    LU DECOMPOSITION    :          649.13  :      33.63  :      24.28
    ==========================ORIGINAL BYTEMARK RESULTS==========================
    INTEGER INDEX       : 45.842
    FLOATING-POINT INDEX: 27.063
    Baseline (MSDOS*)   : Pentium* 90, 256 KB L2-cache, Watcom* compiler 10.0
    ==============================LINUX DATA BELOW===============================
    CPU                 : Dual T-HEAD C910
    L2 Cache            :
    OS                  : Linux 5.10.4
    C compiler          : riscv64-unknown-linux-gnu-gcc
    libc                : static
    MEMORY INDEX        : 10.186
    INTEGER INDEX       : 12.479
    FLOATING-POINT INDEX: 15.010
    Baseline (LINUX)    : AMD K6/233*, 512 KB L2-cache, gcc 2.7.2.3, libc-5.4.38
    * Trademarks are property of their respective holder.
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18
    • 19
    • 20
    • 21
    • 22
    • 23
    • 24
    • 25
    • 26
    • 27
    • 28
    • 29
    • 30
    • 31
    • 32
    • 33
    • 34

    cache_calibrator

    cache_calibrator 可以测量 cache 相关的性能参数。

    安装 cache_calibrator:

    apt install -y cache-calibrator
    
    • 1

    ICE-EVB 开发板测试:

    cache_calibrator 1200 10M file
    
    • 1

    在 ICE-EVB 开发板上运行结果如下:

    Calibrator v0.9e
    (by Stefan.Manegold@cwi.nl, http://www.cwi.nl/~manegold/)
    e16a9010 274364796944 4096    16
    e16a9fff 274364801023 4096  4095
    e16aa000 274364801024 4096     0
    
    MINTIME = 10000
    
    analyzing cache throughput...
          range      stride       spots     brutto-  netto-time
       10485760           8     1310720       17472         273
    
    analyzing cache latency...
          range      stride       spots     brutto-  netto-time
       10485760           8     1310720      139776         273
    
    analyzing TLB latency...
          range      stride       spots     brutto-  netto-time
        1474560        1152        1280       18176        1136
    
    
    CPU loop + L1 access:       2.50 ns =   3 cy
                 ( delay:       0.00 ns =   0 cy )
    
    caches:
    level  size    linesize   miss-latency        replace-time
      1     64 KB  128 bytes    9.19 ns =  11 cy    9.16 ns =  11 cy
      2      2 MB  128 bytes   19.05 ns =  23 cy   19.09 ns =  23 cy
    
    TLBs:
    level #entries  pagesize  miss-latency
      1       20       4 KB     3.34 ns =   4 cy 
    
    # ls
    file.cache-replace-time.gp
    file.TLB-miss-latency.data
    file.TLB-miss-latency.gp
    file.cache-miss-latency.data
    file.cache-miss-latency.gp
    file.cache-replace-time.data
    
    
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18
    • 19
    • 20
    • 21
    • 22
    • 23
    • 24
    • 25
    • 26
    • 27
    • 28
    • 29
    • 30
    • 31
    • 32
    • 33
    • 34
    • 35
    • 36
    • 37
    • 38
    • 39
    • 40
    • 41
    • 42

    其中 1200表示 1200MHz,也就是 1.2GHz,也即 ICE 开发板的 CPU 频率;跑完后生成了很多以“file”开头的文件,可以观测 cache 各个方面的性能。

    iperf3

    iperf 用于测量网络性能。

    安装iperf3:

    apt install -y iperf3
    
    • 1

    启动 perf3 server:

    iperf3 -s > /dev/null &
    
    • 1

    开始测试:

    iperf3 -c 127.0.0.1
    
    • 1

    在ICE 开发板上,测试数组如下:

    root@thead-910:~# iperf3 -c 127.0.0.1
    Connecting to host 127.0.0.1, port 5201
    [  5] local 127.0.0.1 port 51114 connected to 127.0.0.1 port 5201
    [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
    [  5]   0.00-1.00   sec   454 MBytes  3.80 Gbits/sec    0   1.12 MBytes
    [  5]   1.00-2.00   sec   452 MBytes  3.79 Gbits/sec    0   1.12 MBytes
    [  5]   2.00-3.00   sec   430 MBytes  3.61 Gbits/sec    0   1.12 MBytes
    [  5]   3.00-4.00   sec   445 MBytes  3.73 Gbits/sec    0   1.12 MBytes
    [  5]   4.00-5.00   sec   452 MBytes  3.79 Gbits/sec    0   1.50 MBytes
    [  5]   5.00-6.00   sec   442 MBytes  3.71 Gbits/sec    0   1.50 MBytes
    [  5]   6.00-7.00   sec   448 MBytes  3.76 Gbits/sec    0   1.94 MBytes
    [  5]   7.00-8.00   sec   451 MBytes  3.78 Gbits/sec    0   2.12 MBytes
    [  5]   8.00-9.00   sec   464 MBytes  3.90 Gbits/sec    0   2.81 MBytes
    [  5]   9.00-10.00  sec   458 MBytes  3.84 Gbits/sec    0   3.00 MBytes
    - - - - - - - - - - - - - - - - - - - - - - - - -
    [ ID] Interval           Transfer     Bitrate         Retr
    [  5]   0.00-10.00  sec  4.39 GBytes  3.77 Gbits/sec    0             sender
    [  5]   0.00-10.00  sec  4.39 GBytes  3.77 Gbits/sec                  receiver
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18

    如果需要测试物理网卡的速度,可以在远程电脑上启动 iperf3 服务,然后再通过 iperf3 -c 来测试网速。

    压力测试神器 stress_ng

    stress-ng是stress的加强版,完全兼容stress,并在此基础上增加了几百个参数,堪称压测工具中的瑞士军刀。

    安装 stress-ng:

    apt install -y stress-ng
    
    • 1
    应用场景1:CPU 密集型进程(使用CPU的进程)
    
    stress-ng --cpu 2 --cpu-method all
    
    应用场景2:I/O 密集型进程(等待IO的进程)
    
    stress-ng -i 4 --hdd 1 --timeout 600
    
    应用场景3:大量进程的场景(等待CPU的进程->进程间会争抢CPU)
    
    stress-ng -c 16 --timeout 600
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11

    性能分析工具 – perf

    Perf 是内置于 Linux 内核源码树中的性能剖析(profiling)工具,它基于事件采样原理,以性能事件为基础,支持针对处理器相关性能指标与操作系统相关性能指标的性能剖析,常用于性能瓶颈的查找与热点代码的定位。

    通过它,应用程序可以利用 PMU,tracepoint 和内核中的特殊计数器来进行性能统计。它不但可以分析指定应用程序的性能问题 (per thread),也可以用来分析内核的性能问题,当然也可以同时分析应用代码和内核,从而全面理解应用程序中的性能瓶颈。

    使用 perf,您可以分析程序运行期间发生的硬件事件,比如 instructions retired ,processor clock cycles 等;您也可以分析软件事件,比如 Page Fault 和进程切换。这使得 Perf 拥有了众多的性能分析能力,举例来说,使用 Perf 可以计算每个时钟周期内的指令数,称为 IPC,IPC 偏低表明代码没有很好地利用 CPU。Perf 还可以对程序进行函数级别的采样,从而了解程序的性能瓶颈究竟在哪里等等。Perf 还可以替代 strace,可以添加动态内核 probe 点,还可以做 benchmark 衡量调度器的好坏。

    安装perf

    apt install perf
    
    • 1

    装完之后就可以使用perf命令了。

     usage: perf [--version] [--help] [OPTIONS] COMMAND [ARGS]
    
     The most commonly used perf commands are:
       annotate        Read perf.data (created by perf record) and display annotatedd code
       archive         Create archive with object files with build-ids found in perff.data file
       bench           General framework for benchmark suites
       buildid-cache   Manage build-id cache.
       buildid-list    List the buildids in a perf.data file
       c2c             Shared Data C2C/HITM Analyzer.
       config          Get and set variables in a configuration file.
       data            Data file related processing
       diff            Read perf.data files and display the differential profile
       evlist          List the event names in a perf.data file
       ftrace          simple wrapper for kernel's ftrace functionality
       inject          Filter to augment the events stream with additional informatiion
       kallsyms        Searches running kernel for symbols
       kmem            Tool to trace/measure kernel memory properties
       kvm             Tool to trace/measure kvm guest os
       list            List all symbolic event types
       lock            Analyze lock events
       mem             Profile memory accesses
       record          Run a command and record its profile into perf.data
       report          Read perf.data (created by perf record) and display the profiile
       sched           Tool to trace/measure scheduler properties (latencies)
       script          Read perf.data (created by perf record) and display trace outtput
       stat            Run a command and gather performance counter statistics
       test            Runs sanity tests.
       timechart       Tool to visualize total system behavior during a workload
       top             System profiling tool.
       version         display the version of perf binary
    
     See 'perf help COMMAND' for more information on a specific command.
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18
    • 19
    • 20
    • 21
    • 22
    • 23
    • 24
    • 25
    • 26
    • 27
    • 28
    • 29
    • 30
    • 31
    • 32
    • 安装flamegraph

    火焰图是脚本,只需要下载:

    git clone https://github.com/brendangregg/FlameGraph.git
    
    • 1

    查看当前软硬件环境、支持的性能事件
    查看所有分类事件的个数:

    perf list | awk -F: '/Tracepoint event/ { lib[$1]++ } END { for (l in lib) { printf " %-16s %d\n", l, lib[l] } }' | sort | column
    
    • 1

    在 ICE 芯片运行结果如下:

    root@thead-910:~# perf list | awk -F: '/Tracepoint event/ { lib[$1]++ } END { for (l in lib) { printf " %-16s %d\n", l, lib[l] } }' | sort | column
       9p             3        iomap          9        random         16
       alarmtimer     4        irq            5        ras            4
       block          19       jbd2           17       raw_syscalls   2
       bpf_test_run   1        kmem           13       rcu            1
       bpf_trace      1        kyber          3        regmap         15
       cgroup         13       mdio           1        rpcgss         26
       clk            16       migrate        1        sched          20
       compaction     14       mmap           1        signal         2
       cpuhp          3        mmc            2        skb            3
       dma_fence      7        module         5        smbus          4
       dwc3           15       napi           1        sock           3
       ext4           117      neigh          7        spi            7
       fib            1        net            18       sunrpc         127
       fib6           1        nfs            58       swiotlb        1
       filelock       12       nfs4           87       task           2
       filemap        4        oom            8        tcp            7
       ftrace         1        page_pool      4        timer          12
       gadget         24       pagemap        2        udp            1
       gpio           2        percpu         5        vmscan         14
       i2c            4        power          22       workqueue      4
       initcall       3        printk         1        writeback      30
       io_uring       14       qdisc          4        xdp            12
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18
    • 19
    • 20
    • 21
    • 22
    • 23
    • flamegraph火焰图

    如果希望了解CPU在一段时间内的都运行了哪些函数以及各函数都消耗了多少时间,就可以使用On CPU火焰图,这种火焰图基于cpu-cycles事件进行采样,然后通过svg图片格式展现出来

    dd if=/dev/zero of=/tmp/testfile bs=4K count=102400 &
    perf record -e cpu-cycles -a -g sleep 1
    perf script > perf.unfold
    cd FlameGraph
    ./stackcollapse-perf.pl < ../perf.unfold | ./flamegraph.pl > ../perf.svg
    
    • 1
    • 2
    • 3
    • 4
    • 5

    首先在后台启动一个dd命令,让它持续运行一段时间,然后开启perf record,记录一秒钟内cpu都运行了多少个cpu-cycles,也就是时间(同时使能-g,就会一并记录运行的函数以及调用关系),再利用perf script命令将perf.data转成perf.unfold,最后利用FlameGraph工具将其转换成一个perf.svg,这是一个图形文件,用浏览器打开后会得到这样图记录着函数调用关系及其cpu-cycles(时间)占比,就像一缕缕升起的火苗,所以被称之为火焰图。

    火焰图还可以通过鼠标点击放大,观察其细节。

  • 相关阅读:
    精通Nmap:网络扫描与安全的终极武器
    车辆信息查询API-支持车牌号码快速查询
    【LeetCode】49. 字母异位词分组
    mysql集群使用nginx配置负载均衡
    一篇文章教会你写一个贪吃蛇小游戏(纯C语言)
    vue-draggable-resizable 插件使用总结
    国内程序员真的不如国外国外程序员?到底差在哪里?
    (附源码)ssm人力资源管理系统 毕业设计 271621
    ROBOGUIDE软件:FANUC机器人弧焊焊接起始点接触寻位虚拟仿真
    lua学习笔记
  • 原文地址:https://blog.csdn.net/heroybc/article/details/133804896