• Linux Memory Management: Overview and Deep Dive


    Why study memory management (WHY)

    Out-of-bounds access: watch for memory problems caused by type errors or arithmetic overflow.

    For example, a signed char wraps from 127 to -128 after 127+1, which can turn into out-of-bounds reads or corrupted memory once the value is used as an index or offset.
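    A minimal sketch of that wrap-around (signed overflow is undefined behavior in C/C++; -128 is the typical two's-complement result):

        #include <cstdio>

        int main() {
            signed char c = 127;    // maximum 8-bit signed value
            c = c + 1;              // typically wraps to -128
            printf("c = %d\n", c);  // prints -128: used as an index, this now points far out of bounds
            return 0;
        }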

    Avoiding memory leaks

    Understand how allocation and release work, and avoid operations that cause memory leaks.

    Segmentation faults

    Understand how segfaults happen, so you can fix them calmly.

    Double Free

    Understand the consequences of freeing the same block twice, and treat free with due respect.

    Memory fragmentation and performance

    Understand the allocation mechanism to avoid fragmentation and the performance cost of allocating and freeing.

    Analyzing performance bottlenecks

    A single statement in an application can cost the kernel thousands of instructions; as the saying goes, "one word from the boss, and the staff run their legs off." Plain read/write I/O copies the data twice, while mmap needs only one copy.

    • How do we deal with CPU instruction reordering?
    • How do we use the CPU cache efficiently?
    • How do we exploit CPU branch prediction (out-of-order execution)?
    • How can memory ordering replace locks, e.g. for lock-free queues and synchronization?

    Types of memory mappings

    Private anonymous mapping

    Not shared between processes; after fork, pages are copy-on-write.

        mmap(NULL, 1024, PROT_WRITE|PROT_READ, MAP_ANON|MAP_PRIVATE, -1, 0)   
    

    Shared anonymous mapping

    Shared across processes (a child inherits the same pages after fork, with no copy-on-write); commonly used for inter-process communication.

        mmap(NULL, 1024, PROT_WRITE|PROT_READ, MAP_ANON|MAP_SHARED, -1, 0)   
    

    Private file mapping

    Not shared between processes; the common example is mapping dynamic libraries' private segments.

        mmap(NULL, 1024, PROT_WRITE|PROT_READ, MAP_PRIVATE, fd, 0)   
    

    Shared file mapping

    Shared between processes; used for file-backed sharing and data transfer.

        mmap(NULL, 1024, PROT_WRITE|PROT_READ, MAP_SHARED, fd, 0)   
    

    To inspect a process's mappings:

        cat /proc/<pid>/maps   
    

    Process address space

    Overview

    In 32-bit Linux, the 3G~4G region of the virtual address space belongs to the kernel and is identical across all processes. From low to high addresses, the user portion holds the code (text) segment, the data segments (data, BSS), the heap, and the stack; between the heap and the stack lies the region where mmap creates memory mappings. The stack usually grows downward from high to low addresses, while the heap grows upward; the pointer at the top of the heap is the brk address. Within kernel space, part of the virtual address range is mapped linearly onto physical memory, typically 896M on 32-bit systems; this is called low memory. The remaining normally-allocatable memory is called high memory (there is also a DMA zone).
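    A small sketch makes this layout visible: print an address from each region and compare them against /proc/<pid>/maps (exact values vary run to run with ASLR):

        #include <cstdio>
        #include <cstdlib>
        #include <sys/mman.h>
        #include <unistd.h>

        int global_var = 1;                    // data segment

        static void code_fn() {}               // text segment

        int main() {
            int stack_var = 0;                 // stack
            void *heap_ptr = malloc(16);       // heap (brk region for small requests)
            void *map_ptr = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                 MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
            printf("pid  : %d\n", getpid());
            printf("text : %p\n", (void *)&code_fn);
            printf("data : %p\n", (void *)&global_var);
            printf("heap : %p\n", heap_ptr);
            printf("mmap : %p\n", map_ptr);    // lands between heap and stack
            printf("stack: %p\n", (void *)&stack_var);
            getchar();                         // pause: now run `cat /proc/<pid>/maps`
            return 0;
        }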

    Memory allocation

    As the address-space layout suggests, small user allocations extend the heap linearly; growing or shrinking that region is done through the brk system call. Large allocations, above 128K, are instead served by mmap in the region between the heap and the stack.
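    A sketch of the two paths, observable with strace (128K is glibc's default M_MMAP_THRESHOLD, tunable via mallopt; the behavior below assumes glibc defaults):

        #include <cstdio>
        #include <cstdlib>
        #include <unistd.h>

        int main() {
            void *brk_before = sbrk(0);       // current top of the heap
            void *small = malloc(100);        // small request: served by extending brk
            void *brk_after = sbrk(0);
            printf("brk: %p -> %p, small=%p\n", brk_before, brk_after, small);

            void *big = malloc(256 * 1024);   // > 128K: glibc switches to mmap
            printf("big=%p (outside the brk heap)\n", big);
            // Under strace, the first call shows brk() and the second shows mmap().
            free(small);
            free(big);
            return 0;
        }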

    How malloc works

    Reference: linux - How does glibc malloc work? - Reverse Engineering Stack Exchange

    How malloc organizes memory chunks

        Chunk header
        {
            chunk state: free or in use;
            pointer to the previous chunk;
            pointer to the next chunk;
            chunk size;
        }

    glibc malloc source: malloc.c source code [glibc/malloc/malloc.c] - Woboq Code Browser
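    For comparison with the pseudo-struct above, a simplified sketch of the real header, struct malloc_chunk in glibc's malloc/malloc.c (the fd/bk links are only meaningful while a chunk sits free in a bin; in-use chunks reuse that space for user data):

        #include <cstddef>
        #include <cstdio>

        /* Simplified sketch of glibc's struct malloc_chunk. */
        struct malloc_chunk {
            size_t prev_size;          /* size of the previous chunk, if it is free   */
            size_t size;               /* chunk size; the low bits hold status flags
                                          (PREV_INUSE, IS_MMAPPED, NON_MAIN_ARENA)    */
            struct malloc_chunk *fd;   /* forward link in the bin's free list         */
            struct malloc_chunk *bk;   /* backward link in the bin's free list        */
        };

        int main() {
            printf("header overhead: %zu bytes\n", sizeof(size_t) * 2);
            return 0;
        }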

    The allocator's hallmark is managing similarly-sized chunks through bins. ptmalloc has four kinds: fast bins, small bins, unsorted bins, and large bins. Adjacent bins are typically spaced 8 bytes apart in size, and each bin is a singly or doubly linked list. Fast bins generally hold chunks no larger than 64B, small bins chunks <= 512B, and large bins chunks > 512B and < 128K. Unsorted bins have no fixed size class; they are a temporary list that holds freshly freed or merged chunks. Requests above 128K are handled as mmapped chunks and are returned directly to the system on free.
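    To watch these paths from user space, glibc exposes malloc_stats() and mallinfo() in <malloc.h>; a quick sketch (allocation sizes chosen to touch different bins):

        #include <cstdlib>
        #include <malloc.h>

        int main() {
            void *a = malloc(32);          // fast/small bin territory
            void *b = malloc(600);         // small/large bin territory
            void *c = malloc(256 * 1024);  // above the mmap threshold
            malloc_stats();                // prints arena, in-use, and mmap statistics to stderr
            free(a); free(b); free(c);
            return 0;
        }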

    malloc hands out virtual addresses and manages virtual memory. So how is physical memory managed? Physical allocation involves the slab/slub/slob allocators, plus the well-known buddy system.

    The Buddy System

    The buddy system's smallest allocation unit is the page, and it allocates in power-of-two page orders. Two page blocks are buddies when (a code sketch of the rule follows this list):

    1. They are the same size and physically contiguous.
    2. For an order-n block, the first page frame number is a multiple of 2^n.
    3. After two adjacent order-n blocks merge into an order-(n+1) block, the first page frame number is a multiple of 2^(n+1).
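    In kernel terms, a block's buddy is found by flipping one bit of its page frame number; a user-space sketch of that rule (mirroring __find_buddy_pfn in the kernel's mm/internal.h):

        #include <cstdint>
        #include <cstdio>

        /* For an order-n block starting at page frame number pfn, the buddy is
           found by flipping bit n of the pfn -- the same XOR the kernel uses. */
        static uint64_t buddy_pfn(uint64_t pfn, unsigned order) {
            return pfn ^ (1ULL << order);
        }

        int main() {
            printf("buddy of pfn 8 at order 1: %llu\n",
                   (unsigned long long)buddy_pfn(8, 1));   // 10
            printf("buddy of pfn 10 at order 1: %llu\n",
                   (unsigned long long)buddy_pfn(10, 1));  // 8: they merge into order 2
            return 0;
        }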

    By default the largest allocation in Linux is an order-10 page block, configured in include/linux/mmzone.h.

    The buddy system was introduced mainly to combat external fragmentation. Where there is external fragmentation, there is also internal fragmentation: external fragmentation is free memory too scattered to be usable, while internal fragmentation is allocated memory that is only partially used. The slab allocator was introduced to tackle internal fragmentation.

    The Slab Allocator

    Slab caches memory by object. If an object is 512 bytes, for example, one page can hold 8 objects, and slab's job is to manage the allocation and release of those 8 objects. The memory slab manages lies in the linearly mapped region, where virtual and physical addresses have a linear relationship; on 32-bit Linux this region is typically 896M. Common kernel allocation interfaces such as kmalloc and kmem_cache_alloc are built on slab internally.
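    A sketch of the in-kernel usage (kernel-module C; my_obj, my_cache, and the 512-byte payload size are made-up names for illustration):

        #include <linux/module.h>
        #include <linux/slab.h>

        struct my_obj {                 /* hypothetical ~512-byte object */
            int id;
            char payload[508];
        };

        static struct kmem_cache *my_cache;

        static int __init my_init(void)
        {
            struct my_obj *obj;

            /* One cache serves all my_obj allocations; slab packs objects per page. */
            my_cache = kmem_cache_create("my_obj_cache", sizeof(struct my_obj),
                                         0, SLAB_HWCACHE_ALIGN, NULL);
            if (!my_cache)
                return -ENOMEM;

            obj = kmem_cache_alloc(my_cache, GFP_KERNEL);
            if (obj)
                kmem_cache_free(my_cache, obj);  /* returns to the cache, not the page allocator */
            return 0;
        }

        static void __exit my_exit(void)
        {
            kmem_cache_destroy(my_cache);
        }

        module_init(my_init);
        module_exit(my_exit);
        MODULE_LICENSE("GPL");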

    Mapping virtual addresses to physical addresses

    1. Why not map everything one-to-one? Not enough memory for the tables.

    2. Why multi-level page tables? To save memory.

    3. Why is there a TLB? It is a cache for translations.

    4. How does the TLB tell identical virtual addresses from different processes apart? Flush on context switch, or ASIDs.

    5. How do different processes access the shared kernel address space efficiently? The nG/G (non-global/global) flags.

    6. Why copy-on-write? To avoid copying large amounts of physical memory at fork time.

    7. Which is consulted first, the MMU or the cache? Either; it depends on whether the TLB sits before or after the cache, and each choice has trade-offs.

    8. How can such a small cache cover such a large address space?

    (set) = (memory address) / (line size) % (number of sets)

    (critical stride) = (number of sets) * (line size) = (total cache size) / (number of ways)
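    Plugging numbers into these formulas shows the effect. The sketch below assumes a 32KB, 8-way L1 cache with 64-byte lines (so 64 sets); addresses one critical stride apart always land in the same set:

        #include <cstdint>
        #include <cstdio>

        int main() {
            const uint64_t line_size = 64;                          // bytes per cache line
            const uint64_t ways      = 8;
            const uint64_t num_sets  = 32 * 1024 / (line_size * ways);  // 64 sets
            const uint64_t critical_stride = num_sets * line_size;      // 4096 bytes here

            for (int i = 0; i < 3; i++) {
                uint64_t addr = 0x1000 + i * critical_stride;
                uint64_t set = (addr / line_size) % num_sets;
                printf("addr 0x%llx -> set %llu\n",
                       (unsigned long long)addr, (unsigned long long)set);
            }
            // All three addresses map to set 0; more than 8 such hot addresses
            // evict each other even while the rest of the cache sits idle.
            printf("critical stride = %llu bytes\n", (unsigned long long)critical_stride);
            return 0;
        }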

    To inspect the virtual-to-physical mapping on Linux: https://github.com/jethrogb/ptdump

        git clone https://github.com/jethrogb/ptdump
        cd ptdump/dump/linux/
        make
        sudo insmod ptdump.ko
        ls /proc/page*

    Run the test app, the test_virtual_address example:

    • Run test_virtual_address, read father.log, and compare it with cat /proc/<pid>/maps: the requested memory has no physical address yet.
        #include <cstdio>
        #include <cstdlib>
        #include <cstdint>
        #include <fstream>
        #include <iostream>
        #include <iterator>
        #include <string>
        #include <sched.h>
        #include <unistd.h>
        #include <sys/mman.h>
        #include <sys/types.h>

        struct Page {
            uint64_t address;
            uint64_t entry[512];
        };

        uint64_t get_phy_address(uint64_t entry)
        {
            // Strip the NX bit (63) and the low 12 flag bits to get the physical frame.
            static const uint64_t mask = (1LL << 63) | ((1 << 12) - 1);
            return entry & ~mask;
        }

        bool writable(uint64_t entry)
        {
            return (entry & (1 << 1)) != 0;     // R/W bit
        }

        bool executable(uint64_t entry)
        {
            return (entry & (1LL << 63)) == 0;  // NX bit clear => executable
        }

        bool user_mode(uint64_t entry)
        {
            return (entry & (1LL << 2)) != 0;   // U/S bit
        }

        void print_entry(FILE *fp, int level, uint64_t entry, uint64_t virtual_address)
        {
            fprintf(fp, "%d\t0x%-16llx\t0x%-16llx\t%d\t%d\t%d\n", level,
                    (unsigned long long)get_phy_address(entry),
                    (unsigned long long)virtual_address,
                    writable(entry), executable(entry), user_mode(entry));
        }

        void dump(FILE *fp, const Page *&page, int level, uint64_t virtual_address)
        {
            const Page *curr_page = page++;
            for (int i = 0; i < 512; i++) {
                const uint64_t entry = curr_page->entry[i];
                const uint64_t child_virtual_address = (virtual_address << 9) | i;
                if (level > 0) {
                    if (entry & 1) {                // present bit
                        if (!(entry & (1 << 7))) {  // not a huge page: descend one level
                            dump(fp, page, level - 1, child_virtual_address);
                        } else {                    // huge-page leaf entry at this level
                            print_entry(fp, level, entry,
                                        child_virtual_address << (level * 9 + 12));
                        }
                    }
                } else {
                    if (entry) {
                        print_entry(fp, level, entry, child_virtual_address << 12);
                    }
                }
            }
        }

        void dump_table(FILE *fp)
        {
            std::ifstream ifs("/proc/page_table_3", std::ios::binary);
            if (!ifs) {
                return;
            }
            std::string content((std::istreambuf_iterator<char>(ifs)),
                                std::istreambuf_iterator<char>());
            const Page *page = (const Page *)&content[0];
            const Page *end_page = (const Page *)(&content[0] + content.length());
            dump(fp, page, 3, 0);
            std::cout << (const void *)end_page << "\t" << (const void *)page << std::endl;
            std::flush(std::cout);
        }

        int main(int argc, char **argv)
        {
            int cpu_index = 0;
            if (argc >= 3) cpu_index = atoi(argv[2]);
            cpu_set_t set;
            CPU_ZERO(&set);
            CPU_SET(cpu_index, &set);
            sched_setaffinity(0, sizeof(set), &set);
            const size_t asize = 1024 * 1024 * 8;
            bool hugetable = false;
            bool do_fork = false;
            bool do_write = false;
            if (argc >= 2) {
                switch (argv[1][0]) {
                case 'h': hugetable = true; do_write = true; break;
                case 'f': do_fork = true; do_write = true; break;
                case 'w': do_write = true; break;
                default: break;
                }
            }
            char *virtual_ptr = (char *)mmap(NULL, asize, PROT_READ | PROT_WRITE,
                                             MAP_ANONYMOUS | MAP_PRIVATE | (hugetable ? MAP_HUGETLB : 0),
                                             -1, 0);
            if (do_write) *virtual_ptr = '\0';
            FILE *fp = NULL;
            std::cout << "father pid: " << getpid() << std::endl;
            if (do_fork) {
                pid_t pid = fork();
                if (pid == 0) {
                    fp = fopen("child.log", "w");
                    std::cout << "child pid: " << getpid() << std::endl;
                } else {
                    fp = fopen("father.log", "w");
                }
            } else {
                fp = fopen("father.log", "w");
            }
            fprintf(fp, "pid=%d map address: %p\n", getpid(), virtual_ptr);
            dump_table(fp);
            fclose(fp);
            while (true) {
                usleep(10000);
            }
            return 0;
        }

    • Run test_virtual_address w, read father.log, and compare it with cat /proc/<pid>/maps: after the write, the requested memory now has a physical address.

      

    • Run test_virtual_address f, read father.log and child.log, and compare them with cat /proc/<pid>/maps: this shows copy-on-write at work. Note how the permissions differ.

      

    Comparing read speed: plain file read vs mmap file mapping

        // Assumed test harness: FILE_NAME names a large pre-created file (~100MB in the run below).
        #define FILE_NAME "test.dat"
        #include <cstdio>
        #include <cstdlib>
        #include <cstring>
        #include <chrono>
        #include <iostream>
        #include <fcntl.h>
        #include <sys/mman.h>
        #include <unistd.h>

        static void test_file_read()
        {
            FILE *fp = fopen(FILE_NAME, "rb+");
            char *buff = (char *)malloc(1024);
            int len = 0;
            size_t sum = 0;
            auto start = std::chrono::high_resolution_clock::now();
            for (int n = 0; n < 100; n++) {
                fseek(fp, 0, SEEK_SET);
                while ((len = fread(buff, 1, 1024, fp)) > 0) {
                    sum += len;
                }
            }
            auto end = std::chrono::high_resolution_clock::now();
            size_t use_time = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
            std::cout << "test_file_read elapsed milliseconds: " << use_time << std::endl;
            std::flush(std::cout);
            printf("size=%zu\n", sum);
            fclose(fp);
            free(buff);
        }

        static void test_mmap_read()
        {
            FILE *fp = fopen(FILE_NAME, "rb+");
            char *buff = (char *)malloc(1024);
            size_t sum = 0;
            int readsize;
            size_t fsize;
            size_t size;
            fseek(fp, 0, SEEK_END);
            fsize = ftell(fp);
            fseek(fp, 0, SEEK_SET);
            size = fsize;
            fclose(fp);
            int ffd = open(FILE_NAME, O_RDWR | O_CREAT, 0666);
            char *ptr = (char *)mmap(NULL, size, PROT_READ, MAP_SHARED, ffd, 0);
            close(ffd);
            char *readPtr = NULL;
            auto start = std::chrono::high_resolution_clock::now();
            for (int n = 0; n < 100; n++) {
                readPtr = ptr;
                size = fsize;
                while (size) {
                    readsize = size > 1024 ? 1024 : size;
                    memcpy(buff, readPtr, readsize);
                    size -= readsize;
                    readPtr += readsize;
                    sum += readsize;   // count what was actually copied
                }
            }
            auto end = std::chrono::high_resolution_clock::now();
            size_t use_time = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
            std::cout << "test_mmap_read elapsed milliseconds: " << use_time << std::endl;
            std::flush(std::cout);
            free(buff);
            munmap(ptr, fsize);        // unmap the mapping, not the malloc'd buffer
            printf("size=%zu\n", sum);
        }

    Run ./test_mmap f and ./test_mmap m:

    ~/test_perf$ ./show_perf.sh ./test_mmap f
    run ./test_mmap f
    test_file_read elapsed milliseconds: 2245
    size=10485760000
    Performance counter stats for './test_mmap f':
        74,743,510 cache-references # 36.454 M/sec
        35,690,848 cache-misses # 47.751 % of all cache refs
        8,925,687,804 instructions
        1,367,859,141 branches # 667.137 M/sec
        3,955,388 branch-misses # 0.29% of all branches
        117 faults # 0.057 K/sec
        219 context-switches # 0.107 K/sec
        2050.342526 task-clock (msec) # 0.909 CPUs utilized
        0 migrations # 0.000 K/sec
        2.256351930 seconds time elapsed
    ~/test_perf$ ./show_perf.sh ./test_mmap m
    run ./test_mmap m
    test_mmap_read elapsed milliseconds: 489
    size=10485760000
    Performance counter stats for './test_mmap m':
        62,951,973 cache-references # 127.169 M/sec
        18,601,851 cache-misses # 29.549 % of all cache refs
        1,502,833,265 instructions
        147,082,645 branches # 297.121 M/sec
        24,354 branch-misses # 0.02% of all branches
        1,718 faults # 0.003 M/sec
        8 context-switches # 0.016 K/sec
        495.025615 task-clock (msec) # 0.995 CPUs utilized
        0 migrations # 0.000 K/sec
        0.497508855 seconds time elapsed

     

     

    Improving performance with cache hits

    cache-misses

    • Small-struct traversal: test_cache_miss s
    • Big-struct traversal: test_cache_miss b
    • Array traversal: test_cache_miss a
    • vector traversal: test_cache_miss v
    • Row-order traversal: test_cache_miss a
    • Column-order traversal: test_cache_miss i
        #include <cstdio>
        #include <string>
        #include <vector>

        //#pragma pack(64)
        typedef struct {
            int cnt;
            std::string str;
            char resv[24];
        } algin_data_t __attribute__((aligned(64)));
        //#pragma pack()

        typedef struct {
            int cnt;
            std::string str;
        } small_data_t;

        typedef struct {
            int cnt;
            char str[128];
        } big_data_t;

        static void test_small_struct()
        {
            std::vector<small_data_t> dat(10000);
            int num = 0;
            for (int i = 0; i < 10000; i++) {
                for (auto &e : dat) {
                    e.cnt = num++;
                }
            }
        }

        static void test_algin_struct()
        {
            std::vector<algin_data_t> dat(10000);
            int num = 0;
            for (int i = 0; i < 10000; i++) {
                for (auto &e : dat) {
                    e.cnt = num++;
                }
            }
        }

        static void test_big_struct()
        {
            std::vector<big_data_t> dat(10000);
            int num = 0;
            for (int i = 0; i < 10000; i++) {
                for (auto &e : dat) {
                    e.cnt = num++;
                }
            }
        }

        static void test_vector_foreach()
        {
            std::vector<std::vector<int>> dat(10000, std::vector<int>(64, 0));
            int num = 0;
            for (int cnt = 0; cnt < 100; cnt++) {
                for (int i = 0; i < 10000; i++) {
                    for (int j = 0; j < 64; j++) {
                        dat[i][j] = num++;
                    }
                }
            }
        }

        static int dat_array[10000][64];

        static void test_array_foreach()
        {
            int num = 0;
            for (int cnt = 0; cnt < 100; cnt++) {
                for (int i = 0; i < 10000; i++) {
                    for (int j = 0; j < 64; j++) {
                        dat_array[i][j] = num++;   // row order: sequential in memory
                    }
                }
            }
        }

        static void test_array_inv_foreach()
        {
            int num = 0;
            for (int cnt = 0; cnt < 100; cnt++) {
                for (int i = 0; i < 64; i++) {
                    for (int j = 0; j < 10000; j++) {
                        dat_array[j][i] = num++;   // column order: strides across cache lines
                    }
                }
            }
        }

        int main(int argc, char **argv)
        {
            if (argc <= 1) {
                printf("exe b or exe s\n");
                return 0;
            }
            printf("small=%zu algin:%zu\n", sizeof(small_data_t), sizeof(algin_data_t));
            switch (argv[1][0]) {
            case 's': test_small_struct(); break;
            case 'b': test_big_struct(); break;
            case 'a': test_array_foreach(); break;
            case 'v': test_vector_foreach(); break;
            case 'i': test_array_inv_foreach(); break;
            case 'l': test_algin_struct(); break;
            default: break;
            }
            return 0;
        }
    ~/test_perf$ ./show_perf.sh ./test_branch_miss s
    run ./test_branch_miss s
    Performance counter stats for './test_branch_miss s':
        35,237 cache-references # 1.035 M/sec
        15,563 cache-misses # 44.167 % of all cache refs
        127,110,597 instructions
        31,639,756 branches # 929.747 M/sec
        67,240 branch-misses # 0.21% of all branches
        111 faults # 0.003 M/sec
        0 context-switches # 0.000 K/sec
        34.030495 task-clock (msec) # 0.969 CPUs utilized
        0 migrations # 0.000 K/sec
        0.035105508 seconds time elapsed
    ~/test_perf$ ./show_perf.sh ./test_branch_miss b
    run ./test_branch_miss b
    Performance counter stats for './test_branch_miss b':
        38,823 cache-references # 20.861 M/sec
        12,309 cache-misses # 31.705 % of all cache refs
        2,843,103 instructions
        485,745 branches # 261.009 M/sec
        15,449 branch-misses # 3.18% of all branches
        102 faults # 0.055 M/sec
        0 context-switches # 0.000 K/sec
        1.861031 task-clock (msec) # 0.679 CPUs utilized
        0 migrations # 0.000 K/sec
        0.002742285 seconds time elapsed
    ~/test_perf$ ./show_perf.sh ./test_branch_miss a
    run ./test_branch_miss a
    Performance counter stats for './test_branch_miss a':
        45,370 cache-references # 0.627 M/sec
        15,784 cache-misses # 34.790 % of all cache refs
        121,528,996 instructions
        30,681,381 branches # 423.988 M/sec
        4,005,835 branch-misses # 13.06% of all branches
        111 faults # 0.002 M/sec
        2 context-switches # 0.028 K/sec
        72.363755 task-clock (msec) # 0.975 CPUs utilized
        0 migrations # 0.000 K/sec
        0.074219125 seconds time elapsed
    ~/test_perf$ ./show_perf.sh ./test_branch_miss v
    run ./test_branch_miss v
    Performance counter stats for './test_branch_miss v':
        31,512 cache-references # 9.504 M/sec
        12,034 cache-misses # 38.189 % of all cache refs
        2,857,202 instructions
        488,248 branches # 147.256 M/sec
        14,946 branch-misses # 3.06% of all branches
        103 faults # 0.031 M/sec
        0 context-switches # 0.000 K/sec
        3.315638 task-clock (msec) # 0.678 CPUs utilized
        0 migrations # 0.000 K/sec
        0.004890918 seconds time elapsed
    ~/test_perf$ ./show_perf.sh ./test_branch_miss i
    run ./test_branch_miss i
    Performance counter stats for './test_branch_miss i':
        31,792 cache-references # 13.768 M/sec
        13,951 cache-misses # 43.882 % of all cache refs
        2,864,321 instructions
        489,395 branches # 211.938 M/sec
        14,898 branch-misses # 3.04% of all branches
        103 faults # 0.045 M/sec
        0 context-switches # 0.000 K/sec
        2.309145 task-clock (msec) # 0.704 CPUs utilized
        0 migrations # 0.000 K/sec
        0.003280504 seconds time elapsed
        #!/bin/sh
        echo "run $* "
        perf stat -B -e cache-references,cache-misses,instructions,branches,branch-misses,faults,context-switches,task-clock,migrations $*
    • Hot data packed tightly together: ./test_together
    • Hot data scattered apart: ./test_together
        #include <chrono>
        #include <cstdlib>
        #include <iostream>
        #include <sched.h>

        static const int array_size = 1024;

        struct Dat_One {
            int buff_one[array_size];     // the two hot fields live 4KB apart
            int buffer_two[array_size];
        };
        static Dat_One dat_one;

        struct Dat_Two {
            int buff_one;                 // the two hot fields sit side by side
            int buffer_two;
        };
        static Dat_Two dat_two[array_size];

        static int do_func(int dat)
        {
            static int sum = 0;
            sum++;
            return dat;
        }

        static void test1()
        {
            auto start = std::chrono::high_resolution_clock::now();
            for (int n = 0; n < 10000; n++)
                for (int j = 0; j < array_size; j++) {
                    dat_one.buffer_two[j] = do_func(dat_one.buff_one[j]);
                }
            auto end = std::chrono::high_resolution_clock::now();
            size_t use_time = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
            std::cout << "big struct : elapsed milliseconds: " << use_time << std::endl;
            std::flush(std::cout);
        }

        static void test2()
        {
            auto start = std::chrono::high_resolution_clock::now();
            for (int n = 0; n < 10000; n++)
                for (int j = 0; j < array_size; j++) {
                    dat_two[j].buffer_two = do_func(dat_two[j].buff_one);
                }
            auto end = std::chrono::high_resolution_clock::now();
            size_t use_time = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
            std::cout << "small struct : elapsed milliseconds: " << use_time << std::endl;
            std::flush(std::cout);
        }

        int main(int argc, char **argv)
        {
            int cpu_index = 0;
            if (argc >= 3) cpu_index = atoi(argv[2]);
            cpu_set_t set;
            CPU_ZERO(&set);
            CPU_SET(cpu_index, &set);
            sched_setaffinity(0, sizeof(set), &set);
            test1();
            test2();
            return 0;
        }

    • Adding unused padding data to reduce cache conflicts: test_Tmat 2 && test_Tmat 3
        #include <cassert>
        #include <chrono>
        #include <cstdlib>
        #include <iostream>
        #include <sched.h>

        //(critical stride) = (number of sets)*(line size) = (total cache size) / (number of ways).
        //(set) = (memory address) / (line size) % (number of sets)
        //cat /sys/devices/system/cpu/cpu0/cache/index0/ways_of_associativity
        //cat /sys/devices/system/cpu/cpu0/cache/index0/number_of_sets
        //answer : https://stackoverflow.com/questions/11413855/why-is-transposing-a-matrix-of-512x512-much-slower-than-transposing-a-matrix-of
        //sudo perf stat -e cache-misses ./test_Tmat 2
        #define MATSIZE_512 (4096UL)      // power-of-two row stride: pathological L1 cache misses
        #define MATSIZE_513 (4096UL+16)   // padded row stride avoids the set conflicts

        static void transpose_mat512()
        {
            const size_t alloc_size = MATSIZE_512 * MATSIZE_512;
            std::cout << alloc_size << std::endl;
            char *total_mem = (char *)malloc(alloc_size);
            assert(total_mem != nullptr);
            char (*mat_512)[MATSIZE_512] = (char (*)[MATSIZE_512])total_mem;
            auto start = std::chrono::high_resolution_clock::now();
            for (int n = 0; n < 10; n++)
                for (size_t i = 1; i < MATSIZE_512; i++)
                    for (size_t j = 0; j < i; j++) {
                        int temp = mat_512[i][j];
                        mat_512[i][j] = mat_512[j][i];
                        mat_512[j][i] = temp;
                    }
            auto end = std::chrono::high_resolution_clock::now();
            size_t use_time = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
            std::cout << "mat512x512 elapsed milliseconds: " << use_time << std::endl;
            std::flush(std::cout);
            free(total_mem);
        }

        static void transpose_mat513()
        {
            const size_t alloc_size = MATSIZE_513 * MATSIZE_513;
            char *total_mem = (char *)malloc(alloc_size);
            assert(total_mem != nullptr);
            char (*mat_513)[MATSIZE_513] = (char (*)[MATSIZE_513])total_mem;
            auto start = std::chrono::high_resolution_clock::now();
            for (int n = 0; n < 10; n++)
                for (size_t i = 1; i < MATSIZE_513; i++)
                    for (size_t j = 0; j < i; j++) {
                        int temp = mat_513[i][j];
                        mat_513[i][j] = mat_513[j][i];
                        mat_513[j][i] = temp;
                    }
            auto end = std::chrono::high_resolution_clock::now();
            size_t use_time = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
            std::cout << "mat513x513 elapsed milliseconds: " << use_time << std::endl;
            std::flush(std::cout);
            free(total_mem);
        }

        int main(int argc, char **argv)
        {
            int cpu_index = 0;
            if (argc < 2) return 0;
            if (argc >= 3) cpu_index = atoi(argv[2]);
            cpu_set_t set;
            CPU_ZERO(&set);
            CPU_SET(cpu_index, &set);
            sched_setaffinity(0, sizeof(set), &set);
            switch (argv[1][0]) {
            case '2': transpose_mat512(); break;
            case '3': transpose_mat513(); break;
            }
            return 0;
        }
    ~/test_perf$ ./show_perf.sh ./test_Tmat 2
    run ./test_Tmat 2
    16777216
    mat512x512 elapsed milliseconds: 2721
    Performance counter stats for './test_Tmat 2':
        99,403,455 cache-references # 36.646 M/sec
        55,481,043 cache-misses # 55.814 % of all cache refs
        4,075,121,270 instructions
        252,150,019 branches # 92.958 M/sec
        728,994 branch-misses # 0.29% of all branches
        8,297 faults # 0.003 M/sec
        20 context-switches # 0.007 K/sec
        2712.515229 task-clock (msec) # 0.995 CPUs utilized
        1 migrations # 0.000 K/sec
        2.725778361 seconds time elapsed
    ~/test_perf$ ./show_perf.sh ./test_Tmat 3
    run ./test_Tmat 3
    mat513x513 elapsed milliseconds: 816
    Performance counter stats for './test_Tmat 3':
        4,054,160 cache-references # 4.947 M/sec
        2,411,165 cache-misses # 59.474 % of all cache refs
        4,499,294,985 instructions
        172,637,350 branches # 210.638 M/sec
        73,871 branch-misses # 0.04% of all branches
        8,360 faults # 0.010 M/sec
        5 context-switches # 0.006 K/sec
        819.594449 task-clock (msec) # 0.996 CPUs utilized
        1 migrations # 0.001 K/sec
        0.822494459 seconds time elapsed

    branch-misses

    • Conditional writes over a random array: test_branch_miss a
    • The same loop after sorting the array: test_branch_miss s
    • Steering branch prediction with likely/unlikely: test_branch_miss n ; test_branch_miss l
        #include <algorithm>
        #include <cstdio>
        #include <cstdlib>

        #define likely(x) __builtin_expect(!!(x), 1)
        #define unlikely(x) __builtin_expect(!!(x), 0)

        static int dat_array[10000];

        static void test_array_foreach()
        {
            int num = 0;
            for (int i = 0; i < 10000; i++) {
                dat_array[i] = rand() % 512;
            }
            for (int cnt = 0; cnt < 1000; cnt++) {
                for (int i = 0; i < 10000; i++) {
                    if (dat_array[i] > 256) {   // unpredictable on random data
                        dat_array[i] = num++;
                    }
                }
            }
        }

        volatile int test_a;
        static inline void test_no_use()
        {
            test_a = 1000;
            test_a += 1;
            test_a *= 10;
            test_a /= 3;
            test_a <<= 1;
        }

        static void test_array_sort_foreach()
        {
            int num = 0;
            for (int i = 0; i < 10000; i++) {
                dat_array[i] = rand() % 512;
            }
            std::sort(dat_array, dat_array + 10000);  // sorted data makes the branch predictable
            for (int cnt = 0; cnt < 1000; cnt++) {
                for (int i = 0; i < 10000; i++) {
                    if (dat_array[i] > 256) {
                        dat_array[i] = num++;
                    }
                }
            }
        }

        static void test_nolikely_foreach()
        {
            int num = 0;
            for (int i = 0; i < 10000; i++) {
                dat_array[i] = rand() % 512;
            }
            for (int cnt = 0; cnt < 10000; cnt++) {
                for (int i = 0; i < 10000; i++) {
                    if (dat_array[i] > 50) {
                        dat_array[i] = num;
                        num += 2;
                        test_no_use();
                    } else {
                        dat_array[i] = num;
                        num++;
                    }
                }
            }
        }

        static void test_likely_foreach()
        {
            int num = 0;
            // Note: the original filled 100000 entries of this 10000-element array;
            // that out-of-bounds write is what produced the segfault in the log below.
            // The loop counts are also aligned with test_nolikely_foreach for a fair comparison.
            for (int i = 0; i < 10000; i++) {
                dat_array[i] = rand() % 512;
            }
            for (int cnt = 0; cnt < 10000; cnt++) {
                for (int i = 0; i < 10000; i++) {
                    if (likely(dat_array[i] > 50)) {
                        dat_array[i] = num;
                        num += 2;
                        test_no_use();
                    } else {
                        dat_array[i] = num;
                        num++;
                    }
                }
            }
        }

        int main(int argc, char **argv)
        {
            if (argc <= 1) {
                printf("exe b or exe s\n");
                return 0;
            }
            switch (argv[1][0]) {
            case 's': test_array_sort_foreach(); break;
            case 'a': test_array_foreach(); break;
            case 'n': test_nolikely_foreach(); break;
            case 'l': test_likely_foreach(); break;
            default: break;
            }
            return 0;
        }
    ~/test_perf$ ./show_perf.sh ./test_branch_miss a
    run ./test_branch_miss a
    Performance counter stats for './test_branch_miss a':
        37,087 cache-references # 0.527 M/sec
        13,650 cache-misses # 36.805 % of all cache refs
        121,502,626 instructions
        30,676,994 branches # 435.643 M/sec
        4,005,535 branch-misses # 13.06% of all branches
        110 faults # 0.002 M/sec
        0 context-switches # 0.000 K/sec
        70.417747 task-clock (msec) # 0.979 CPUs utilized
        0 migrations # 0.000 K/sec
        0.071961280 seconds time elapsed
    ~/test_perf$ ./show_perf.sh ./test_branch_miss s
    run ./test_branch_miss s
    Performance counter stats for './test_branch_miss s':
        34,839 cache-references # 0.856 M/sec
        15,265 cache-misses # 43.816 % of all cache refs
        127,107,931 instructions
        31,638,325 branches # 777.609 M/sec
        67,152 branch-misses # 0.21% of all branches
        112 faults # 0.003 M/sec
        0 context-switches # 0.000 K/sec
        40.686661 task-clock (msec) # 0.966 CPUs utilized
        0 migrations # 0.000 K/sec
        0.042110582 seconds time elapsed
    ~/test_perf$ ./show_perf.sh ./test_branch_miss n
    run ./test_branch_miss n
    Performance counter stats for './test_branch_miss n':
        46,054 cache-references # 0.111 M/sec
        20,105 cache-misses # 43.655 % of all cache refs
        4,304,097,649 instructions
        600,785,471 branches # 1452.078 M/sec
        34,768 branch-misses # 0.01% of all branches
        110 faults # 0.266 K/sec
        7 context-switches # 0.017 K/sec
        413.741881 task-clock (msec) # 0.994 CPUs utilized
        0 migrations # 0.000 K/sec
        0.416110095 seconds time elapsed
    ~/test_perf$ ./show_perf.sh ./test_branch_miss l
    run ./test_branch_miss l
    ./test_branch_miss: Segmentation fault
    Performance counter stats for './test_branch_miss l':
        41,754 cache-references # 9.202 M/sec
        15,556 cache-misses # 37.256 % of all cache refs
        3,663,840 instructions
        688,072 branches # 151.649 M/sec
        16,502 branch-misses # 2.40% of all branches
        109 faults # 0.024 M/sec
        2 context-switches # 0.441 K/sec
        4.537274 task-clock (msec) # 0.035 CPUs utilized
        0 migrations # 0.000 K/sec
        0.129464292 seconds time elapsed

    Exploiting CPU out-of-order execution and memory/instruction reordering

    • CPU out-of-order execution (loops without vs with data dependencies): test_reorder 4 ; test_reorder 5
    • What underpins volatile ordering: mfence
    • How atomic operations work: see the CAS sketch after this list
    • Multithreading bugs caused by CPU instruction reordering: test_reorder 1 ; test_reorder 2
    • Lock-free programming pitfalls caused by memory reordering
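    For the atomic-operations bullet, a minimal sketch of how an atomic increment can be built from compare-and-swap, the primitive behind std::atomic's read-modify-write operations (lock cmpxchg on x86):

        #include <atomic>
        #include <cstdio>
        #include <thread>

        static std::atomic<int> counter{0};

        // Hand-rolled increment via CAS: retry until no other thread
        // changed the value between our load and our store.
        static void cas_increment() {
            int expected = counter.load(std::memory_order_relaxed);
            while (!counter.compare_exchange_weak(expected, expected + 1,
                                                  std::memory_order_acq_rel,
                                                  std::memory_order_relaxed)) {
                // On failure, expected was refreshed with the current value; just retry.
            }
        }

        int main() {
            std::thread a([] { for (int i = 0; i < 100000; i++) cas_increment(); });
            std::thread b([] { for (int i = 0; i < 100000; i++) cas_increment(); });
            a.join();
            b.join();
            printf("counter=%d\n", counter.load());   // always 200000
            return 0;
        }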
        #include <atomic>
        #include <cassert>
        #include <chrono>
        #include <cstdio>
        #include <cstdlib>
        #include <iostream>
        #include <semaphore.h>
        #include <string>
        #include <thread>

        #define barrier() __asm__ __volatile__("": : :"memory")
        #define mfence() __asm__ __volatile__("mfence": : :"memory")
        //https://en.cppreference.com/w/cpp/atomic/memory_order#Release-Consume_ordering
        static sem_t sem[2];
        static int X, Y;
        static int r1, r2;

        static void t1_run(void)
        {
            sem_wait(&sem[0]);
            X = 1;
            //mfence();         // uncomment to forbid the store/load reordering
            r1 = Y;
        }

        static void t2_run(void)
        {
            sem_wait(&sem[1]);
            Y = 1;
            //mfence();
            r2 = X;
        }

        static void test_order_once()
        {
            static int num = 0;
            static int round = 0;
            round++;
            X = 0;
            Y = 0;
            std::thread t1(t1_run);
            std::thread t2(t2_run);
            sem_post(&sem[0]);
            sem_post(&sem[1]);
            t1.join();
            t2.join();
            if ((r1 == 0) && (r2 == 0)) {   // both loads passed their stores: reordering observed
                num++;
                printf("reorder: %d, iterator: %d\n", num, round);
            }
        }

        //----------------------------------------------------------------------
        static std::atomic<int> aX;
        static std::atomic<int> aY;

        static void load_aY_to_aX()
        {
            sem_wait(&sem[0]);
            r1 = aY.load(std::memory_order_relaxed);
            aX.store(r1, std::memory_order_relaxed);
        }

        static void load_aX_then_store_aY()
        {
            sem_wait(&sem[1]);
            r2 = aX.load(std::memory_order_relaxed);
            aY.store(42, std::memory_order_relaxed);
        }

        static void test_order_relaxed_once()
        {
            static int num = 0;
            static int round = 0;
            round++;
            aX = 0;
            aY = 0;
            std::thread t1(load_aY_to_aX);
            std::thread t2(load_aX_then_store_aY);
            sem_post(&sem[0]);
            sem_post(&sem[1]);
            t1.join();
            t2.join();
            if ((r1 == 42) && (r2 == 42)) {
                printf("reorder: %d, iterator: %d\n", ++num, round);
            }
        }

        //----------------------------------------------------------------------
        static std::atomic<std::string*> ptr;
        static int data;

        static void producer()
        {
            std::string *p = new std::string("Hello");
            data = 42;
            ptr.store(p, std::memory_order_release);
        }

        static void consumer()
        {
            std::string *p2;
            while (!(p2 = ptr.load(std::memory_order_acquire)));
            assert(*p2 == "Hello"); // never fires
            assert(data == 42);     // never fires
        }

        static void relese_acquire_test()
        {
            std::thread t1(producer);
            std::thread t2(consumer);
            t1.join();
            t2.join();
        }

        static void test_auto_add()
        {
            const int size = 1000;
            float list[size] = {0}, sum = 0;
            int i;
            auto start = std::chrono::high_resolution_clock::now();
            for (int n = 0; n < 100000; n++)
                for (i = 0; i < size; i++) sum += list[i];   // one long dependency chain on sum
            auto end = std::chrono::high_resolution_clock::now();
            size_t use_time = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
            std::cout << "test_auto_add elapsed milliseconds: " << use_time << std::endl;
            std::flush(std::cout);
        }

        static void test_dependency_add()
        {
            const int size = 1000;
            float list[size] = {0}, sum1 = 0, sum2 = 0;
            int i;
            auto start = std::chrono::high_resolution_clock::now();
            for (int n = 0; n < 100000; n++) {
                for (i = 0; i < size; i += 2) {   // two independent chains run in parallel
                    sum1 += list[i];
                    sum2 += list[i + 1];
                }
                sum1 += sum2;
            }
            auto end = std::chrono::high_resolution_clock::now();
            size_t use_time = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
            std::cout << "test_auto_add elapsed milliseconds: " << use_time << std::endl;  // label kept as in the original log below
            std::flush(std::cout);
        }

        int main(int argc, char **argv)
        {
            if (argc < 2) return 0;
            sem_init(&sem[0], 0, 0);
            sem_init(&sem[1], 0, 0);
            switch (argv[1][0]) {
            case '1': while (true) test_order_once(); break;
            case '2': while (true) test_order_relaxed_once(); break;
            case '3': while (true) relese_acquire_test(); break;
            case '4': test_auto_add(); break;
            case '5': test_dependency_add(); break;
            }
            return 0;
        }
    ~/test_perf$ ./show_perf.sh ./test_reorder 4
    run ./test_reorder 4
    test_auto_add elapsed milliseconds: 310
    Performance counter stats for './test_reorder 4':
        41,274 cache-references # 0.132 M/sec
        21,390 cache-misses # 51.824 % of all cache refs
        1,004,169,232 instructions
        200,905,980 branches # 640.344 M/sec
        118,085 branch-misses # 0.06% of all branches
        113 faults # 0.360 K/sec
        0 context-switches # 0.000 K/sec
        313.746779 task-clock (msec) # 0.995 CPUs utilized
        0 migrations # 0.000 K/sec
        0.315352555 seconds time elapsed
    ~/test_perf$ ./show_perf.sh ./test_reorder 5
    run ./test_reorder 5
    test_auto_add elapsed milliseconds: 159
    Performance counter stats for './test_reorder 5':
        38,816 cache-references # 0.238 M/sec
        18,949 cache-misses # 48.817 % of all cache refs
        854,311,607 instructions
        100,877,665 branches # 617.780 M/sec
        117,673 branch-misses # 0.12% of all branches
        113 faults # 0.692 K/sec
        0 context-switches # 0.000 K/sec
        163.290585 task-clock (msec) # 0.992 CPUs utilized
        0 migrations # 0.000 K/sec
        0.164629185 seconds time elapsed

    As for why these experiments turn out the way they do: explanations to follow later, off to work for now~~~~~~

  • Original post: https://blog.csdn.net/sdewenking/article/details/126655606