• 解读OOM killer机制输出的日志


    1.概述

    • 在业务繁忙的服务器,小内存方案,内存泄露的场景中,我们常常发现系统在非常大的内存压力情况下触发了OOM killer机制。OOM killer机制是内存管理中在资源极端短缺情况下一种迫不得已的进程终止机制。OOM killer机制会根据算法选择并终止占用内存资源比较多的进程,以释放内存。

    2.输出日志

     1 [20220707-022639][23457.059849] ptz_task invoked oom-killer: gfp_mask=0x24200ca(GFP_HIGHUSER_MOVABLE), nodemask=0, order=0, oom_score_adj=0
     2 [20220707-022639][23457.079399] CPU: 0 PID: 732 Comm: ptz_task Not tainted 4.9.191 #4
     3 [20220707-022639][23457.104264] Hardware name: sun8iw21
     4 [20220707-022639][23457.108272] Function entered at [<c0010b65>] from [<c000f42f>]
     5 [20220707-022639][23457.116676] Function entered at [<c000f42f>] from [<c007353d>]
     6 [20220707-022639][23457.123596] Function entered at [<c007353d>] from [<c0050373>]
     7 [20220707-022639][23457.135473] Function entered at [<c0050373>] from [<c005095f>]
     8 [20220707-022639][23457.142038] Function entered at [<c005095f>] from [<c0052ec3>]
     9 [20220707-022639][23457.154611] Function entered at [<c0052ec3>] from [<c004d39d>]
    10 [20220707-022639][23457.161183] Function entered at [<c004d39d>] from [<c004e217>]
    11 [20220707-022639][23457.177588] Function entered at [<c004e217>] from [<c0063f73>]
    12 [20220707-022639][23457.184416] Function entered at [<c0063f73>] from [<c0065fab>]
    13 [20220707-022639][23457.191060] Function entered at [<c0065fab>] from [<c00132a1>]
    14 [20220707-022639][23457.213017] Function entered at [<c00132a1>] from [<c0009255>]
    15 [20220707-022639][23457.230228] Function entered at [<c0009255>] from [<c00100fb>]
    16 [20220707-022639][23457.239501] Exception stack(0xc12a3fb0 to 0xc12a3ff8)
    17 [20220707-022639][23457.254107] 3fa0:                                     00000000 00000000 00000000 00000000
    18 [20220707-022639][23457.263337] 3fc0: 00000000 00000000 00000000 00000000 00635a4c 00635b64 00000000 00000000
    19 [20220707-022639][23457.292522] 3fe0: b54d7c40 b54d7cb8 00080810 0007f304 00000010 ffffffff
    20 [20220707-022639][23457.305213] Mem-Info:
    21 [20220707-022639][23457.307783] active_anon:4384 inactive_anon:4 isolated_anon:0
    22 [20220707-022639][23457.307783]  active_file:662 inactive_file:732 isolated_file:64
    23 [20220707-022639][23457.307783]  unevictable:9 dirty:0 writeback:0 unstable:0
    24 [20220707-022639][23457.307783]  slab_reclaimable:253 slab_unreclaimable:1373
    25 [20220707-022639][23457.307783]  mapped:227 shmem:86 pagetables:79 bounce:0
    26 [20220707-022639][23457.307783]  free:127 free_pcp:2 free_cma:0
    27 [20220707-022639][23457.383542] Node 0 active_anon:17536kB inactive_anon:16kB active_file:2648kBinactive_file:2876kB unevictable:36kB isolated(anon):0kB isolated(file):308kB mapped:908kB dirty:0kB writeback:0kB shmem:344kB writeback_tmp:0kB unstable:0kB pages_scanned:56785 all_unreclaimable? no
    28 [20220707-022639][23457.430530] Normal free:508kB min:512kB low:1124kB high:1252kB active_anon:17536kB inactive_anon:16kB active_file:2648kB inactive_file:2928kB unevictable:36kB writepending:0kB present:65536kBmanaged:61480kB mlocked:36kB slab_reclaimable:1012kB slab_unreclaimable:5492 kB kernel_stack:1040kB pagetables:316kB bounce:0kB free_pcp:8kB local_pcp:8kB free_cma:0kB
    29 [20220707-022639][23457.496173] lowmem_reserve[]: 0 0
    30 [20220707-022639][23457.500030] Normal: 109*4kB (UME) 9*8kB (U) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 508kB
    31 [20220707-022639][23457.524995] 1544 total pagecache pages
    32 [20220707-022639][23457.529288] 16384 pages RAM
    33 [20220707-022639][23457.532556] 0 pages HighMem/MovableOnly
    34 [20220707-022639][23457.547845] 1014 pages reserved
    35 [20220707-022639][23457.551958] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
    36 [20220707-022639][23457.577207] [  632]     0   632      254       41       3       0        0             0 telnetd
    37 [20220707-022639][23457.587485] [  633]     0   633      232       43       3       0        0             0 adbd
    38 [20220707-022639][23457.607393] [  657]     0   657      585       72       4       0        0             0 wpa_supplicant
    39 [20220707-022639][23457.628416] [  660]     0   660      254       36       3       0        0             0 telnetd
    40 [20220707-022639][23457.641304] [  673]     0   673      244       48       3       0        0             0 sk_srv_syscall
    41 [20220707-022639][23457.662083] [  676]     0   676    17203     4349      35       0        0             0 skyapp
    42 [20220707-022639][23457.682194] [  677]     0   677      254       41       3       0        0             0 exe
    43 [20220707-022639][23457.698911] [  899]     0   899     1056       97       6       0        0             0 skmongoose
    44 [20220707-022639][23457.712344] [  900]     0   900      254       41       3       0        0             0 clearCached.sh
    45 [20220707-022639][23457.733172] [  921]     0   921      254       33       3       0        0             0 udhcpc
    46 [20220707-022639][23457.752165] [ 1210]     0  1210      254       42       3       0        0             0 ash
    47 [20220707-022639][23457.762516] [11670]     0 11670      254       41       3       0        0             0 sleep
    48 [20220707-022639][23457.773158] [12021]     0 12021      228       24       3       0        0             0 sleep
    49 [20220707-022639][23457.788033] [12023]     0 12023      244       48       3       0        0             0 sk_srv_syscall
    50 [20220707-022639][23457.819352] Out of memory: Kill process 676 (skyapp) score 276 or sacrifice child
    51 [20220707-022639][23457.838273] Killed process 676 (skyapp) total-vm:68812kB, anon-rss:16536kB, file-rss:820kB, shmem-rss:40kB
    52 [20220707-022639][23457.882911] oom_reaper: reaped process 676 (skyapp), now anon-rss:0kB, file-rss:0kB, shmem-rss:40kB
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18
    • 19
    • 20
    • 21
    • 22
    • 23
    • 24
    • 25
    • 26
    • 27
    • 28
    • 29
    • 30
    • 31
    • 32
    • 33
    • 34
    • 35
    • 36
    • 37
    • 38
    • 39
    • 40
    • 41
    • 42
    • 43
    • 44
    • 45
    • 46
    • 47
    • 48
    • 49
    • 50
    • 51
    • 52

    2.1 OOM killer 日志片段1(输出进程请求页面信息&调用栈)

     1 [20220707-022639][23457.059849] ptz_task invoked oom-killer: gfp_mask=0x24200ca(GFP_HIGHUSER_MOVABLE), nodemask=0, order=0, oom_score_adj=0
     2 [20220707-022639][23457.079399] CPU: 0 PID: 732 Comm: ptz_task Not tainted 4.9.191 #4
     3 [20220707-022639][23457.104264] Hardware name: sun8iw21
     4 [20220707-022639][23457.108272] Function entered at [<c0010b65>] from [<c000f42f>]
     5 [20220707-022639][23457.116676] Function entered at [<c000f42f>] from [<c007353d>]
     6 [20220707-022639][23457.123596] Function entered at [<c007353d>] from [<c0050373>]
     7 [20220707-022639][23457.135473] Function entered at [<c0050373>] from [<c005095f>]
     8 [20220707-022639][23457.142038] Function entered at [<c005095f>] from [<c0052ec3>]
     9 [20220707-022639][23457.154611] Function entered at [<c0052ec3>] from [<c004d39d>]
    10 [20220707-022639][23457.161183] Function entered at [<c004d39d>] from [<c004e217>]
    11 [20220707-022639][23457.177588] Function entered at [<c004e217>] from [<c0063f73>]
    12 [20220707-022639][23457.184416] Function entered at [<c0063f73>] from [<c0065fab>]
    13 [20220707-022639][23457.191060] Function entered at [<c0065fab>] from [<c00132a1>]
    14 [20220707-022639][23457.213017] Function entered at [<c00132a1>] from [<c0009255>]
    15 [20220707-022639][23457.230228] Function entered at [<c0009255>] from [<c00100fb>]
    16 [20220707-022639][23457.239501] Exception stack(0xc12a3fb0 to 0xc12a3ff8)
    17 [20220707-022639][23457.254107] 3fa0:                                     00000000 00000000 00000000 00000000
    18 [20220707-022639][23457.263337] 3fc0: 00000000 00000000 00000000 00000000 00635a4c 00635b64 00000000 00000000
    19 [20220707-022639][23457.292522] 3fe0: b54d7c40 b54d7cb8 00080810 0007f304 00000010 ffffffff
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18
    • 19
    • 第1行,说明触发进程为ptz_task,它在访问某个文件时触发了缺页异常。在缺页异常处理过程中,调用alloc_pages()函数分配物理内存,又因为内存短缺而触发OOM Killer机制,其分配的掩码是GFP_HIGHUSER_MOVABLE,请求分配的内存为1个页面,即order=0。进程的OOM等于0即oom_score_adj=0。
    • 第2~19行,触发内核函数调用栈,信息包括:
      • 第2~3行,运行在CPU 0,PID=732,进程名为ptz_task,的进程触发,内核版本号为:4.9.191。硬件名:sun8iw21。
      • 第4~15行,内核函数调用栈信息,注意需要打开CONFIG_KALLSYMS才能看到函数名,不然只能看到地址。
      • 第16~19行,输出异常时的堆栈信息。

    2.2 OOM killer 日志片段2(输出系统内存总体信息)

    20 [20220707-022639][23457.305213] Mem-Info:
    21 [20220707-022639][23457.307783] active_anon:4384 inactive_anon:4 isolated_anon:0
    22 [20220707-022639][23457.307783]  active_file:662 inactive_file:732 isolated_file:64
    23 [20220707-022639][23457.307783]  unevictable:9 dirty:0 writeback:0 unstable:0
    24 [20220707-022639][23457.307783]  slab_reclaimable:253 slab_unreclaimable:1373
    25 [20220707-022639][23457.307783]  mapped:227 shmem:86 pagetables:79 bounce:0
    26 [20220707-022639][23457.307783]  free:127 free_pcp:2 free_cma:0
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 第20行,“Mem-Info:”表示将要输出系统内存的总体信息。在show_mem()——>show_free_areas()实现。
    • 第21行,活跃匿名页面数量(active_anon:4384),不活跃匿名页面数量(inactive_anon:4),处于临时分离状态的匿名页面数量(isolated_anon:0)
    • 第22行,活跃文件映射页面数量(active_file:662),不活跃文件映射页面数量(inactive_file:732),处于临时分离状态的文件映射页面数量(isolated_file:64)
    • 第23行,不可回收的页面数量(unevictable:9),文件映射的脏页数量(dirty:0),正在会写的页面数量(writeback:0)NFS中不稳定的页面数量(unstable:0)
    • 第24行,可回收的slab页面数量(slab_reclaimable:253),不可回收slab页面数量(slab_unreclaimable:1373)
    • 第25行,映射到用户进程地址空间的文件映射页面数量(mapped:227),共享内存页面数量(shmem:86),用于页表的页面数量(pagetables:79),跳跃的页面数量(bounce:0)
    • 第26行,空闲页面数量(free:127),用于PCP机制的页面数量(free_pcp:2),用于CMA机制的空闲页面数量(free_cma:0)

    2.3 OOM killer 日志片段3(输出内存节点的相关信息)

    27 [20220707-022639][23457.383542] Node 0 active_anon:17536kB inactive_anon:16kB active_file:2648kBinactive_file:2876kB unevictable:36kB isolated(anon):0kB isolated(file):308kB mapped:908kB dirty:0kB writeback:0kB shmem:344kB writeback_tmp:0kB unstable:0kB pages_scanned:56785 all_unreclaimable? no
    
    • 1
    • 第27行,与第21~26行对应只是单位是page,而27行单位是kB。

    2.4 OOM killer 日志片段4(输出内存管理区的相关信息)

    28 [20220707-022639][23457.430530] Normal free:508kB min:512kB low:1124kB high:1252kB active_anon:17536kB inactive_anon:16kB active_file:2648kB inactive_file:2928kB unevictable:36kB writepending:0kB present:65536kBmanaged:61480kB mlocked:36kB slab_reclaimable:1012kB slab_unreclaimable:5492 kB kernel_stack:1040kB pagetables:316kB bounce:0kB free_pcp:8kB local_pcp:8kB free_cma:0kB
    29 [20220707-022639][23457.496173] lowmem_reserve[]: 0 0
    30 [20220707-022639][23457.500030] Normal: 109*4kB (UME) 9*8kB (U) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 508kB
    
    • 1
    • 2
    • 3
    • 第28行,输出内存管理区中的空闲页面数量

      • 输出ZONE_NORMAL的空闲页面大小:free:508kB。
      • 输出内存管理区中水位情况,包括最低警戒水位、低水位以及高水位情况。上述日志信息里显示ZONE_NORMAL的最低警戒水位(min:512kB),低水位(low:1124kB),高水位(high:1252kB)。显然内存管理区的当前水位已经低于最低警戒水位了,因此一个页面也分配不出来,从而触发OOM Killer机制。
      • 输出内存管理区中相关的内存信息(active_anon:17536kB inactive_anon:16kB active_file:2648kB inactive_file:2928kB unevictable:36kB writepending:0kB present:65536kBmanaged:61480kB mlocked:36kB)。这里不赘述。
    • 第29行,输出内存管理区中的lowmem_reserve成员的值(lowmem_reserve[]: 0 0),它用于保证低端内存管理区的内存不会被进程恶意占用。

    • 第30行,输出内存管理区中每个链表中的迁移类型情况,如“109*4kB (UME)”表示order为0的空闲链表中,共109个成员。累加后总大小508kB。

      • U表示不可迁移类型(MIGRATE_UNMOVABLE),页框中的内容不可移动。内核分配的大部分内存属于这种类型。
      • M表示可迁移类型(MIGRATE_MOVABLE),页框中的内容可移动。用户空间分配的内存大部分属于这种类型。移动页框的方法为将老页框中的内容拷贝到新页框中,然后将页表的pte entry指向新页框地址即可。
      • E表示可回收类型(MIGRATE_RECLAIMABLE),这种页框一般用于文件缓存,它们虽然不能被移动,但可以被回收,当再次需要时可通过源文件重建。
      • C表示(MIGRATE_CMA),这种内存被预留给驱动使用,当驱动未使用时,伙伴系统可以将其分配给用户进程使用。但驱动需要时,就需要用户进程通过回收或迁移的方式返还它们,以供驱动使用,因此它必须是可迁移或可回收的。
      • H表示(MIGRATE_HIGHATOMIC),它是为那些不允许内存分配失败情形预留的紧急内存, 只有当gfp_flags为__GFP_ATOMIC时,才可能会从该内存池中分配。
      • I表示(MIGRATE_ISOLATE),它用于numa节点之间移动内存,从而将内存迁移到访问它最频繁cpu的node上。

    2.5 OOM killer 日志片段5(输出总体内存信息)

    31 [20220707-022639][23457.524995] 1544 total pagecache pages
    32 [20220707-022639][23457.529288] 16384 pages RAM
    33 [20220707-022639][23457.532556] 0 pages HighMem/MovableOnly
    34 [20220707-022639][23457.547845] 1014 pages reserved
    
    • 1
    • 2
    • 3
    • 4
    • 第31行,内存缓存(pagecache)页面数量为1544。
    • 第32行,系统物理内存一共有16384个页面,共占用64MB内存。
    • 第33行,高端内存0个页面。
    • 第34行,系统预留页面:1014,共4056kB。

    2.6 OOM killer 日志片段6(输出可选的进程的相关信息)

    35 [20220707-022639][23457.551958] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
    36 [20220707-022639][23457.577207] [  632]     0   632      254       41       3       0        0             0 telnetd
    37 [20220707-022639][23457.587485] [  633]     0   633      232       43       3       0        0             0 adbd
    38 [20220707-022639][23457.607393] [  657]     0   657      585       72       4       0        0             0 wpa_supplicant
    39 [20220707-022639][23457.628416] [  660]     0   660      254       36       3       0        0             0 telnetd
    40 [20220707-022639][23457.641304] [  673]     0   673      244       48       3       0        0             0 sk_srv_syscall
    41 [20220707-022639][23457.662083] [  676]     0   676    17203     4349      35       0        0             0 skyapp
    42 [20220707-022639][23457.682194] [  677]     0   677      254       41       3       0        0             0 exe
    43 [20220707-022639][23457.698911] [  899]     0   899     1056       97       6       0        0             0 skmongoose
    44 [20220707-022639][23457.712344] [  900]     0   900      254       41       3       0        0             0 clearCached.sh
    45 [20220707-022639][23457.733172] [  921]     0   921      254       33       3       0        0             0 udhcpc
    46 [20220707-022639][23457.752165] [ 1210]     0  1210      254       42       3       0        0             0 ash
    47 [20220707-022639][23457.762516] [11670]     0 11670      254       41       3       0        0             0 sleep
    48 [20220707-022639][23457.773158] [12021]     0 12021      228       24       3       0        0             0 sleep
    49 [20220707-022639][23457.788033] [12023]     0 12023      244       48       3       0        0             0 sk_srv_syscall
    50 [20220707-022639][23457.819352] Out of memory: Kill process 676 (skyapp) score 276 or sacrifice child
    51 [20220707-022639][23457.838273] Killed process 676 (skyapp) total-vm:68812kB, anon-rss:16536kB, file-rss:820kB, shmem-rss:40kB
    52 [20220707-022639][23457.882911] oom_reaper: reaped process 676 (skyapp), now anon-rss:0kB, file-rss:0kB, shmem-rss:40kB
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18
    • 第35~49行,输出系统所欲可选的进程的相关信息

      • pid:进程的ID。
      • uid:进程的用户ID。
      • tgid:进程组中主线程ID。
      • tolal_vm:进程占用的虚拟内存。
      • rss: 进程占用的物理页面的数量,这里会统计进程占用的匿名页面、文件映射页面以及共享内存页面。以page为单位。
      • nr_ptes:页表条目。
      • nr_pmds:页中级目录条目。
      • Swapents:占用的交换分区的大小。
      • oom_score_adj:OOM分数。
      • name:进程名字。
    • 第50行,OOM Killer机制会选择占用物理内存最多的进程作为候选者,通过比较发现PID为676的进程(skyapp进程)为最佳选择,因此输出“Out of memory: Kill process 676 (skyapp) score 276 or sacrifice child”。

    • 第51~52行,比较候选者中的线程,选择一个占用内存最多的线程作为最终候选者,通过比较发现wicd-monitor线程(PID为676)占用内存最多,因此最终选择该线程来终止。

  • 相关阅读:
    vuepress-plugin-comment -use Valine
    深入理解Linux内核select多路复用原理
    部署之缓存问题
    Vue3 —— 常用 Composition API(一)(computed函数、watch函数、watchEffect函数、生命周期钩子)
    mysql主从&shardingsphere分库分表
    负载均衡四层和七层的区别
    axios封装(例如:请求头、token、超时、BaseUrl、请求错误、请求重复)
    基于食肉植物优化算法的线性规划问题求解matlab程序
    14. Longest Common Prefix
    Linux网络编程- ioctl()结合struct ifreq使用案例
  • 原文地址:https://blog.csdn.net/weixin_39247141/article/details/126304526