A Deep Dive into the Linux Kernel Architecture: the Slab Allocator


    Introduction: malloc should be familiar to everyone — it is the library function that lets us request memory of an arbitrary size. The buddy system introduced earlier hands out memory only in whole pages, so requests for small blocks (less than a page) need extra allocator code on top of it. In user space this gave rise to excellent allocator libraries such as ptmalloc (glibc), tcmalloc (Google) and jemalloc (Facebook). The kernel cannot use these libraries, yet it also has to allocate huge numbers of small objects — dentry, inode, fs_struct, page, task_struct and so on. For this the kernel provides the slab allocator, which manages small-block allocations and cooperates closely with the CPU caches; a slab is therefore often simply called a cache.

    Memory management in the kernel

    For the kernel, the slab allocator mainly exposes two functions, kmalloc and kfree, for allocating and freeing small blocks of memory:
    kmalloc(size, flags): allocates a memory area of size bytes and returns a void pointer to its start; if not enough memory is available it returns NULL.

    kfree(ptr): frees the memory area pointed to by ptr.

    Kernel developers can also create a dedicated cache (a kmem_cache object) with kmem_cache_create;
    kmem_cache_alloc and kmem_cache_alloc_node then allocate objects of that specific type from such a cache. All of these paths eventually end up in slab_alloc, so the core of the slab logic lives in slab_alloc.
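
    Below is a minimal sketch of how these interfaces are typically used from a kernel module; struct my_obj, my_obj_cache and the constants are invented for illustration, and error handling is kept to a bare minimum:

    #include <linux/module.h>
    #include <linux/slab.h>

    struct my_obj {                     /* hypothetical fixed-size object */
        int id;
        char payload[120];
    };

    static struct kmem_cache *my_obj_cache;

    static int __init slab_demo_init(void)
    {
        void *buf;
        struct my_obj *obj;

        /* generic small allocation: served by one of the kmalloc_caches */
        buf = kmalloc(256, GFP_KERNEL);
        if (!buf)
            return -ENOMEM;
        kfree(buf);

        /* dedicated cache for one object type */
        my_obj_cache = kmem_cache_create("my_obj", sizeof(struct my_obj),
                                         0, SLAB_HWCACHE_ALIGN, NULL);
        if (!my_obj_cache)
            return -ENOMEM;

        obj = kmem_cache_alloc(my_obj_cache, GFP_KERNEL);   /* ends up in slab_alloc() */
        if (obj)
            kmem_cache_free(my_obj_cache, obj);

        return 0;
    }

    static void __exit slab_demo_exit(void)
    {
        kmem_cache_destroy(my_obj_cache);
    }

    module_init(slab_demo_init);
    module_exit(slab_demo_exit);
    MODULE_LICENSE("GPL");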

    A slab cache consists of two parts: the cache object that holds the management data, and the individual slabs that hold the objects being managed.

    (Figure: slab cache)
    In the figure above, the "cache" is a kmem_cache, a "slab" is a group of page frames, and a cached object is returned as a void pointer. A kmem_cache manages many page frames across the memory nodes; on each node these page frames are kept on three lists: partially free, completely free, and fully allocated.

    Each kmem_cache manages exactly one type of slab object. The object size differs from cache to cache and is fixed at creation time; the cache aligns that size to the CPU cache line or to the size of a void * pointer, and then picks a suitable gfporder, which determines how many page frames are requested at a time. The rule is:

    1. Try increasing values of gfporder; a slab then spans 2^gfporder pages, i.e. PAGE_SIZE << gfporder bytes;
    2. PAGE_SIZE << gfporder = head + num * size + left_over
    3. As soon as left_over * 8 <= PAGE_SIZE << gfporder, this gfporder is chosen, because the resulting fragmentation (at most 1/8 of the slab) is considered acceptable.
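
    A quick worked example (assuming PAGE_SIZE = 4096 and ignoring the management head and the per-object freelist index): for 520-byte objects, gfporder = 0 gives a 4096-byte slab holding num = 7 objects with left_over = 456 bytes; 456 * 8 = 3648 <= 4096, so order 0 is accepted. For 1100-byte objects, order 0 leaves 796 bytes and 796 * 8 = 6368 > 4096, so the loop moves on to gfporder = 1: an 8192-byte slab holds 7 objects with 492 bytes left over, and 492 * 8 = 3936 <= 8192, so order 1 is chosen.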

    The corresponding function is calculate_slab_order:

    /**
     * calculate_slab_order - calculate size (page order) of slabs
     * @cachep: pointer to the cache that is being created
     * @size: size of objects to be created in this cache.
     * @flags: slab allocation flags
     *
     * Also calculates the number of objects per slab.
     *
     * This could be made much more intelligent. For now, try to avoid using
     * high order pages for slabs. When the gfp() functions are more friendly
     * towards high-order requests, this should be changed.
     */
    static size_t calculate_slab_order(struct kmem_cache *cachep,
                    size_t size, slab_flags_t flags)
    {
        size_t left_over = 0;
        int gfporder;

        for (gfporder = 0; gfporder <= KMALLOC_MAX_ORDER; gfporder++) {
            unsigned int num;
            size_t remainder;

            num = cache_estimate(gfporder, size, flags, &remainder);
            if (!num)
                continue;

            /* Can't handle number of objects more than SLAB_OBJ_MAX_NUM */
            if (num > SLAB_OBJ_MAX_NUM)
                break;

            if (flags & CFLGS_OFF_SLAB) {
                struct kmem_cache *freelist_cache;
                size_t freelist_size;

                freelist_size = num * sizeof(freelist_idx_t);
                freelist_cache = kmalloc_slab(freelist_size, 0u);
                if (!freelist_cache)
                    continue;

                /*
                 * Needed to avoid possible looping condition
                 * in cache_grow_begin()
                 */
                if (OFF_SLAB(freelist_cache))
                    continue;

                /* check if off slab has enough benefit */
                if (freelist_cache->size > cachep->size / 2)
                    continue;
            }

            /* Found something acceptable - save it away */
            cachep->num = num;
            cachep->gfporder = gfporder;
            left_over = remainder;

            /*
             * A VFS-reclaimable slab tends to have most allocations
             * as GFP_NOFS and we really don't want to have to be allocating
             * higher-order pages when we are unable to shrink dcache.
             */
            if (flags & SLAB_RECLAIM_ACCOUNT)
                break;

            /*
             * Large number of objects is good, but very large slabs are
             * currently bad for the gfp()s.
             */
            if (gfporder >= slab_max_order)
                break;

            /*
             * Acceptable internal fragmentation?
             */
            if (left_over * 8 <= (PAGE_SIZE << gfporder))
                break;
        }
        return left_over;
    }

    /*
     * Calculate the number of objects and left-over bytes for a given buffer size.
     */
    static unsigned int cache_estimate(unsigned long gfporder, size_t buffer_size,
            slab_flags_t flags, size_t *left_over)
    {
        unsigned int num;
        size_t slab_size = PAGE_SIZE << gfporder;

        /*
         * The slab management structure can be either off the slab or
         * on it. For the latter case, the memory allocated for a
         * slab is used for:
         *
         * - @buffer_size bytes for each object
         * - One freelist_idx_t for each object
         *
         * We don't need to consider alignment of freelist because
         * freelist will be at the end of slab page. The objects will be
         * at the correct alignment.
         *
         * If the slab management structure is off the slab, then the
         * alignment will already be calculated into the size. Because
         * the slabs are all pages aligned, the objects will be at the
         * correct alignment when allocated.
         */
        if (flags & (CFLGS_OBJFREELIST_SLAB | CFLGS_OFF_SLAB)) {
            num = slab_size / buffer_size;
            *left_over = slab_size % buffer_size;
        } else {
            num = slab_size / (buffer_size + sizeof(freelist_idx_t));
            *left_over = slab_size %
                (buffer_size + sizeof(freelist_idx_t));
        }

        return num;
    }

    All caches in the system are kept on a doubly linked list, which lets the kernel traverse every cache. This is mainly used to shrink the memory held by the caches; the classic example is reclaiming the dentry and inode slab caches: when physical memory runs low, this part of memory is shrunk (it is reported as SReclaimable, see cat /proc/meminfo).

    Basic structures

    kmem_cache represents one slab cache. It holds the cache metadata: the cache name, the object size, the number of page frames per slab, colouring information and so on. A __percpu array_cache describes the cache's slab objects on each CPU, and kmem_cache_node manages the slab allocations on each memory node.

    array_cache is a per-CPU structure, so it can be accessed without locking; it is the structure that interacts directly with the CPU cache. A free slab object is always taken with entry[--avail]; when avail reaches 0, another batchcount free objects are pulled from kmem_cache_node into the array_cache.

    kmem_cache_node manages the slabs themselves (the buddy page frames that actually store the objects). It keeps three lists: partially free (partial), completely free (free/empty) and fully used (full). When an array_cache needs batchcount free objects, they are taken from the partial slabs first; if that is not enough, the rest comes from the free slabs; and if that still is not enough, the cache has to grow and allocate new slab page frames.
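
    The interplay between the per-CPU array_cache and the node lists can be modelled in a few lines of userspace C; everything here (names, batch size, object counts) is invented and only mirrors the refill order described above:

    #include <stdio.h>

    #define BATCHCOUNT 4

    struct toy_node {            /* stands in for kmem_cache_node */
        int partial_objs;        /* free objects on partially used slabs */
        int empty_objs;          /* free objects on completely free slabs */
    };

    struct toy_ac {              /* stands in for the per-CPU array_cache */
        int avail;
        long entry[BATCHCOUNT];
    };

    static long next_obj;        /* fake object "addresses" */

    static void refill(struct toy_ac *ac, struct toy_node *n)
    {
        int want = BATCHCOUNT;

        /* 1. partially free slabs first */
        while (want && n->partial_objs) { ac->entry[ac->avail++] = ++next_obj; n->partial_objs--; want--; }
        /* 2. then completely free slabs */
        while (want && n->empty_objs)   { ac->entry[ac->avail++] = ++next_obj; n->empty_objs--;   want--; }
        /* 3. finally "grow": new slab pages from the buddy system */
        while (want--)                    ac->entry[ac->avail++] = ++next_obj;
    }

    static long toy_alloc(struct toy_ac *ac, struct toy_node *n)
    {
        if (!ac->avail)                  /* slow path: pull batchcount objects from the node */
            refill(ac, n);
        return ac->entry[--ac->avail];   /* fast path: lock-free LIFO pop, entry[--avail] */
    }

    int main(void)
    {
        struct toy_ac ac = { 0 };
        struct toy_node node = { .partial_objs = 3, .empty_objs = 2 };

        for (int i = 0; i < 10; i++)
            printf("alloc %d -> obj %ld\n", i, toy_alloc(&ac, &node));
        return 0;
    }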

    The page frame needs little introduction: it is the physical storage. struct page is a union; when a page is used as a slab, its slab-management fields are initialized: s_mem (the address of the first object), the lru list node, the active count, a pointer back to the owning kmem_cache, and freelist, the array of free-object indices (declared as void * but actually holding a char or short array).
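
    A small sketch of how such a freelist index turns into an object address, in the spirit of index_to_obj(); the sizes and indices below are made up:

    #include <stdio.h>

    typedef unsigned short freelist_idx_t;   /* char or short, depending on objects per slab */

    static void *index_to_obj(char *s_mem, unsigned int size, freelist_idx_t idx)
    {
        return s_mem + (size_t)size * idx;   /* object i lives at s_mem + i * object size */
    }

    int main(void)
    {
        char slab_page[4096];                       /* pretend this is the slab's page */
        freelist_idx_t freelist[] = { 3, 0, 2, 1 }; /* shuffled free-object indices */
        unsigned int active = 0;                    /* next freelist slot to hand out */

        void *obj = index_to_obj(slab_page, 520, freelist[active++]);
        printf("first object handed out: %p (index %u)\n", obj, freelist[0]);
        return 0;
    }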

    The concrete data structures:

    struct kmem_cache {
    /* 0) per-CPU data, touched during every alloc/free */
        struct array_cache __percpu *cpu_cache; /* the cache's slab objects on each CPU */

    /* 1) Cache tunables. Protected by slab_mutex */
        unsigned int batchcount; /* number of objects fetched from the node when the per-CPU
                                  * cpu_cache is empty; also the number of objects allocated
                                  * when the cache grows. Starts at 1 and is tuned later. */
        unsigned int limit;      /* upper bound on objects in the per-CPU cpu_cache; once a free
                                  * would exceed it, part of the array_cache is flushed back to
                                  * the kmem_cache_node page frames. */
        unsigned int shared;

        unsigned int size;       /* size of each object in the slab */
        struct reciprocal_value reciprocal_buffer_size;

    /* 2) touched by every alloc & free from the backend */
        slab_flags_t flags;      /* constant flags */
        unsigned int num;        /* # of objs per slab */

    /* 3) cache_grow/shrink */
        /* order of pgs per slab (2^n) */
        unsigned int gfporder;   /* number of pages per slab */

        /* force GFP flags, e.g. GFP_DMA */
        gfp_t allocflags;

        size_t colour;           /* cache colouring range */
        unsigned int colour_off; /* colour offset */
        struct kmem_cache *freelist_cache; /* cache used for off-slab freelist management */
        unsigned int freelist_size;        /* size in bytes of the per-slab freelist */

        /* constructor func */
        void (*ctor)(void *obj); /* the matching destructor was removed from the kernel long ago */

    /* 4) cache creation/removal */
        const char *name;
        struct list_head list;
        int refcount;
        int object_size;
        int align;

    /* 5) statistics */
    #ifdef CONFIG_DEBUG_SLAB
        unsigned long num_active;
        unsigned long num_allocations;
        unsigned long high_mark;
        unsigned long grown;
        unsigned long reaped;
        unsigned long errors;
        unsigned long max_freeable;
        unsigned long node_allocs;
        unsigned long node_frees;
        unsigned long node_overflow;
        atomic_t allochit;
        atomic_t allocmiss;
        atomic_t freehit;
        atomic_t freemiss;
    #ifdef CONFIG_DEBUG_SLAB_LEAK
        atomic_t store_user_clean;
    #endif

        /*
         * If debugging is enabled, then the allocator can add additional
         * fields and/or padding to every object. 'size' contains the total
         * object size including these internal fields, while 'obj_offset'
         * and 'object_size' contain the offset to the user object and its
         * size.
         */
        int obj_offset;
    #endif /* CONFIG_DEBUG_SLAB */

    #ifdef CONFIG_MEMCG
        struct memcg_cache_params memcg_params;
    #endif
    #ifdef CONFIG_KASAN
        struct kasan_cache kasan_info;
    #endif

    #ifdef CONFIG_SLAB_FREELIST_RANDOM
        unsigned int *random_seq;
    #endif

        unsigned int useroffset; /* Usercopy region offset */
        unsigned int usersize;   /* Usercopy region size */

        struct kmem_cache_node *node[MAX_NUMNODES]; /* slab state on each memory node; every node keeps
                                                     * the partially free, full and completely free lists */
    };

    struct array_cache {
        unsigned int avail;      /* number of objects currently available in the array */
        unsigned int limit;      /* same meaning as in kmem_cache */
        unsigned int batchcount; /* same meaning as in kmem_cache */
        unsigned int touched;    /* cleared when the cache shrinks, set when an object is taken,
                                  * so the kernel can tell whether the array was used since the
                                  * last shrink */
        void *entry[]; /*
                        * Must have this definition in here for the proper
                        * alignment of array_cache. Also simplifies accessing
                        * the entries.
                        */
    };

    struct kmem_cache_node {
        spinlock_t list_lock;

    #ifdef CONFIG_SLAB
        struct list_head slabs_partial; /* partial list first, better asm code */
        struct list_head slabs_full;
        struct list_head slabs_free;
        unsigned long total_slabs;      /* length of all slab lists */
        unsigned long free_slabs;       /* length of free slab list only */
        unsigned long free_objects;
        unsigned int free_limit;
        unsigned int colour_next;       /* Per-node cache coloring */
        struct array_cache *shared;     /* shared per node */
        struct alien_cache **alien;     /* on other nodes */
        unsigned long next_reap;        /* updated without locking */
        int free_touched;               /* updated without locking */
    #endif

    #ifdef CONFIG_SLUB
        unsigned long nr_partial;
        struct list_head partial;
    #ifdef CONFIG_SLUB_DEBUG
        atomic_long_t nr_slabs;
        atomic_long_t total_objects;
        struct list_head full;
    #endif
    #endif
    };

    kmalloc

    The basic call structure of kmalloc is as follows:

    It consists of two main steps:

    1. Look up the kmem_cache in kmalloc_caches (the kmalloc_caches array is built during slab initialization).
    2. Allocate a slab object from that kmem_cache via slab_alloc and return it.

    The details of slab_alloc are discussed later; a rough sketch of the lookup step follows.
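
    As a rough, non-authoritative sketch of step 1: the requested size is mapped to a power-of-two bucket index into kmalloc_caches. The real kernel helpers (kmalloc_slab()/kmalloc_index()) also special-case the 96- and 192-byte caches and enforce a minimum size; this only illustrates the idea:

    #include <stdio.h>
    #include <stddef.h>

    static unsigned int kmalloc_bucket(size_t size)
    {
        unsigned int index = 3;                  /* assume the smallest bucket is 2^3 = 8 bytes */

        while (((size_t)1 << index) < size)
            index++;
        return index;                            /* kmalloc_caches[type][index] would serve this size */
    }

    int main(void)
    {
        size_t sizes[] = { 8, 100, 250, 4000 };

        for (int i = 0; i < 4; i++)
            printf("kmalloc(%zu) -> kmalloc-%zu\n",
                   sizes[i], (size_t)1 << kmalloc_bucket(sizes[i]));
        return 0;
    }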

    kfree

    kfree eventually ends up in ___cache_free; we look at it in detail in the slab free section.

    Initialization

    Most of the kernel's management structures get their memory from kmalloc, which creates a chicken-and-egg problem for the slab allocator's own data structures: they are far smaller than a page and therefore exactly what kmalloc is made for, yet kmalloc only works once the slab allocator has been initialized.


    The kernel therefore performs some setup at boot so that later code can always find a kmem_cache to use with kmalloc; this mechanism is the kmalloc_caches array mentioned earlier. During slab initialization the kernel first sets up kmalloc_caches; afterwards the kmem_cache backing any slab object can be taken from kmalloc_caches, so memory for the slab allocator's own data structures can be allocated the same way. The kmem_cache objects themselves are stored in the kmem_cache_boot cache.

    kmem_cache_init is the function that initializes the slab allocator; its main job is to set up kmalloc_caches via create_kmalloc_caches. The kmem_cache objects stored in the kmalloc_caches array are themselves allocated from kmem_cache_boot (a kmem_cache has a fixed size, so a single cache is enough for all of them). kmem_cache_init is called from start_kernel. The implementation looks like this:

    struct kmem_cache *
    kmalloc_caches[NR_KMALLOC_TYPES][KMALLOC_SHIFT_HIGH + 1] __ro_after_init;
    EXPORT_SYMBOL(kmalloc_caches); /* [zone][index] */

    /*
     * Initialisation. Called after the page allocator have been initialised and
     * before smp_init().
     */
    void __init kmem_cache_init(void)
    {
        int i;

        kmem_cache = &kmem_cache_boot;

        if (!IS_ENABLED(CONFIG_NUMA) || num_possible_nodes() == 1)
            use_alien_caches = 0;

        for (i = 0; i < NUM_INIT_LISTS; i++)
            kmem_cache_node_init(&init_kmem_cache_node[i]); /* initialize the bootstrap kmem_cache_node
                                                             * structures that back kmem_cache_boot; they are
                                                             * wired up in __kmem_cache_create via
                                                             * set_up_node(kmem_cache, CACHE_CACHE) and finally
                                                             * replaced in step 5) below */

        /*
         * Fragmentation resistance on low memory - only use bigger
         * page orders on machines with more than 32MB of memory if
         * not overridden on the command line.
         */
        if (!slab_max_order_set && totalram_pages() > (32 << 20) >> PAGE_SHIFT)
            slab_max_order = SLAB_MAX_ORDER_HI; /* raise the cap on the buddy order a slab may request */

        /* Bootstrap is tricky, because several objects are allocated
         * from caches that do not exist yet:
         * 1) initialize the kmem_cache cache: it contains the struct
         *    kmem_cache structures of all caches, except kmem_cache itself:
         *    kmem_cache is statically allocated.
         *    Initially an __init data area is used for the head array and the
         *    kmem_cache_node structures, it's replaced with a kmalloc allocated
         *    array at the end of the bootstrap.
         * 2) Create the first kmalloc cache.
         *    The struct kmem_cache for the new cache is allocated normally.
         *    An __init data area is used for the head array.
         * 3) Create the remaining kmalloc caches, with minimally sized
         *    head arrays.
         * 4) Replace the __init data head arrays for kmem_cache and the first
         *    kmalloc cache with kmalloc allocated arrays.
         * 5) Replace the __init data for kmem_cache_node for kmem_cache and
         *    the other cache's with kmalloc allocated memory.
         * 6) Resize the head arrays of the kmalloc caches to their final sizes.
         */

        /* 1) create the kmem_cache */

        /*
         * struct kmem_cache size depends on nr_node_ids & nr_cpu_ids
         */
        create_boot_cache(kmem_cache, "kmem_cache",
            offsetof(struct kmem_cache, node) +
                    nr_node_ids * sizeof(struct kmem_cache_node *),
                    SLAB_HWCACHE_ALIGN, 0, 0);
        list_add(&kmem_cache->list, &slab_caches);
        memcg_link_cache(kmem_cache);
        slab_state = PARTIAL;

        /*
         * Initialize the caches that provide memory for the kmem_cache_node
         * structures first. Without this, further allocations will bug.
         */
        kmalloc_caches[KMALLOC_NORMAL][INDEX_NODE] = create_kmalloc_cache(
                    kmalloc_info[INDEX_NODE].name,
                    kmalloc_size(INDEX_NODE), ARCH_KMALLOC_FLAGS,
                    0, kmalloc_size(INDEX_NODE));
        slab_state = PARTIAL_NODE;
        setup_kmalloc_cache_index_table();

        slab_early_init = 0;

        /* 5) Replace the bootstrap kmem_cache_node */
        {
            int nid;

            for_each_online_node(nid) {
                init_list(kmem_cache, &init_kmem_cache_node[CACHE_CACHE + nid], nid);

                init_list(kmalloc_caches[KMALLOC_NORMAL][INDEX_NODE],
                          &init_kmem_cache_node[SIZE_NODE + nid], nid);
            }
        }

        create_kmalloc_caches(ARCH_KMALLOC_FLAGS);
    }

    /*
     * Create the kmalloc array. Some of the regular kmalloc arrays
     * may already have been created because they were needed to
     * enable allocations for slab creation.
     */
    void __init create_kmalloc_caches(slab_flags_t flags)
    {
        int i, type;

        /* initialize kmalloc_caches along two dimensions: zone type and size */
        for (type = KMALLOC_NORMAL; type <= KMALLOC_RECLAIM; type++) {
            for (i = KMALLOC_SHIFT_LOW; i <= KMALLOC_SHIFT_HIGH; i++) {
                if (!kmalloc_caches[type][i])
                    new_kmalloc_cache(i, type, flags);

                /*
                 * Caches that are not of the two-to-the-power-of size.
                 * These have to be created immediately after the
                 * earlier power of two caches
                 */
                if (KMALLOC_MIN_SIZE <= 32 && i == 6 &&
                        !kmalloc_caches[type][1])
                    new_kmalloc_cache(1, type, flags);
                if (KMALLOC_MIN_SIZE <= 64 && i == 7 &&
                        !kmalloc_caches[type][2])
                    new_kmalloc_cache(2, type, flags);
            }
        }

        /* Kmalloc array is now usable */
        slab_state = UP;

    #ifdef CONFIG_ZONE_DMA
        for (i = 0; i <= KMALLOC_SHIFT_HIGH; i++) {
            struct kmem_cache *s = kmalloc_caches[KMALLOC_NORMAL][i];

            if (s) {
                unsigned int size = kmalloc_size(i);
                const char *n = kmalloc_cache_name("dma-kmalloc", size);

                BUG_ON(!n);
                kmalloc_caches[KMALLOC_DMA][i] = create_kmalloc_cache(
                    n, size, SLAB_CACHE_DMA | flags, 0, 0);
            }
        }
    #endif
    }

    static void __init
    new_kmalloc_cache(int idx, int type, slab_flags_t flags)
    {
        const char *name;

        if (type == KMALLOC_RECLAIM) {
            flags |= SLAB_RECLAIM_ACCOUNT;
            name = kmalloc_cache_name("kmalloc-rcl",
                            kmalloc_info[idx].size);
            BUG_ON(!name);
        } else {
            name = kmalloc_info[idx].name;
        }

        kmalloc_caches[type][idx] = create_kmalloc_cache(name,
                        kmalloc_info[idx].size, flags, 0,
                        kmalloc_info[idx].size);
    }

    /* internal cache of cache description objs */
    static struct kmem_cache kmem_cache_boot = {
        .batchcount = 1,
        .limit = BOOT_CPUCACHE_ENTRIES,
        .shared = 1,
        .size = sizeof(struct kmem_cache),
        .name = "kmem_cache",
    };

    /*
     * Allocate the kmem_cache object with kmem_cache_zalloc (allocate the
     * requested size and zero it), then fill it in via create_boot_cache.
     */
    struct kmem_cache *__init create_kmalloc_cache(const char *name,
            unsigned int size, slab_flags_t flags,
            unsigned int useroffset, unsigned int usersize)
    {
        /* kmem_cache == kmem_cache_boot at this point */
        struct kmem_cache *s = kmem_cache_zalloc(kmem_cache, GFP_NOWAIT);

        if (!s)
            panic("Out of memory when creating slab %s\n", name);

        create_boot_cache(s, name, size, flags, useroffset, usersize); /* ends up in __kmem_cache_create, which
                                                                        * fills in the kmem_cache; covered in
                                                                        * the cache-creation section */
        list_add(&s->list, &slab_caches);
        memcg_link_cache(s);
        s->refcount = 1;
        return s;
    }

    The basic call graph is as follows:

    Slab initialization mainly sets up kmem_cache_boot and the kmalloc_caches array, so that core kernel code can later simply call kmalloc without managing slab metadata itself, and so that the slab allocator's own data structures can in turn be allocated with kmalloc. All kmem_cache objects are managed by kmem_cache_boot, and the kmem_cache used by a kmalloc call is always one of the kmem_cache objects created in kmalloc_caches during initialization. The key function that actually creates a kmem_cache is __kmem_cache_create, described next.

    __kmem_cache_create already appeared during initialization; kmem_cache_create likewise ends up calling __kmem_cache_create to initialize the fields of a kmem_cache. The call graph is:

    The purpose of slab create is to fill in the initial values of a kmem_cache: batchcount, size, align, useroffset, usersize, flags, colour, num, gfporder and so on. Note in particular that batchcount and limit are set to 1 at creation time and that no page frames are allocated yet to hold slab objects; that only happens when slab alloc finds no free object and has to grow the cache.

    From the size of the objects to be managed, slab create computes the buddy-page order gfporder managed by kmem_cache_node; such a group of buddy pages is called a slab. From the page count and the object size, the number of objects per slab follows as num = floor(((PAGE_SIZE << gfporder) - head) / size).
    The code:

    /**
     * __kmem_cache_create - Create a cache.
     * @cachep: cache management descriptor
     * @flags: SLAB flags
     *
     * Returns a ptr to the cache on success, NULL on failure.
     * Cannot be called within a int, but can be interrupted.
     * The @ctor is run when new pages are allocated by the cache.
     *
     * The flags are
     *
     * %SLAB_POISON - Poison the slab with a known test pattern (a5a5a5a5)
     * to catch references to uninitialised memory.
     *
     * %SLAB_RED_ZONE - Insert `Red' zones around the allocated memory to check
     * for buffer overruns.
     *
     * %SLAB_HWCACHE_ALIGN - Align the objects in this cache to a hardware
     * cacheline. This can be beneficial if you're counting cycles as closely
     * as davem.
     */
    int __kmem_cache_create(struct kmem_cache *cachep, slab_flags_t flags)
    {
        size_t ralign = BYTES_PER_WORD; /* word alignment */
        gfp_t gfp;
        int err;
        unsigned int size = cachep->size;

        ... /* debug info */

        /*
         * Check that size is in terms of words. This is needed to avoid
         * unaligned accesses for some archs when redzoning is used, and makes
         * sure any on-slab bufctl's are also correctly aligned.
         */
        size = ALIGN(size, BYTES_PER_WORD);

        if (flags & SLAB_RED_ZONE) {
            ralign = REDZONE_ALIGN;
            /* If redzoning, ensure that the second redzone is suitably
             * aligned, by adjusting the object size accordingly. */
            /* red zones place a known byte pattern before and after each object,
             * which makes invalid accesses easy to catch while debugging */
            size = ALIGN(size, REDZONE_ALIGN);
        }

        /* 3) caller mandated alignment */
        if (ralign < cachep->align) {
            ralign = cachep->align;
        }
        /* disable debug if necessary */
        if (ralign > __alignof__(unsigned long long))
            flags &= ~(SLAB_RED_ZONE | SLAB_STORE_USER);
        /*
         * 4) Store it.
         */
        cachep->align = ralign;
        cachep->colour_off = cache_line_size();
        /* Offset must be a multiple of the alignment. */
        if (cachep->colour_off < cachep->align)
            cachep->colour_off = cachep->align;

        if (slab_is_available())
            gfp = GFP_KERNEL;
        else
            gfp = GFP_NOWAIT;

        ... /* debug info */

        kasan_cache_create(cachep, &size, &flags);

        size = ALIGN(size, cachep->align);
        /*
         * We should restrict the number of objects in a slab to implement
         * byte sized index. Refer comment on SLAB_OBJ_MIN_SIZE definition.
         */
        if (FREELIST_BYTE_INDEX && size < SLAB_OBJ_MIN_SIZE)
            size = ALIGN(SLAB_OBJ_MIN_SIZE, cachep->align);

        ... /* debug info */

        if (set_objfreelist_slab_cache(cachep, size, flags)) { /* computes gfporder and num */
            flags |= CFLGS_OBJFREELIST_SLAB;
            goto done;
        }

        if (set_off_slab_cache(cachep, size, flags)) { /* computes gfporder and num */
            flags |= CFLGS_OFF_SLAB;
            goto done;
        }

        if (set_on_slab_cache(cachep, size, flags)) /* computes gfporder and num */
            goto done;

        return -E2BIG;

    done:
        cachep->freelist_size = cachep->num * sizeof(freelist_idx_t);
        cachep->flags = flags;
        cachep->allocflags = __GFP_COMP;
        if (flags & SLAB_CACHE_DMA)
            cachep->allocflags |= GFP_DMA;
        if (flags & SLAB_CACHE_DMA32)
            cachep->allocflags |= GFP_DMA32;
        if (flags & SLAB_RECLAIM_ACCOUNT)
            cachep->allocflags |= __GFP_RECLAIMABLE;
        cachep->size = size;
        cachep->reciprocal_buffer_size = reciprocal_value(size);

    #if DEBUG
        /*
         * If we're going to use the generic kernel_map_pages()
         * poisoning, then it's going to smash the contents of
         * the redzone and userword anyhow, so switch them off.
         */
        if (IS_ENABLED(CONFIG_PAGE_POISONING) &&
            (cachep->flags & SLAB_POISON) &&
            is_debug_pagealloc_cache(cachep))
            cachep->flags &= ~(SLAB_RED_ZONE | SLAB_STORE_USER);
    #endif

        if (OFF_SLAB(cachep)) {
            cachep->freelist_cache =
                kmalloc_slab(cachep->freelist_size, 0u);
        }

        err = setup_cpu_cache(cachep, gfp); /* initialize the array_caches and kmem_cache_node */
        if (err) {
            __kmem_cache_release(cachep);
            return err;
        }

        return 0;
    }

    static int __ref setup_cpu_cache(struct kmem_cache *cachep, gfp_t gfp)
    {
        if (slab_state >= FULL)
            return enable_cpucache(cachep, gfp);

        cachep->cpu_cache = alloc_kmem_cache_cpus(cachep, 1, 1);
        if (!cachep->cpu_cache)
            return 1;

        /* init kmem_cache_node */
        if (slab_state == DOWN) {
            /* Creation of first cache (kmem_cache). */
            set_up_node(kmem_cache, CACHE_CACHE);
        } else if (slab_state == PARTIAL) {
            /* For kmem_cache_node */
            set_up_node(cachep, SIZE_NODE);
        } else {
            int node;

            for_each_online_node(node) {
                cachep->node[node] = kmalloc_node(
                    sizeof(struct kmem_cache_node), gfp, node);
                BUG_ON(!cachep->node[node]);
                kmem_cache_node_init(cachep->node[node]);
            }
        }

        cachep->node[numa_mem_id()]->next_reap =
                jiffies + REAPTIMEOUT_NODE +
                ((unsigned long)cachep) % REAPTIMEOUT_NODE;

        /* init array_caches */
        cpu_cache_get(cachep)->avail = 0;
        cpu_cache_get(cachep)->limit = BOOT_CPUCACHE_ENTRIES;
        cpu_cache_get(cachep)->batchcount = 1;
        cpu_cache_get(cachep)->touched = 0;
        cachep->batchcount = 1;
        cachep->limit = BOOT_CPUCACHE_ENTRIES;
        return 0;
    }

    kmalloc_node deserves a separate mention here, because it is the function that allocates the kmem_cache_node structures. The call chain is:

    The relevant code:

    static __always_inline void *kmalloc_node(size_t size, gfp_t flags, int node)
    {
        ... /* the SLOB variant is not considered here */
        return __kmalloc_node(size, flags, node);
    }

    void *__kmalloc_node(size_t size, gfp_t flags, int node)
    {
        return __do_kmalloc_node(size, flags, node, _RET_IP_);
    }

    static __always_inline void *
    __do_kmalloc_node(size_t size, gfp_t flags, int node, unsigned long caller)
    {
        struct kmem_cache *cachep;
        void *ret;

        if (unlikely(size > KMALLOC_MAX_CACHE_SIZE))
            return NULL;
        cachep = kmalloc_slab(size, flags);
        if (unlikely(ZERO_OR_NULL_PTR(cachep)))
            return cachep;
        ret = kmem_cache_alloc_node_trace(cachep, flags, node, size);
        ret = kasan_kmalloc(cachep, ret, size, flags);

        return ret;
    }

    void *kmem_cache_alloc_node_trace(struct kmem_cache *cachep,
                      gfp_t flags,
                      int nodeid,
                      size_t size)
    {
        void *ret;

        ret = slab_alloc_node(cachep, flags, nodeid, _RET_IP_);

        ret = kasan_kmalloc(cachep, ret, size, flags);
        trace_kmalloc_node(_RET_IP_, ret,
                   size, cachep->size,
                   flags, nodeid);
        return ret;
    }

    static __always_inline void *
    slab_alloc_node(struct kmem_cache *cachep, gfp_t flags, int nodeid,
               unsigned long caller)
    {
        unsigned long save_flags;
        void *ptr;
        int slab_node = numa_mem_id();

        /* parameter and environment checks */
        flags &= gfp_allowed_mask;
        cachep = slab_pre_alloc_hook(cachep, flags);
        if (unlikely(!cachep))
            return NULL;

        cache_alloc_debugcheck_before(cachep, flags);
        local_irq_save(save_flags);

        if (nodeid == NUMA_NO_NODE)
            nodeid = slab_node;

        if (unlikely(!get_node(cachep, nodeid))) {
            /* Node not bootstrapped yet */
            ptr = fallback_alloc(cachep, flags);
            goto out;
        }

        if (nodeid == slab_node) {
            /*
             * Use the locally cached objects if possible.
             * However ____cache_alloc does not allow fallback
             * to other nodes. It may fail while we still have
             * objects on other nodes available.
             */
            ptr = ____cache_alloc(cachep, flags); /* take the slab object that will hold the
                                                   * kmem_cache_node straight from the local array_cache */
            if (ptr)
                goto out;
        }
        /* ___cache_alloc_node can fall back to other nodes */
        ptr = ____cache_alloc_node(cachep, flags, nodeid);
    out:
        local_irq_restore(save_flags);
        ptr = cache_alloc_debugcheck_after(cachep, flags, ptr, caller);

        if (unlikely(flags & __GFP_ZERO) && ptr)
            memset(ptr, 0, cachep->object_size);

        slab_post_alloc_hook(cachep, flags, 1, &ptr);
        return ptr;
    }

    /*
     * A interface to enable slab creation on nodeid
     */
    static void *____cache_alloc_node(struct kmem_cache *cachep, gfp_t flags,
                    int nodeid)
    {
        struct page *page;
        struct kmem_cache_node *n;
        void *obj = NULL;
        void *list = NULL;

        VM_BUG_ON(nodeid < 0 || nodeid >= MAX_NUMNODES);
        n = get_node(cachep, nodeid);
        BUG_ON(!n);

        check_irq_off();
        spin_lock(&n->list_lock);
        page = get_first_slab(n, false);
        if (!page)
            goto must_grow;

        check_spinlock_acquired_node(cachep, nodeid);

        STATS_INC_NODEALLOCS(cachep);
        STATS_INC_ACTIVE(cachep);
        STATS_SET_HIGH(cachep);

        BUG_ON(page->active == cachep->num);

        obj = slab_get_obj(cachep, page);
        n->free_objects--;

        fixup_slab_list(cachep, n, page, &list);

        spin_unlock(&n->list_lock);
        fixup_objfreelist_debug(cachep, &list);
        return obj;

    must_grow:
        spin_unlock(&n->list_lock);
        page = cache_grow_begin(cachep, gfp_exact_node(flags), nodeid);
        if (page) {
            /* This slab isn't counted yet so don't update free_objects */
            obj = slab_get_obj(cachep, page);
        }
        cache_grow_end(cachep, page);

        return obj ? obj : fallback_alloc(cachep, flags);
    }

    As the code above shows, the slab allocator's own management structures are themselves managed as slab objects.

    slab alloc

    The basic slab alloc path eventually reaches ____cache_alloc: if a free object can be found in the array_cache it is returned immediately; otherwise cache_alloc_refill refills the array_cache with free objects.

    The relevant code:

    static inline void *____cache_alloc(struct kmem_cache *cachep, gfp_t flags)
    {
        void *objp;
        struct array_cache *ac;

        check_irq_off();

        ac = cpu_cache_get(cachep);
        if (likely(ac->avail)) {
            ac->touched = 1;
            objp = ac->entry[--ac->avail];

            STATS_INC_ALLOCHIT(cachep);
            goto out;
        }

        STATS_INC_ALLOCMISS(cachep);
        objp = cache_alloc_refill(cachep, flags);
        /*
         * the 'ac' may be updated by cache_alloc_refill(),
         * and kmemleak_erase() requires its correct value.
         */
        ac = cpu_cache_get(cachep);

    out:
        /*
         * To avoid a false negative, if an object that is in one of the
         * per-CPU caches is leaked, we need to make sure kmemleak doesn't
         * treat the array pointers as a reference to the object.
         */
        if (objp)
            kmemleak_erase(&ac->entry[ac->avail]);
        return objp;
    }

    static void *cache_alloc_refill(struct kmem_cache *cachep, gfp_t flags)
    {
        int batchcount;
        struct kmem_cache_node *n;
        struct array_cache *ac, *shared;
        int node;
        void *list = NULL;
        struct page *page;

        check_irq_off();
        node = numa_mem_id();

        ac = cpu_cache_get(cachep);
        batchcount = ac->batchcount;
        if (!ac->touched && batchcount > BATCHREFILL_LIMIT) {
            /*
             * If there was little recent activity on this cache, then
             * perform only a partial refill. Otherwise we could generate
             * refill bouncing.
             */
            batchcount = BATCHREFILL_LIMIT; /* 16 */
        }
        n = get_node(cachep, node);

        BUG_ON(ac->avail > 0 || !n);
        shared = READ_ONCE(n->shared);
        if (!n->free_objects && (!shared || !shared->avail))
            goto direct_grow;

        spin_lock(&n->list_lock);
        shared = READ_ONCE(n->shared);

        /* See if we can refill from the shared array */
        if (shared && transfer_objects(ac, shared, batchcount)) {
            shared->touched = 1;
            goto alloc_done;
        }

        while (batchcount > 0) {
            /* Get slab alloc is to come from. */
            page = get_first_slab(n, false);
            if (!page)
                goto must_grow;

            check_spinlock_acquired(cachep);

            batchcount = alloc_block(cachep, ac, page, batchcount);
            fixup_slab_list(cachep, n, page, &list);
        }

    must_grow:
        n->free_objects -= ac->avail;
    alloc_done:
        spin_unlock(&n->list_lock);
        fixup_objfreelist_debug(cachep, &list);

    direct_grow:
        if (unlikely(!ac->avail)) {
            /* Check if we can use obj in pfmemalloc slab */
            if (sk_memalloc_socks()) {
                void *obj = cache_alloc_pfmemalloc(cachep, n, flags);

                if (obj)
                    return obj;
            }

            page = cache_grow_begin(cachep, gfp_exact_node(flags), node); /* allocate a 2^gfporder buddy block
                                                                           * and initialize its freelist */

            /*
             * cache_grow_begin() can reenable interrupts,
             * then ac could change.
             */
            ac = cpu_cache_get(cachep);
            if (!ac->avail && page)
                alloc_block(cachep, ac, page, batchcount);
            cache_grow_end(cachep, page); /* put the newly allocated slab on the node's
                                           * free, partial or full list */

            if (!ac->avail)
                return NULL;
        }
        ac->touched = 1;

        return ac->entry[--ac->avail];
    }

    static struct page *get_first_slab(struct kmem_cache_node *n, bool pfmemalloc)
    {
        struct page *page;

        assert_spin_locked(&n->list_lock);
        page = list_first_entry_or_null(&n->slabs_partial, struct page, lru); /* try a partially free slab first */
        if (!page) {
            n->free_touched = 1;
            page = list_first_entry_or_null(&n->slabs_free, struct page, lru); /* then a completely free slab */
            if (page)
                n->free_slabs--;
        }

        if (sk_memalloc_socks()) /* pfmemalloc special case; how this optimization works is not covered here */
            page = get_valid_first_slab(n, page, pfmemalloc);

        return page;
    }

    /*
     * Grow (by 1) the number of slabs within a cache. This is called by
     * kmem_cache_alloc() when there are no active objs left in a cache.
     */
    static struct page *cache_grow_begin(struct kmem_cache *cachep,
                    gfp_t flags, int nodeid)
    {
        ...
        /*
         * Be lazy and only check for valid flags here, keeping it out of the
         * critical path in kmem_cache_alloc().
         */
        ...
        /*
         * Get mem for the objs. Attempt to allocate a physical page from
         * 'nodeid'.
         */
        page = kmem_getpages(cachep, local_flags, nodeid); /* ask the buddy system on that node for
                                                            * 2^gfporder contiguous pages */
        if (!page)
            goto failed;

        page_node = page_to_nid(page);
        n = get_node(cachep, page_node);

        /* Get colour for the slab, and cal the next value. */
        ... /* slab colouring */

        /*
         * Call kasan_poison_slab() before calling alloc_slabmgmt(), so
         * page_address() in the latter returns a non-tagged pointer,
         * as it should be for slab pages.
         */
        kasan_poison_slab(page); /* poisoning makes unauthorized accesses easy to detect
                                  * while slabs are being set up and torn down */

        /* Get slab management. */
        freelist = alloc_slabmgmt(cachep, page, offset,
                local_flags & ~GFP_CONSTRAINT_MASK, page_node); /* address of the first free object,
                                                                 * normally right after the colour offset */
        if (OFF_SLAB(cachep) && !freelist)
            goto opps1;

        slab_map_pages(cachep, page, freelist); /* initialize the page's slab fields */

        cache_init_objs(cachep, page); /* scatter the object indices of this slab randomly into the freelist */

        if (gfpflags_allow_blocking(local_flags))
            local_irq_disable();

        return page;

    opps1:
        kmem_freepages(cachep, page);
    failed:
        if (gfpflags_allow_blocking(local_flags))
            local_irq_disable();
        return NULL;
    }

    static void cache_init_objs(struct kmem_cache *cachep,
                    struct page *page)
    {
        int i;
        void *objp;
        bool shuffled;

        cache_init_objs_debug(cachep, page);

        /* Try to randomize the freelist if enabled */
        shuffled = shuffle_freelist(cachep, page);

        if (!shuffled && OBJFREELIST_SLAB(cachep)) {
            page->freelist = index_to_obj(cachep, page, cachep->num - 1) +
                            obj_offset(cachep);
        }

        for (i = 0; i < cachep->num; i++) {
            objp = index_to_obj(cachep, page, i);
            objp = kasan_init_slab_obj(cachep, objp);

            /* constructor could break poison info */
            if (DEBUG == 0 && cachep->ctor) {
                kasan_unpoison_object_data(cachep, objp);
                cachep->ctor(objp);
                kasan_poison_object_data(cachep, objp);
            }

            if (!shuffled)
                set_free_obj(page, i, i);
        }
    }

    /*
     * Shuffle the freelist initialization state based on pre-computed lists.
     * return true if the list was successfully shuffled, false otherwise.
     */
    static bool shuffle_freelist(struct kmem_cache *cachep, struct page *page)
    {
        unsigned int objfreelist = 0, i, rand, count = cachep->num;
        union freelist_init_state state;
        bool precomputed;

        if (count < 2)
            return false;

        precomputed = freelist_state_initialize(&state, cachep, count);

        /* Take a random entry as the objfreelist */
        if (OBJFREELIST_SLAB(cachep)) {
            if (!precomputed)
                objfreelist = count - 1;
            else
                objfreelist = next_random_slot(&state);
            page->freelist = index_to_obj(cachep, page, objfreelist) +
                            obj_offset(cachep);
            count--;
        }

        /*
         * On early boot, generate the list dynamically.
         * Later use a pre-computed list for speed.
         */
        if (!precomputed) {
            for (i = 0; i < count; i++)
                set_free_obj(page, i, i);

            /* Fisher-Yates shuffle */
            for (i = count - 1; i > 0; i--) {
                rand = prandom_u32_state(&state.rnd_state);
                rand %= (i + 1);
                swap_free_obj(page, i, rand);
            }
        } else {
            for (i = 0; i < count; i++)
                set_free_obj(page, i, next_random_slot(&state));
        }

        if (OBJFREELIST_SLAB(cachep))
            set_free_obj(page, cachep->num - 1, objfreelist);

        return true;
    }

    The resulting layout of the slab objects is shown below. (In a real slab the initial freelist points to the objects in a scrambled order rather than as tidily as in the picture.)

    Together with the earlier slab data-structure figure, this should give a reasonably clear picture of slab alloc.
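
    The randomized freelist can be illustrated with a few lines of userspace C; rand() merely stands in for the kernel's prandom state, and the object count is made up:

    #include <stdio.h>
    #include <stdlib.h>

    #define NUM_OBJS 8                      /* objects per slab in this toy example */

    int main(void)
    {
        unsigned short freelist[NUM_OBJS];  /* freelist_idx_t is a char or short */

        for (int i = 0; i < NUM_OBJS; i++)
            freelist[i] = i;                /* start with 0, 1, ..., num - 1 */

        srand(42);
        for (int i = NUM_OBJS - 1; i > 0; i--) {    /* Fisher-Yates shuffle, as in shuffle_freelist() */
            int j = rand() % (i + 1);
            unsigned short tmp = freelist[i];
            freelist[i] = freelist[j];
            freelist[j] = tmp;
        }

        for (int i = 0; i < NUM_OBJS; i++)
            printf("allocation %d -> object index %u\n", i, freelist[i]);
        return 0;
    }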

    slab free

    kmem_cache_free is the main entry point for releasing a slab object; its call graph looks like this:

    (Figure: slab free)

    The relevant code:

    void ___cache_free(struct kmem_cache *cachep, void *objp,
            unsigned long caller)
    {
        struct array_cache *ac = cpu_cache_get(cachep);

        check_irq_off();
        kmemleak_free_recursive(objp, cachep->flags);
        objp = cache_free_debugcheck(cachep, objp, caller);

        /*
         * Skip calling cache_free_alien() when the platform is not numa.
         * This will avoid cache misses that happen while accessing slabp (which
         * is per page memory reference) to get nodeid. Instead use a global
         * variable to skip the call, which is mostly likely to be present in
         * the cache.
         */
        if (nr_online_nodes > 1 && cache_free_alien(cachep, objp))
            return;

        if (ac->avail < ac->limit) {
            STATS_INC_FREEHIT(cachep);
        } else {
            STATS_INC_FREEMISS(cachep);
            cache_flusharray(cachep, ac); /* the array_cache exceeded its limit and must be flushed */
        }

        if (sk_memalloc_socks()) {
            struct page *page = virt_to_head_page(objp);

            if (unlikely(PageSlabPfmemalloc(page))) {
                cache_free_pfmemalloc(cachep, page, objp);
                return;
            }
        }

        ac->entry[ac->avail++] = objp;
    }

    static void cache_flusharray(struct kmem_cache *cachep, struct array_cache *ac)
    {
        int batchcount;
        struct kmem_cache_node *n;
        int node = numa_mem_id();
        LIST_HEAD(list);

        batchcount = ac->batchcount;

        check_irq_off();
        n = get_node(cachep, node);
        spin_lock(&n->list_lock);
        if (n->shared) { /* prefer returning objects to the node's shared array */
            struct array_cache *shared_array = n->shared;
            int max = shared_array->limit - shared_array->avail;
            if (max) {
                if (batchcount > max)
                    batchcount = max;
                memcpy(&(shared_array->entry[shared_array->avail]),
                       ac->entry, sizeof(void *) * batchcount);
                shared_array->avail += batchcount;
                goto free_done;
            }
        }

        free_block(cachep, ac->entry, batchcount, node, &list); /* otherwise hand them back to the
                                                                 * kmem_cache_node slabs; fully free,
                                                                 * reclaimable pages are collected on 'list' */
    free_done:
    #if STATS
        {
            int i = 0;
            struct page *page;

            list_for_each_entry(page, &n->slabs_free, lru) {
                BUG_ON(page->active);

                i++;
            }
            STATS_SET_FREEABLE(cachep, i);
        }
    #endif
        spin_unlock(&n->list_lock);
        slabs_destroy(cachep, &list); /* give the reclaimable page frames back to the buddy system */
        ac->avail -= batchcount;
        memmove(ac->entry, &(ac->entry[batchcount]), sizeof(void *)*ac->avail);
    }

    Summary

    In short, the basic idea of a slab cache is as follows:

    Using the slab allocator in the kernel involves two main steps:

    1. Create a cache with kmem_cache_create.
    2. Allocate a slab object from that cache with kmem_cache_alloc.

    In more detail:

    1. kmem_cache_create(size, flags) creates a cache object. From size it computes the cache's gfporder, i.e. how many page frames are allocated per slab, and from that the number num of slab objects one slab managed by kmem_cache_node can hold.
    2. When a slab object is requested with kmem_cache_alloc(cachep, flags), the allocator first checks whether cachep->array_caches[curr_cpu] holds a free object; if so it is returned directly, otherwise continue with step 3.
    3. Load batchcount free objects from kmem_cache_node into the array_cache, taking them from partially free slabs first and then from completely free slabs. If no usable slab is left, grow the cache (step 4); otherwise go to step 6.
    4. Request 2^gfporder pages from the buddy system, initialize the slab head and the freelist (the indices of all free objects are scattered randomly into page->freelist, so objects are handed out in a fairly random order), and put the new slab on kmem_cache_node->slabs_free.
    5. Continue loading free objects into the array_cache as in step 3 until batchcount reaches 0, then go to step 6.
    6. Return array_cache->entry[--array_cache->avail].

    Read this together with the slab data-structure figure and you should get a fairly clear overall picture.

    slab coloring:

    Professional Linux Kernel Architecture describes it like this: if data is stored in pages supplied directly by the buddy system, its address always sits near an integer multiple of a power of two. This hurts CPU cache utilization, because with such an address distribution some cache lines are heavily used while others stay almost empty. Slab colouring is supposed to let the slab allocator spread objects out and achieve more uniform cache usage. After reading the implementation I was puzzled: why should adding an offset to the slab objects inside a page even out cache hits? After going through a fair amount of material, my impression was that slab colouring actually achieves rather little here.

    To see why colouring does not obviously even out cache utilization, some background on CPU caches is needed. A CPU cache is divided into fixed-size cache lines, typically 64 bytes each, and cache lines are mapped onto memory addresses. Three mapping schemes are in use: 1. direct mapped; 2. fully associative; 3. set associative. A cache line consists of three parts: valid, tag and data; valid says whether the line holds valid data, tag is used to verify that the line really caches the requested address, and data is the 64-byte payload.

    Take a system with 32-bit physical addresses, a 1 MB cache and 64-byte cache lines as an example:
    The cache address then has 20 bits (2^20 = 1 MB); the low 6 bits locate the byte within a cache line and the upper 14 bits select the cache line. Physical addresses have 32 bits, so 2^12 different memory blocks can map onto the same cache line, and a 12-bit tag is needed to confirm that the cached data really belongs to the requested physical address. For example, 0x000XXXXX and 0x001XXXXX hit the same cache line, and the value stored in the tag decides whether the line currently holds the data of 0x000XXXXX or of 0x001XXXXX.
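
    To make the decomposition concrete, here is a small userspace sketch that splits a 32-bit address into offset, line index and tag for the 1 MB direct-mapped cache assumed above; the addresses are made up:

    #include <stdio.h>
    #include <stdint.h>

    #define LINE_BITS   6                      /* 64-byte cache line */
    #define INDEX_BITS  14                     /* 16384 lines = 1 MB / 64 B */

    int main(void)
    {
        uint32_t addrs[] = { 0x00012340, 0x00112340, 0x00013340 };

        for (int i = 0; i < 3; i++) {
            uint32_t a      = addrs[i];
            uint32_t offset = a & ((1u << LINE_BITS) - 1);
            uint32_t index  = (a >> LINE_BITS) & ((1u << INDEX_BITS) - 1);
            uint32_t tag    = a >> (LINE_BITS + INDEX_BITS);

            printf("addr 0x%08x -> line %5u, offset %2u, tag 0x%03x\n",
                   a, index, offset, tag);
        }
        return 0;    /* 0x00012340 and 0x00112340 land on the same line; only their tags differ */
    }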

    The lookup proceeds as follows:

    How a cache line is selected depends on how the cache is mapped onto physical addresses:

    Direct mapping

    With direct mapping the relation between cache line and memory address is fixed, so some lines can run hot (addresses at certain strides are accessed very frequently; this is common in big-data workloads, e.g. long tails during a hash shuffle).

    Fully associative mapping

    With a fully associative cache, any memory block may be placed in any cache line; because placement is arbitrary, every lookup has to compare the tag of every cache line to find the matching physical address.

    Set-associative mapping

    Set-associative mapping is a compromise between the two: direct mapping between sets, fully associative mapping within a set. It mitigates hot spots to some degree while keeping the number of tag comparisons small.
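
    A similarly minimal sketch of the set-associative case, assuming the same 1 MB, 64-byte-line cache but organized 8-way: the two addresses that collided in the direct-mapped example still fall into the same set, but the set now has 8 ways to hold both of them:

    #include <stdio.h>
    #include <stdint.h>

    #define LINE_BITS 6
    #define WAYS      8
    #define SETS      (16384 / WAYS)              /* 2048 sets -> 11 index bits */

    static uint32_t set_index(uint32_t addr)
    {
        return (addr >> LINE_BITS) % SETS;        /* which set the memory block falls into */
    }

    int main(void)
    {
        /* same set, but up to 8 blocks can coexist; tags are compared within the set only */
        printf("0x00012340 -> set %u, 0x00112340 -> set %u\n",
               set_index(0x00012340u), set_index(0x00112340u));
        return 0;
    }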

    With this rough picture of the CPU hardware cache in mind, how does shifting slab objects within a page by one cache line help avoid contended cache lines? At first glance it does not seem to help much, as the following analysis shows:

    Say slab A gets a page starting at address 0x00000000 and slab B a page starting at 0x00002000, with a 1 MB cache and 64-byte cache lines, i.e. 16384 lines in total. Slab A then maps onto cache lines 0-63 and slab B onto lines 128-191; they never compete for the same lines in the first place, so an offset inside the page makes no difference. And even if two slab pages did map onto the same lines, an in-page offset would not obviously improve anything, would it? Set associativity is what actually resolves such conflicts.

    But look again at the layout of the slab pages hanging off kmem_cache_node: without a colour offset, the cache lines touched by a slab always start at the same fixed position. Ignoring the slab head for a moment, if every slab placed its objects right at the start of the page, and the object size is generally not a multiple of the cache line size, then every page maps onto lines 0-61 or 0-62 or so, and lines 62 and 63 are touched far less often. The colour offset exists precisely to rotate the starting line from slab to slab, so that the first and last cache lines of a page are used with roughly the same probability.
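
    A small sketch of the colouring idea under the same assumptions as the earlier worked example (PAGE_SIZE = 4096, 64-byte cache lines, hypothetical 520-byte objects): each new slab shifts its first object by one more cache line, wrapping after left_over / colour_off colours, roughly what cache_grow_begin does with colour_next:

    #include <stdio.h>

    #define CACHE_LINE 64
    #define PAGE_SIZE  4096
    #define OBJ_SIZE   520                       /* hypothetical object size */

    int main(void)
    {
        unsigned int left_over   = PAGE_SIZE % OBJ_SIZE;      /* 456 bytes of slack */
        unsigned int colour_off  = CACHE_LINE;                /* offset step = one cache line */
        unsigned int colour      = left_over / colour_off;    /* 7 distinct colours */
        unsigned int colour_next = 0;

        for (int slab = 0; slab < 10; slab++) {
            unsigned int offset = colour_next * colour_off;   /* where the first object of this slab starts */

            printf("slab %d: first object at page offset %4u (cache line %u)\n",
                   slab, offset, offset / CACHE_LINE);

            if (++colour_next >= colour)                      /* wrap the colour, like colour_next in the kernel */
                colour_next = 0;
        }
        return 0;
    }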

    Original post: https://blog.csdn.net/m0_74282605/article/details/127819805