Source code based on: Linux 5.4
Sysctl node covered:
/proc/sys/vm/percpu_pagelist_fraction
percpu_pagelist_fraction
========================
This is the fraction of pages at most (high mark pcp->high) in each zone that
are allocated for each per cpu page list. The min value for this is 8. It
means that we don't allow more than 1/8th of pages in each zone to be
allocated in any single per_cpu_pagelist. This entry only changes the value
of hot per cpu pagelists. User can specify a number like 100 to allocate
1/100th of each zone to each per cpu page list.
The batch value of each per cpu pagelist is also updated as a result. It is
set to pcp->high/4. The upper limit of batch is (PAGE_SHIFT * 8)
The initial value is zero. Kernel does not use this value at boot time to set
the high water marks for each per cpu page list. If the user writes '0' to this
sysctl, it will revert to this default behavior.
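As a minimal sketch of how a user-space program might set this value, the helper below writes an integer to a sysctl file. The function name `write_sysctl_int` is hypothetical, and the sketch assumes a kernel that still exposes this node (this article targets 5.4) and a caller with root privileges.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper: write "value\n" to a sysctl file.
 * Returns 0 on success or a negative errno value on failure.
 */
int write_sysctl_int(const char *path, int value)
{
    char buf[32];
    int len = snprintf(buf, sizeof(buf), "%d\n", value);
    int fd = open(path, O_WRONLY);
    ssize_t n;
    int err = 0;

    if (fd < 0)
        return -errno;

    n = write(fd, buf, (size_t)len);
    if (n < 0)
        err = -errno;   /* e.g. EINVAL when the kernel rejects the value */
    else if (n != len)
        err = -EIO;     /* short write, treated as an error in this sketch */

    close(fd);
    return err;
}
```

Calling `write_sysctl_int("/proc/sys/vm/percpu_pagelist_fraction", 100)` would request 1/100th of each zone per pcp list, and writing 0 would restore the boot-time behavior described above.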
mm/page_alloc.c:

    int percpu_pagelist_fraction;

kernel/sysctl.c:

    {
        .procname     = "percpu_pagelist_fraction",
        .data         = &percpu_pagelist_fraction,
        .maxlen       = sizeof(percpu_pagelist_fraction),
        .mode         = 0644,
        .proc_handler = percpu_pagelist_fraction_sysctl_handler,
        .extra1       = SYSCTL_ZERO,
    },
mm/page_alloc.c:

    int percpu_pagelist_fraction_sysctl_handler(struct ctl_table *table, int write,
            void __user *buffer, size_t *length, loff_t *ppos)
    {
        struct zone *zone;
        int old_percpu_pagelist_fraction;
        int ret;

        mutex_lock(&pcp_batch_high_lock);
        old_percpu_pagelist_fraction = percpu_pagelist_fraction;

        ret = proc_dointvec_minmax(table, write, buffer, length, ppos);
        if (!write || ret < 0)
            goto out;

        /* Sanity checking to avoid pcp imbalance */
        if (percpu_pagelist_fraction &&
            percpu_pagelist_fraction < MIN_PERCPU_PAGELIST_FRACTION) {
            percpu_pagelist_fraction = old_percpu_pagelist_fraction;
            ret = -EINVAL;
            goto out;
        }

        /* No change? */
        if (percpu_pagelist_fraction == old_percpu_pagelist_fraction)
            goto out;

        for_each_populated_zone(zone) {
            unsigned int cpu;

            for_each_possible_cpu(cpu)
                pageset_set_high_and_batch(zone,
                        per_cpu_ptr(zone->pageset, cpu));
        }
    out:
        mutex_unlock(&pcp_batch_high_lock);
        return ret;
    }
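The acceptance rules in the handler can be summarized with a small user-space model. This is only a sketch: `model_fraction_write` is a hypothetical name, the value 8 for `MIN_PERCPU_PAGELIST_FRACTION` follows the documented minimum, and the model skips the locking and the per-zone update loop.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Assumed to match the documented minimum of 8. */
#define MIN_PERCPU_PAGELIST_FRACTION 8

/*
 * Hypothetical model of the handler's validation:
 *  - negative values are rejected (the table sets extra1 = SYSCTL_ZERO),
 *  - 0 is accepted and means "use the default behavior",
 *  - nonzero values below the minimum are rejected and the old value kept.
 * Returns 0 when the value is accepted, -EINVAL otherwise.
 */
int model_fraction_write(int *current, int requested)
{
    assert(current != NULL);

    if (requested < 0)
        return -EINVAL;
    if (requested && requested < MIN_PERCPU_PAGELIST_FRACTION)
        return -EINVAL;

    *current = requested;
    return 0;
}
```

In this model, writing 7 leaves the previous value untouched and returns `-EINVAL`, while 0, 8, and 100 are all accepted.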
mm/page_alloc.c:

    static void pageset_set_high_and_batch(struct zone *zone,
            struct per_cpu_pageset *pcp)
    {
        if (percpu_pagelist_fraction)
            pageset_set_high(pcp,
                    (zone_managed_pages(zone) /
                     percpu_pagelist_fraction));
        else
            pageset_set_batch(pcp, zone_batchsize(zone));
    }
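The variable percpu_pagelist_fraction is backed by the node /proc/sys/vm/percpu_pagelist_fraction and defaults to 0. While it is 0, pageset_set_batch() is called with zone_batchsize(zone) instead. When percpu_pagelist_fraction is set to a nonzero value, the pcp high and batch values are set through pageset_set_high():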
mm/page_alloc.c:

    static void pageset_set_high(struct per_cpu_pageset *p,
            unsigned long high)
    {
        unsigned long batch = max(1UL, high / 4);
        if ((high / 4) > (PAGE_SHIFT * 8))
            batch = PAGE_SHIFT * 8;

        pageset_update(&p->pcp, high, batch);
    }
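To make the arithmetic concrete, the following user-space sketch reproduces the high and batch calculation for a nonzero fraction. The names `model_pcp_limits` and `MODEL_PAGE_SHIFT` are hypothetical, and the sketch assumes 4 KiB pages (PAGE_SHIFT of 12), which puts the batch cap at 96.

```c
#include <assert.h>

/* Assumption: 4 KiB pages, so PAGE_SHIFT is 12 and the batch cap is 96. */
#define MODEL_PAGE_SHIFT 12UL

struct pcp_limits {
    unsigned long high;
    unsigned long batch;
};

/*
 * Hypothetical model of the calculation for a nonzero fraction:
 *   high  = managed_pages / fraction
 *   batch = high / 4, at least 1, at most PAGE_SHIFT * 8
 */
struct pcp_limits model_pcp_limits(unsigned long managed_pages,
                                   unsigned long fraction)
{
    struct pcp_limits l;
    unsigned long cap = MODEL_PAGE_SHIFT * 8;

    assert(fraction != 0);  /* 0 selects the default path, not modeled here */

    l.high = managed_pages / fraction;
    l.batch = l.high / 4;
    if (l.batch < 1)
        l.batch = 1;
    if (l.batch > cap)
        l.batch = cap;
    return l;
}
```

Under these assumptions, a zone with 262144 managed pages (1 GiB) and a fraction of 100 gives high = 2621 and batch = 96, while a small zone of 1000 pages with the same fraction gives high = 10 and batch = 2.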
Finally, the live state of the pcp lists (count, high, and batch for each zone and CPU) can be read from /proc/zoneinfo.
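As a rough sketch of how those values could be collected programmatically, the parser below scans zoneinfo-style text. It assumes the layout in which each zone starts with a `Node N, zone NAME` line and lists `cpu:`, `count:`, `high:`, and `batch:` lines under `pagesets`. The struct and function names are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical record for one per-CPU pageset entry. */
struct pcp_entry {
    int node;
    char zone[32];
    int cpu;
    long count, high, batch;   /* -1 if the field was not seen */
};

/*
 * Parse zoneinfo-style text from fp and fill up to max entries.
 * Returns the number of entries found.
 */
int read_pcp_entries(FILE *fp, struct pcp_entry *out, int max)
{
    char line[256], zone[32] = "", zn[32];
    int node = -1, n = 0, have_cpu = 0, nd, cpu;
    long v;

    while (fgets(line, sizeof(line), fp)) {
        if (sscanf(line, "Node %d, zone %31s", &nd, zn) == 2) {
            node = nd;              /* a new zone section begins */
            strcpy(zone, zn);
            have_cpu = 0;
        } else if (sscanf(line, " cpu: %d", &cpu) == 1) {
            if (n == max)
                break;
            out[n].node = node;
            strcpy(out[n].zone, zone);
            out[n].cpu = cpu;
            out[n].count = out[n].high = out[n].batch = -1;
            n++;
            have_cpu = 1;
        } else if (have_cpu && sscanf(line, " count: %ld", &v) == 1) {
            out[n - 1].count = v;
        } else if (have_cpu && sscanf(line, " high: %ld", &v) == 1) {
            /* the colon keeps the zone watermark "high" line from matching */
            out[n - 1].high = v;
        } else if (have_cpu && sscanf(line, " batch: %ld", &v) == 1) {
            out[n - 1].batch = v;
        }
    }
    return n;
}
```

Passing a stream opened with `fopen("/proc/zoneinfo", "r")` and printing each entry before and after changing the sysctl is one way to observe the effect on high and batch.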

For more on the pcp lists and their memory usage, see: "Buddy System Allocator: Fast Allocation (2)".