• Linux kernel parameter: percpu_pagelist_fraction


    Source based on: Linux 5.4

    Node covered:

    /proc/sys/vm/percpu_pagelist_fraction

    0. Official description

    percpu_pagelist_fraction
    ========================

    This is the fraction of pages at most (high mark pcp->high) in each zone that
    are allocated for each per cpu page list.  The min value for this is 8.  It
    means that we don't allow more than 1/8th of pages in each zone to be
    allocated in any single per_cpu_pagelist.  This entry only changes the value
    of hot per cpu pagelists.  User can specify a number like 100 to allocate
    1/100th of each zone to each per cpu page list.

    The batch value of each per cpu pagelist is also updated as a result.  It is
    set to pcp->high/4.  The upper limit of batch is (PAGE_SHIFT * 8)

    The initial value is zero.  Kernel does not use this value at boot time to set
    the high water marks for each per cpu page list.  If the user writes '0' to this
    sysctl, it will revert to this default behavior.

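    • This node controls the high value of the pcp list; high is the maximum number of pages each pcp list may hold, and once a list holds more than high pages, the excess pages are freed back to the buddy allocator;
    • The minimum value of this node is 8, i.e. no single pcp list may hold more than 1/8 of a zone's pages;
    • When this node is set, the batch value of the pcp list is derived from it as well: batch is normally high / 4, capped at PAGE_SHIFT * 8;
    • The initial value is 0, meaning high and batch are not computed from this node but are set by the kernel's own default policy;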
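To check which of the two modes a system is in, the node can be read like any other integer sysctl file. The following is a minimal userspace sketch; `read_int_sysctl()` is a hypothetical helper written for this example, not a kernel or libc API.

```c
#include <stdio.h>

/*
 * Read a single integer from a sysctl-style file.
 * Returns 0 on success and stores the value in *out;
 * returns -1 if the file cannot be opened or parsed.
 * Hypothetical helper for illustration only.
 */
int read_int_sysctl(const char *path, int *out)
{
    FILE *fp = fopen(path, "r");
    int ok;

    if (fp == NULL)
        return -1;
    ok = (fscanf(fp, "%d", out) == 1);
    fclose(fp);
    return ok ? 0 : -1;
}
```

Calling it with "/proc/sys/vm/percpu_pagelist_fraction" should yield 0 while the default policy is active and a value of 8 or more after an administrator has set a fraction. On a kernel that does not expose the node, the helper returns -1.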

    1. Source code analysis

    mm/page_alloc.c:

        int percpu_pagelist_fraction;

    kernel/sysctl.c:

        {
            .procname     = "percpu_pagelist_fraction",
            .data         = &percpu_pagelist_fraction,
            .maxlen       = sizeof(percpu_pagelist_fraction),
            .mode         = 0644,
            .proc_handler = percpu_pagelist_fraction_sysctl_handler,
            .extra1       = SYSCTL_ZERO,
        },

    mm/page_alloc.c:

        int percpu_pagelist_fraction_sysctl_handler(struct ctl_table *table, int write,
                void __user *buffer, size_t *length, loff_t *ppos)
        {
            struct zone *zone;
            int old_percpu_pagelist_fraction;
            int ret;

            mutex_lock(&pcp_batch_high_lock);
            old_percpu_pagelist_fraction = percpu_pagelist_fraction;

            ret = proc_dointvec_minmax(table, write, buffer, length, ppos);
            if (!write || ret < 0)
                goto out;

            /* Sanity checking to avoid pcp imbalance */
            if (percpu_pagelist_fraction &&
                percpu_pagelist_fraction < MIN_PERCPU_PAGELIST_FRACTION) {
                percpu_pagelist_fraction = old_percpu_pagelist_fraction;
                ret = -EINVAL;
                goto out;
            }

            /* No change? */
            if (percpu_pagelist_fraction == old_percpu_pagelist_fraction)
                goto out;

            for_each_populated_zone(zone) {
                unsigned int cpu;

                for_each_possible_cpu(cpu)
                    pageset_set_high_and_batch(zone,
                            per_cpu_ptr(zone->pageset, cpu));
            }
        out:
            mutex_unlock(&pcp_batch_high_lock);
            return ret;
        }

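    • If the written value is non-zero and smaller than MIN_PERCPU_PAGELIST_FRACTION (defined as 8 in the code), the previous fraction value is restored and -EINVAL is returned;
    • If the written value equals the previous fraction, the handler returns immediately;
    • If the written value is valid, pageset_set_high_and_batch() sets the high and batch values of the pcp list for every possible CPU in every populated zone;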
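The three outcomes above can be modelled in a few lines of userspace C. This is only a sketch of the decision logic, not kernel code: `model_fraction_write()` and its return convention (1 meaning "high and batch would be recomputed") are assumptions made for this example, and locking and the zone/CPU walk are left out.

```c
#include <errno.h>

#define MIN_PERCPU_PAGELIST_FRACTION 8  /* same value as in mm/page_alloc.c */

/*
 * Userspace model of the handler's checks (not kernel code).
 * *fraction holds the current setting; requested is the value just written.
 * Returns -EINVAL for a rejected value, 0 when nothing changes,
 * and 1 when the per-CPU high/batch values would be recomputed.
 */
int model_fraction_write(int *fraction, int requested)
{
    if (requested && requested < MIN_PERCPU_PAGELIST_FRACTION)
        return -EINVAL;      /* the old value is kept */
    if (requested == *fraction)
        return 0;            /* no change, nothing to do */
    *fraction = requested;
    return 1;                /* the kernel would walk zones and CPUs here */
}
```

Starting from the default of 0, writing 4 is rejected and leaves the setting at 0, writing 8 is accepted, and writing 0 afterwards switches back to the default policy.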
    mm/page_alloc.c:

        static void pageset_set_high_and_batch(struct zone *zone,
                struct per_cpu_pageset *pcp)
        {
            if (percpu_pagelist_fraction)
                pageset_set_high(pcp,
                    (zone_managed_pages(zone) /
                        percpu_pagelist_fraction));
            else
                pageset_set_batch(pcp, zone_batchsize(zone));
        }

    The variable percpu_pagelist_fraction is backed by the node /proc/sys/vm/percpu_pagelist_fraction and defaults to 0. When percpu_pagelist_fraction is non-zero, the pcp high and batch values are set through pageset_set_high():

    mm/page_alloc.c:

        static void pageset_set_high(struct per_cpu_pageset *p,
                unsigned long high)
        {
            unsigned long batch = max(1UL, high / 4);

            if ((high / 4) > (PAGE_SHIFT * 8))
                batch = PAGE_SHIFT * 8;

            pageset_update(&p->pcp, high, batch);
        }

    • The high argument is computed as managed_pages / percpu_pagelist_fraction;
    • batch is high / 4 (at least 1), but may not exceed PAGE_SHIFT * 8; with 4 KB pages (PAGE_SHIFT = 12), batch is therefore at most 12 * 8 = 96;
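To make the arithmetic concrete, the calculation can be reproduced outside the kernel. The sketch below mirrors the division and the clamping shown in pageset_set_high(); `struct pcp_limits` and `model_pcp_limits()` are hypothetical names introduced for this example, and the caller is assumed to pass a non-zero fraction.

```c
/* Hypothetical result type for this sketch. */
struct pcp_limits {
    unsigned long high;
    unsigned long batch;
};

/*
 * Model of pageset_set_high() fed by managed_pages / fraction.
 * fraction must be non-zero; page_shift is 12 for 4 KB pages.
 */
struct pcp_limits model_pcp_limits(unsigned long managed_pages,
                                   unsigned long fraction,
                                   unsigned long page_shift)
{
    struct pcp_limits l;
    unsigned long cap = page_shift * 8;   /* upper limit of batch */

    l.high = managed_pages / fraction;
    l.batch = l.high / 4;
    if (l.batch < 1)
        l.batch = 1;                      /* batch is never below 1 */
    if (l.batch > cap)
        l.batch = cap;                    /* clamp to PAGE_SHIFT * 8 */
    return l;
}
```

For a zone with 262144 managed pages (1 GB of 4 KB pages) and a fraction of 100, this gives high = 2621 and batch = 96, because 2621 / 4 = 655 exceeds the cap. For a small zone with 1000 managed pages and a fraction of 8, it gives high = 125 and batch = 31.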

    Finally, the live state of the pcp lists can be inspected through /proc/zoneinfo.
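The per-CPU values can also be pulled out of that file programmatically. The sketch below assumes each pageset is printed as a "cpu: N" line followed by "count:", "high:" and "batch:" lines; this layout should be verified against the kernel in use. `struct pcp_entry` and `parse_pcp_entries()` are hypothetical names for this example.

```c
#include <stdio.h>

/* Hypothetical record for one per-CPU pageset entry. */
struct pcp_entry {
    int cpu;
    int count;
    int high;
    int batch;
};

/*
 * Scan zoneinfo-style text and collect up to max entries.
 * Assumes each pageset is printed as "cpu: N" followed by
 * "count:", "high:" and "batch:" lines (with the colon).
 * Returns the number of complete entries stored in out.
 */
int parse_pcp_entries(FILE *fp, struct pcp_entry *out, int max)
{
    char line[256];
    struct pcp_entry cur = { -1, 0, 0, 0 };
    int n = 0, v;

    while (n < max && fgets(line, sizeof(line), fp) != NULL) {
        if (sscanf(line, " cpu: %d", &v) == 1) {
            cur.cpu = v;
        } else if (sscanf(line, " count: %d", &v) == 1) {
            cur.count = v;
        } else if (sscanf(line, " high: %d", &v) == 1) {
            cur.high = v;
        } else if (sscanf(line, " batch: %d", &v) == 1 && cur.cpu >= 0) {
            cur.batch = v;
            out[n++] = cur;   /* batch closes one entry */
            cur.cpu = -1;
        }
    }
    return n;
}
```

Passing a stream opened on /proc/zoneinfo returns the entries in file order, zone by zone. The zone watermark line that also starts with "high" carries no colon, so the "high:" pattern does not match it.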

    For more on pcp lists and their memory usage, see the article "buddy 系统分配器之快速分配(2)" (Buddy system allocator: fast allocation (2)).

  • Original article: https://blog.csdn.net/jingerppp/article/details/126384772