• Copy-on-Write in hugetlbfs


    This analysis is based on Linux kernel 4.19.195.

    I recently ran into a question I had never considered before:
    how does hugetlbfs handle copy-on-write (COW) for its huge pages?

    My first guess was that, following the usual COW principle, the kernel would allocate a new huge page for whichever of the parent or child triggers the first write. But is that really what happens? And if it is, what happens when the value we initially wrote to /sys/kernel/mm/hugepages/hugepages-xxxkB/nr_hugepages turns out to be too small?

    Conclusions first, then the analysis:

    1. With a MAP_SHARED mapping there is no COW: at fork() time the parent's page table entries are copied into the child unchanged, so after fork() the child can access that memory without even triggering a page fault.
    2. Without MAP_SHARED (i.e. MAP_PRIVATE), the parent keeps using the originally allocated huge page, and the child initially maps that same huge page too. As soon as either the parent or the child writes to it, COW is triggered (exactly as for normal pages): if a new huge page can be allocated, one process ends up with the new huge page while the other keeps the old one. A user-space sketch of both cases follows this list.
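
    Before diving into the kernel code, here is a minimal user-space sketch of the two cases. It is illustrative only; it assumes 2 MB huge pages and that the pool configured via nr_hugepages already holds at least two pages:

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #define HPAGE_SIZE (2UL << 20)	/* assume 2 MB huge pages */

    int main(void)
    {
    	/* Toggle MAP_PRIVATE <-> MAP_SHARED to compare case 2 and case 1. */
    	char *p = mmap(NULL, HPAGE_SIZE, PROT_READ | PROT_WRITE,
    		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    	if (p == MAP_FAILED) {
    		perror("mmap");
    		return 1;
    	}
    	strcpy(p, "parent data");

    	if (fork() == 0) {
    		strcpy(p, "child data");	/* with MAP_PRIVATE this write COWs */
    		_exit(0);
    	}
    	wait(NULL);
    	/* MAP_PRIVATE prints "parent data"; MAP_SHARED prints "child data". */
    	printf("parent sees: %s\n", p);
    	munmap(p, HPAGE_SIZE);
    	return 0;
    }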

    Let's analyze case 1 first, following the code.
    fork() goes through the following call chain:
    _do_fork()->copy_process()->copy_mm()->dup_mm()->dup_mmap()->copy_page_range()->copy_hugetlb_page_range()
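
    The last hop happens because copy_page_range() special-cases hugetlb VMAs near its top (excerpt from mm/memory.c, lightly abridged):

    	if (is_vm_hugetlb_page(vma))
    		return copy_hugetlb_page_range(dst_mm, src_mm, vma);

    copy_hugetlb_page_range() then does the real work: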

    int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
    			    struct vm_area_struct *vma)
    {
    	pte_t *src_pte, *dst_pte, entry, dst_entry;
    	struct page *ptepage;
    	unsigned long addr;
    	int cow;
    	struct hstate *h = hstate_vma(vma);
    	unsigned long sz = huge_page_size(h);
    	unsigned long mmun_start;	/* For mmu_notifiers */
    	unsigned long mmun_end;		/* For mmu_notifiers */
    	int ret = 0;
    
    	cow = (vma->vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
    
    	mmun_start = vma->vm_start;
    	mmun_end = vma->vm_end;
    	if (cow)
    		mmu_notifier_invalidate_range_start(src, mmun_start, mmun_end);
    
    	for (addr = vma->vm_start; addr < vma->vm_end; addr += sz) {
    		spinlock_t *src_ptl, *dst_ptl;
    		src_pte = huge_pte_offset(src, addr, sz);
    		if (!src_pte)
    			continue;
    		dst_pte = huge_pte_alloc(dst, addr, sz);
    		if (!dst_pte) {
    			ret = -ENOMEM;
    			break;
    		}
    
    		/*
    		 * If the pagetables are shared don't copy or take references.
    		 * dst_pte == src_pte is the common case of src/dest sharing.
    		 *
    		 * However, src could have 'unshared' and dst shares with
    		 * another vma.  If dst_pte !none, this implies sharing.
    		 * Check here before taking page table lock, and once again
    		 * after taking the lock below.
    		 */
    		dst_entry = huge_ptep_get(dst_pte);
    		if ((dst_pte == src_pte) || !huge_pte_none(dst_entry))
    			continue;
    
    		dst_ptl = huge_pte_lock(h, dst, dst_pte);
    		src_ptl = huge_pte_lockptr(h, src, src_pte);
    		spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
    		entry = huge_ptep_get(src_pte);
    		dst_entry = huge_ptep_get(dst_pte);
    		if (huge_pte_none(entry) || !huge_pte_none(dst_entry)) {
    			/*
    			 * Skip if src entry none.  Also, skip in the
    			 * unlikely case dst entry !none as this implies
    			 * sharing with another vma.
    			 */
    			;
    		} else if (unlikely(is_hugetlb_entry_migration(entry) ||
    				    is_hugetlb_entry_hwpoisoned(entry))) {
    			swp_entry_t swp_entry = pte_to_swp_entry(entry);
    
    			if (is_write_migration_entry(swp_entry) && cow) {
    				/*
    				 * COW mappings require pages in both
    				 * parent and child to be set to read.
    				 */
    				make_migration_entry_read(&swp_entry);
    				entry = swp_entry_to_pte(swp_entry);
    				set_huge_swap_pte_at(src, addr, src_pte,
    						     entry, sz);
    			}
    			set_huge_swap_pte_at(dst, addr, dst_pte, entry, sz);
    		} else {
    			if (cow) {
    				/*
    				 * No need to notify as we are downgrading page
    				 * table protection not changing it to point
    				 * to a new page.
    				 *
    				 * See Documentation/vm/mmu_notifier.rst
    				 */
    				huge_ptep_set_wrprotect(src, addr, src_pte); // write-protect the last-level page table entry mapping the huge page
    			}
    			entry = huge_ptep_get(src_pte);
    			ptepage = pte_page(entry);
    			get_page(ptepage);
    			page_dup_rmap(ptepage, true);
    			set_huge_pte_at(dst, addr, dst_pte, entry);
    			hugetlb_count_add(pages_per_huge_page(h), dst);
    		}
    		spin_unlock(src_ptl);
    		spin_unlock(dst_ptl);
    	}
    
    	if (cow)
    		mmu_notifier_invalidate_range_end(src, mmun_start, mmun_end);
    
    	return ret;
    }
    
    As the code shows, at fork() time the kernel decides whether COW will be needed later based on the mapping's flags: cow is true only for a writable private mapping (VM_MAYWRITE set and VM_SHARED clear). If COW is needed, huge_ptep_set_wrprotect() first marks the parent's page table entry write-protected, and that entry is then copied into the child's page table, so both parent and child end up with write-protected entries for this range. If COW is not needed, the entry is copied as-is, so both parent and child keep readable/writable entries, which gives us the conclusion for case 1.
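
    For concreteness, here is the cow condition worked out by hand for the two mmap variants (VM_SHARED and VM_MAYWRITE are the vm_flags bits that matter here):

    /*
     * cow = (vma->vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
     *
     * mmap(..., PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_HUGETLB, ...):
     *	vm_flags contains VM_MAYWRITE but not VM_SHARED
     *	masked value == VM_MAYWRITE                  => cow = 1
     *
     * mmap(..., PROT_READ|PROT_WRITE, MAP_SHARED|MAP_HUGETLB, ...):
     *	vm_flags contains both VM_SHARED and VM_MAYWRITE
     *	masked value == (VM_SHARED | VM_MAYWRITE)    => cow = 0
     */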

    Now let's analyze case 2.
    From the analysis of case 1 we know that for a mapping without MAP_SHARED, the page table entries of both parent and child are write-protected, so a write from either side triggers COW. The write fault is handled by hugetlb_fault(), which calls hugetlb_cow() when the faulting PTE is not writable.
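
    In outline (paraphrased from hugetlb_fault() in mm/hugetlb.c; locking and error handling omitted):

    	if (flags & FAULT_FLAG_WRITE) {
    		if (!huge_pte_write(entry)) {
    			ret = hugetlb_cow(mm, vma, haddr, ptep,
    					  pagecache_page, ptl);
    			goto out_put_page;
    		}
    		entry = huge_pte_mkdirty(entry);
    	}

    hugetlb_cow() in full: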

    /*
     * Hugetlb_cow() should be called with page lock of the original hugepage held.
     * Called with hugetlb_instantiation_mutex held and pte_page locked so we
     * cannot race with other handlers or page migration.
     * Keep the pte_same checks anyway to make transition from the mutex easier.
     */
    static vm_fault_t hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
    		       unsigned long address, pte_t *ptep,
    		       struct page *pagecache_page, spinlock_t *ptl)
    {
    	pte_t pte;
    	struct hstate *h = hstate_vma(vma);
    	struct page *old_page, *new_page;
    	int outside_reserve = 0;
    	vm_fault_t ret = 0;
    	unsigned long mmun_start;	/* For mmu_notifiers */
    	unsigned long mmun_end;		/* For mmu_notifiers */
    	unsigned long haddr = address & huge_page_mask(h);
    
    	pte = huge_ptep_get(ptep);
    	old_page = pte_page(pte);
    
    retry_avoidcopy:
    	/* If no-one else is actually using this page, avoid the copy
    	 * and just make the page writable */
    	if (page_mapcount(old_page) == 1 && PageAnon(old_page)) { // anonymous page with only one mapping left: just make the PTE writable, no copy needed
    		page_move_anon_rmap(old_page, vma);
    		set_huge_ptep_writable(vma, haddr, ptep);
    		return 0;
    	}
    
    	/*
    	 * If the process that created a MAP_PRIVATE mapping is about to
    	 * perform a COW due to a shared page count, attempt to satisfy
    	 * the allocation without using the existing reserves. The pagecache
    	 * page is used to determine if the reserve at this address was
    	 * consumed or not. If reserves were used, a partial faulted mapping
    	 * at the time of fork() could consume its reserves on COW instead
    	 * of the full address range.
    	 */
    	if (is_vma_resv_set(vma, HPAGE_RESV_OWNER) &&
    			old_page != pagecache_page)
    		outside_reserve = 1;
    
    	get_page(old_page);
    
    	/*
    	 * Drop page table lock as buddy allocator may be called. It will
    	 * be acquired again before returning to the caller, as expected.
    	 */
    	spin_unlock(ptl);
    	new_page = alloc_huge_page(vma, haddr, outside_reserve); // allocate a new huge page
    
    	if (IS_ERR(new_page)) { // allocation failed
    		/*
    		 * If a process owning a MAP_PRIVATE mapping fails to COW,
    		 * it is due to references held by a child and an insufficient
    		 * huge page pool. To guarantee the original mappers
    		 * reliability, unmap the page from child processes. The child
    		 * may get SIGKILLed if it later faults.
    		 */
    		if (outside_reserve) {
    			put_page(old_page);
    			BUG_ON(huge_pte_none(pte));
    			unmap_ref_private(mm, vma, old_page, haddr);
    			BUG_ON(huge_pte_none(pte));
    			spin_lock(ptl);
    			ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
    			if (likely(ptep &&
    				   pte_same(huge_ptep_get(ptep), pte)))
    				goto retry_avoidcopy;
    			/*
    			 * race occurs while re-acquiring page table
    			 * lock, and our job is done.
    			 */
    			return 0;
    		}
    
    		ret = vmf_error(PTR_ERR(new_page));
    		goto out_release_old;
    	}
    
    	/*
    	 * When the original hugepage is shared one, it does not have
    	 * anon_vma prepared.
    	 */
    	if (unlikely(anon_vma_prepare(vma))) {
    		ret = VM_FAULT_OOM;
    		goto out_release_all;
    	}
    
    	copy_user_huge_page(new_page, old_page, address, vma, // copy the old page's contents into the new page
    			    pages_per_huge_page(h));
    	__SetPageUptodate(new_page);
    
    	mmun_start = haddr;
    	mmun_end = mmun_start + huge_page_size(h);
    	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
    
    	/*
    	 * Retake the page table lock to check for racing updates
    	 * before the page tables are altered
    	 */
    	spin_lock(ptl);
    	ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
    	if (likely(ptep && pte_same(huge_ptep_get(ptep), pte))) {
    		ClearPagePrivate(new_page);
    
    		/* Break COW */
    		huge_ptep_clear_flush(vma, haddr, ptep);
    		mmu_notifier_invalidate_range(mm, mmun_start, mmun_end);
    		set_huge_pte_at(mm, haddr, ptep, // point the PTE at the new page
    				make_huge_pte(vma, new_page, 1));
    		page_remove_rmap(old_page, true);
    		hugepage_add_new_anon_rmap(new_page, vma, haddr);
    		set_page_huge_active(new_page);
    		/* Make the old page be freed below */
    		new_page = old_page;
    	}
    	spin_unlock(ptl);
    	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
    out_release_all:
    	restore_reserve_on_error(h, vma, haddr, new_page);
    	put_page(new_page);
    out_release_old:
    	put_page(old_page);
    
    	spin_lock(ptl); /* Caller expects lock to be held */
    	return ret;
    }
    
    The first process to write to the page after fork() takes a page fault and ends up in hugetlb_cow(). The function allocates a new huge page; on success, copy_user_huge_page() copies the contents of the old huge page into the new one, the new page is wired up (PTE set, rmap added, page marked up to date and active), and the function returns.
    The second process to write to the page after fork() also takes a page fault, but note the code at the retry_avoidcopy label: it checks the page's mapcount first. Because the first writer's COW already called page_remove_rmap() on the old page, only one process still maps it, so it suffices to simply drop the write protection (set_huge_ptep_writable()) and return; no copy and no new page are needed.
    If allocating the new huge page fails, there are two cases:
    If the faulting process is the one that created the private mapping (HPAGE_RESV_OWNER), the page is unmapped from all child processes, and HPAGE_RESV_UNMAPPED is set in the vm_private_data of the children's virtual memory areas so that a child will be killed if it later faults on the page. This also answers the opening question: with MAP_PRIVATE and an undersized nr_hugepages pool, COW can fail at write time, and the kernel sacrifices the children to keep the original mapper reliable.
    If the faulting process is not the one that created the private mapping, an error is simply returned.
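
    To see the failure case in practice, here is a hedged experiment (assumptions: root privileges, 2 MB default huge pages, and the sysfs path below; whether the child is actually killed can depend on timing and reservation accounting, so treat this as a sketch rather than a guarantee):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #define HPAGE_SIZE (2UL << 20)

    int main(void)
    {
    	/* Shrink the pool to a single 2 MB huge page (needs root). */
    	int fd = open("/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages",
    		      O_WRONLY);
    	if (fd < 0 || write(fd, "1", 1) != 1) {
    		perror("nr_hugepages");
    		return 1;
    	}
    	close(fd);

    	char *p = mmap(NULL, HPAGE_SIZE, PROT_READ | PROT_WRITE,
    		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    	if (p == MAP_FAILED) {
    		perror("mmap");
    		return 1;
    	}
    	p[0] = 'P';	/* the parent's touch consumes the only huge page */

    	pid_t pid = fork();
    	if (pid == 0) {
    		sleep(1);		/* let the parent COW first */
    		char c = p[0];		/* if the parent's failed COW unmapped
    					   us, this access is fatal */
    		(void)c;
    		_exit(0);
    	}
    	p[0] = 'X';	/* parent's COW finds no free huge page; the kernel
    			   unmaps the child so the parent can keep its page */
    	int status;
    	waitpid(pid, &status, 0);
    	if (WIFSIGNALED(status))
    		printf("child killed by signal %d\n", WTERMSIG(status));
    	else
    		printf("child exited normally\n");
    	return 0;
    }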

  • Original article: https://blog.csdn.net/kaka__55/article/details/125470535