arm64 pmap: Add ATTR_CONTIGUOUS support [Part 2]
Closed, Public

Authored by alc on Apr 1 2024, 5:26 AM.

Details

Summary

Create ATTR_CONTIGUOUS mappings in pmap_enter_object().

On a Cortex-A72-based EC2 a1.4xlarge VM, the make -j16 buildworld times for a -NODEBUG kernel and MALLOC_PRODUCTION user space drop from 101 minutes to 94 minutes. The read-only data and text sections of large (2 MB+) executables, e.g., clang, are now mapped using 64 KB pages. I do, however, want to point out that a good portion of the reduction in buildworld time comes from performing fewer icache flushes when creating executable mappings.
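In outline, creating an L3C mapping means filling an aligned run of L3C_ENTRIES L3 entries, backed by physically contiguous pages, with ATTR_CONTIGUOUS set in each. A minimal sketch, not the committed code ("attrs" stands in for the access and memory attributes that the real code computes):

	/*
	 * Sketch only: install L3C_ENTRIES contiguous L3 entries with
	 * ATTR_CONTIGUOUS set, so the MMU may cache the whole run in a
	 * single TLB entry.
	 */
	for (i = 0; i < L3C_ENTRIES; i++) {
		pmap_store(&l3p[i], PHYS_TO_PTE(pa) | ATTR_CONTIGUOUS |
		    attrs | L3_PAGE);
		pa += L3_SIZE;
	}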

Also included is a BTI-related bug fix to pmap_enter_l2().
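(The summary does not spell out the bug itself. For orientation only, the whole-range BTI check used for L3C mappings, quoted later in this review, has this shape, with pmap_enter_l2() needing the analogue over its 2 MB range:

	if (!pmap_bti_same(pmap, va, va + L2_SIZE))
		return (KERN_PROTECTION_FAILURE);

This is the shape of the related check, not the fix from the diff.)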

Test Plan

Here are some stats after make -j8 buildworld from a GENERIC kernel configured with a 16 KB base page size on an EC2 m7g.2xlarge.

23601.828u 665.253s 53:45.06 752.4%     89071+1326k 152492+35485io 133520pf+0w
Mon Apr  1 19:56:39 UTC 2024
vm.pmap.l3c.copies: 0
vm.pmap.l3c.protects: 0
vm.pmap.l3c.removes: 0
vm.pmap.l3c.mappings: 104719
vm.pmap.l3c.demotions: 20
vm.pmap.l2.fills: 3
vm.pmap.l2.promotions: 1859
vm.pmap.l2.p_failures: 1633
vm.pmap.l2.mappings: 187
vm.pmap.l2.demotions: 1083
vm.pmap.superpages_enabled: 1
vm.pmap.vmid.epoch: 0
vm.pmap.vmid.next: 0
vm.pmap.vmid.bits: 0
vm.pmap.asid.epoch: 0
vm.pmap.asid.next: 64341
vm.pmap.asid.bits: 16
vm.reserv.reclaimed: 0
vm.reserv.partpopq: 
DOMAIN    LEVEL     SIZE  NUMBER

     0,      -1, 4793472K,    151

vm.reserv.fullpop: 9
vm.reserv.freed: 378421
vm.reserv.broken: 4

The L3C size is 2 MB, and the number of L3C mappings is very close to the number of L2 mappings on a kernel configured with a 4 KB base page size, which is what I would expect.
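For reference, the contiguous-hint geometry behind these numbers is architecturally fixed:

	/*
	 *   4 KB granule:   16 contiguous L3 entries ->  16 *  4 KB = 64 KB L3C
	 *  16 KB granule:  128 contiguous L3 entries -> 128 * 16 KB =  2 MB L3C
	 *
	 * So an L3C mapping on a 16 KB-granule kernel covers the same 2 MB
	 * as an L2 (superpage) mapping on a 4 KB-granule kernel.
	 */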

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

alc requested review of this revision. Apr 1 2024, 5:26 AM
sys/arm64/arm64/pmap.c
5460–5462 ↗(On Diff #136398)

As part of a future change that updates pagesizes[] to include L3C_SIZE, pmap_enter(..., psind=1) will also call this function.

5689–5690 ↗(On Diff #136398)

When pagesizes[] is updated to include L3C_SIZE and vm_reserv.c is updated to set psind=1 on L3C pages, I expect this to switch to checking the page's psind field.

I do, however, want to point out that a good portion of the reduction in buildworld time is coming from performing a smaller number of icache flushes when creating executable mappings.

Have you looked at teaching the vm code to manage the icache? We currently call cpu_icache_sync_range more than we need to, e.g., if mapping the same physical address twice, we will call it twice.

sys/arm64/arm64/pmap.c
1697–1711 ↗(On Diff #136398)

XXX?

It would also be good if we didn't list the size, as it will differ based on the granule size.

5429 ↗(On Diff #136398)

Can we name this something that doesn't list the size, as it may be incorrect?

sys/arm64/arm64/pmap.c
1697–1711 ↗(On Diff #136398)

The XXX counters are here just for testing purposes. They were in the prior part too. I will delete them before committing the changes.

Rename pmap_enter_object()'s helpers to not have 64k or 2m in their names.

I do, however, want to point out that a good portion of the reduction in buildworld time is coming from performing a smaller number of icache flushes when creating executable mappings.

Have you looked at teaching the vm code to manage the icache? We currently call cpu_icache_sync_range more than we need to, e.g., if mapping the same physical address twice, we will call it twice.

We discussed this a while ago. To minimize icache syncing, I believe we need to identify all of the places where a kernel might modify a user-mapped page via the direct map. I think that hooking uiomove_* would get us most of the way there, but it's hard to be confident that that's sufficient.
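For illustration, one possible shape for VM-managed icache state, with m and prot as in pmap_enter(). VPO_ICACHE_CLEAN is an invented flag, and every direct-map write path (e.g., uiomove_*) would have to clear it:

	if ((prot & VM_PROT_EXECUTE) != 0 &&
	    (m->oflags & VPO_ICACHE_CLEAN) == 0) {
		/* First executable mapping since the page was last dirtied. */
		cpu_icache_sync_range((void *)PHYS_TO_DMAP(VM_PAGE_TO_PHYS(m)),
		    PAGE_SIZE);
		m->oflags |= VPO_ICACHE_CLEAN;
	}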

sys/arm64/arm64/pmap.c
5594 ↗(On Diff #136491)

Do we want to set *ml3p = NULL in this error path as well?

sys/vm/vm_reserv.c
1072

Perhaps also assert that npages <= VM_LEVEL_0_NPAGES?
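For illustration, such an assertion might read (message text hypothetical):

	KASSERT(npages <= VM_LEVEL_0_NPAGES,
	    ("vm_reserv_is_populated: npages %d exceeds reservation size",
	    npages));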

Add KASSERT to vm_reserv_is_populated().

alc marked 2 inline comments as done. Apr 6 2024, 5:19 AM
alc added inline comments.
sys/arm64/arm64/pmap.c
5594 ↗(On Diff #136491)

We did not because the PTP pointer will still be valid; the PTP can't have been freed. Curiously, pmap_enter_quick_locked() does, however, return NULL in this case, even though it, too, need not.

alc marked an inline comment as done. Apr 6 2024, 5:49 AM

I do, however, want to point out that a good portion of the reduction in buildworld time is coming from performing a smaller number of icache flushes when creating executable mappings.

Have you looked at teaching the vm code to manage the icache? We currently call cpu_icache_sync_range more than we need to, e.g., if mapping the same physical address twice, we will call it twice.

We discussed this a while ago. To minimize icache syncing, I believe we need to identify all of the places where a kernel might modify a user-mapped page via the direct map. I think that hooking uiomove_* would get us most of the way there, but it's hard to be confident that that's sufficient.

Yes, exactly. After the next couple of ATTR_CONTIGUOUS pieces are done, or at least up for review, I'll post a draft icache management patch so that we can begin working on our confidence. :-)

sys/arm64/arm64/pmap.c
5574 ↗(On Diff #136588)

The 14 lines starting here assume that *ml3p is non-NULL and do not read l3p. So they could be moved up to just after the assignment to l3p on line 5548, and limited to the !ADDR_IS_KERNEL case. You'd have to move the have_l3p target to just past line 5548 as well.

5592 ↗(On Diff #136588)

The variable old_l3e is used only in this line, so I suspect that it can be dropped.

sys/arm64/arm64/pmap.c
5469 ↗(On Diff #136588)

Is l2p necessary? Couldn't it be replaced by pde everywhere, since once l2p is computed from pde, that value of pde is not used again?

sys/arm64/arm64/pmap.c
5513 ↗(On Diff #136588)

#define PTE_TO_VM_PAGE(pte) PHYS_TO_VM_PAGE(PTE_TO_PHYS(pte))

would be a useful macro that would save a lot of horizontal space.
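For example, loads quoted later in this review would shrink from

	*ml3p = PHYS_TO_VM_PAGE(PTE_TO_PHYS(pmap_load(pde)));

to

	*ml3p = PTE_TO_VM_PAGE(pmap_load(pde));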

sys/arm64/arm64/pmap.c
5494 ↗(On Diff #136588)

How to write this section without gotos, in case you want to:

		l2pindex = pmap_l2_pindex(va);

		l3p = NULL;
		if (*ml3p != NULL && (*ml3p)->pindex != l2pindex)
			*ml3p = NULL;
		while (*ml3p == NULL) {
			/*
			 * Get the L2 entry.
			 */
			pde = pmap_pde(pmap, va, &lvl);

			/*
			 * If the L2 entry is a superpage, we either abort or
			 * demote depending on the given flags.
			 */
			if (lvl == 1) {
				pde = pmap_l1_to_l2(pde, va);
				if ((pmap_load(pde) & ATTR_DESCR_MASK) ==
				    L2_BLOCK) {
					if ((flags & PMAP_ENTER_NOREPLACE) != 0)
						return (KERN_FAILURE);
					l3p = pmap_demote_l2_locked(pmap, pde,
					    va, lockp);
					if (l3p == NULL)
						pde = NULL;
				} else
					pde = NULL;
			} else if (lvl == 2) {
				if (pmap_load(pde) == 0)
					pde = NULL;
			} else
				pde = NULL;

			/*
			 * If the L3 PTP is not mapped, we attempt to allocate it.
			 */
			if (pde != NULL) {
				*ml3p = PHYS_TO_VM_PAGE(PTE_TO_PHYS(
				    pmap_load(pde)));
			} else {
				*ml3p = _pmap_alloc_l3(pmap, l2pindex, (flags &
				    PMAP_ENTER_NOSLEEP) != 0 ? NULL : lockp);
				if (*ml3p != NULL) {
					--(*ml3p)->ref_count;
				} else if ((flags & PMAP_ENTER_NOSLEEP) != 0)
					return (KERN_FAILURE);
			}
		}
		(*ml3p)->ref_count += L3C_ENTRIES;

		/*
		 * If bti is not the same for the whole L3C range, return failure
		 * and let vm_fault() cope.  Check after L3 allocation, since
		 * it could sleep.
		 */
		if (!pmap_bti_same(pmap, va, va + L3C_SIZE)) {
			KASSERT(*ml3p != NULL, ("pmap_enter_l3c: missing L3 PTP"));
			(*ml3p)->ref_count -= L3C_ENTRIES - 1;
			pmap_abort_ptp(pmap, va, *ml3p);
			*ml3p = NULL;
			return (KERN_PROTECTION_FAILURE);
		}
		if (l3p == NULL)
			l3p = (pt_entry_t *)PHYS_TO_DMAP(VM_PAGE_TO_PHYS(*ml3p));

Eliminate an unnecessary variable.

alc marked 2 inline comments as done.Apr 6 2024, 10:53 PM
alc added inline comments.
sys/arm64/arm64/pmap.c
5513 ↗(On Diff #136588)

I would be happy to see this added as a followup change.

alc marked 2 inline comments as done.Apr 6 2024, 10:58 PM
alc added inline comments.
sys/arm64/arm64/pmap.c
5469 ↗(On Diff #136588)

It is more descriptive than pde, which can point to either an L1 or L2 entry.

alc marked an inline comment as done.Apr 6 2024, 10:59 PM
alc marked an inline comment as done.Apr 6 2024, 11:02 PM
sys/arm64/arm64/pmap.c
5494 ↗(On Diff #136588)

I wish to edit this bit of code, which I can only do by pasting another slightly altered version.

		l3p = NULL;
		if (*ml3p != NULL && (*ml3p)->pindex != l2pindex)
			*ml3p = NULL;
		while (*ml3p == NULL) {
			/*
			 * Get the L2 entry.
			 */
			pde = pmap_pde(pmap, va, &lvl);
			KASSERT(lvl == 1 || lvl == 2,
			    ("pmap_enter_l3c: Invalid level %d", lvl));

			/*
			 * If the L2 entry is a superpage, we either abort or
			 * demote depending on the given flags.
			 */
			if (lvl == 1) {
				pde = pmap_l1_to_l2(pde, va);
				if ((pmap_load(pde) & ATTR_DESCR_MASK) !=
				    L2_BLOCK)
					pde = NULL;
				else if ((flags & PMAP_ENTER_NOREPLACE) != 0)
					return (KERN_FAILURE);
				else if ((l3p = pmap_demote_l2_locked(pmap,
					    pde, va, lockp)) == NULL)
					pde = NULL;
			} else if (pmap_load(pde) == 0)
				pde = NULL;

			/*
			 * If the L3 PTP is not mapped, we attempt to allocate
			 * it.
			 */
			if (pde != NULL) {
				*ml3p = PHYS_TO_VM_PAGE(PTE_TO_PHYS(
				    pmap_load(pde)));
			} else {
				*ml3p = _pmap_alloc_l3(pmap, l2pindex, (flags &
				    PMAP_ENTER_NOSLEEP) != 0 ? NULL : lockp);
				if (*ml3p != NULL) {
					--(*ml3p)->ref_count;
				} else if ((flags & PMAP_ENTER_NOSLEEP) != 0)
					return (KERN_FAILURE);
			}
		}
		(*ml3p)->ref_count += L3C_ENTRIES;

		/*
		 * If bti is not the same for the whole L3C range, return failure
		 * and let vm_fault() cope.  Check after L3 allocation, since
		 * it could sleep.
		 */
		if (!pmap_bti_same(pmap, va, va + L3C_SIZE)) {
			KASSERT(*ml3p != NULL, ("pmap_enter_l3c: missing L3 PTP"));
			(*ml3p)->ref_count -= L3C_ENTRIES - 1;
			pmap_abort_ptp(pmap, va, *ml3p);
			*ml3p = NULL;
			return (KERN_PROTECTION_FAILURE);
		}
		if (l3p == NULL)
			l3p = (pt_entry_t *)PHYS_TO_DMAP(VM_PAGE_TO_PHYS(*ml3p));
This revision was not accepted when it landed; it landed in state Needs Review. Apr 7 2024, 5:29 PM
This revision was automatically updated to reflect the committed changes.

Update to reflect committed change.

This revision was not accepted when it landed; it landed in state Needs Review. Apr 9 2024, 4:22 PM
This revision was automatically updated to reflect the committed changes.