
vm: Introduce reservation-aware NOFREE page allocation routine
ClosedPublic

Authored by bnovkov on Jul 3 2024, 7:10 PM.
Subscribers

Details

Summary

This patch adds a reservation-based bump allocator intended for NOFREE pages.
The main goal of this change is to reduce the long-term fragmentation issues caused by NOFREE slabs.

The new routine will hand out NOFREE slabs from a preallocated superpage.
Once an active NOFREE superpage fills up, the routine will try to allocate a new one and discard the old one.
This routine will get invoked whenever VM_ALLOC_NOFREE is passed to vm_page_alloc_noobj or vm_page_alloc.
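
For readers unfamiliar with the pattern, the core mechanism is just a bump allocator over a superpage-sized chunk. Below is a minimal user-space C sketch of the idea; the names (struct nofree_state, nofree_bump_alloc) are illustrative rather than the symbols used in the patch, and aligned_alloc() merely stands in for obtaining a 2MB physical run.

#include <stdio.h>
#include <stdlib.h>

#define SUPERPAGE_SIZE  (2UL * 1024 * 1024)     /* 2MB superpage on amd64 */
#define PAGE_SIZE       4096UL

/* Illustrative per-domain bump-allocator state. */
struct nofree_state {
        char    *base;          /* current preallocated superpage */
        size_t   offset;        /* bytes already handed out */
};

/*
 * Hand out one page from the current superpage; grab a fresh superpage
 * when the current one is exhausted.  The exhausted superpage is simply
 * left behind, which is harmless because NOFREE memory never comes back.
 */
static void *
nofree_bump_alloc(struct nofree_state *st)
{
        void *p;

        if (st->base == NULL || st->offset == SUPERPAGE_SIZE) {
                st->base = aligned_alloc(SUPERPAGE_SIZE, SUPERPAGE_SIZE);
                if (st->base == NULL)
                        return (NULL);
                st->offset = 0;
        }
        p = st->base + st->offset;
        st->offset += PAGE_SIZE;
        return (p);
}

int
main(void)
{
        struct nofree_state st = { NULL, 0 };

        for (int i = 0; i < 4; i++)
                printf("NOFREE page %d at %p\n", i, nofree_bump_alloc(&st));
        return (0);
}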

Test Plan

Tested on amd64 using a bhyve vm.
No errors or panics were encountered while running vm-related stress2 tests for several hours.

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

sys/vm/uma_core.c
2110–2111 ↗(On Diff #140531)

I would love to be able to call this from pmap_growkernel() for page table pages that are never freed, not all of which will be allocated early after boot.

did fragmentation drop though?

sys/vm/uma_core.c
2110–2111 ↗(On Diff #140531)

This suggests that the right place for this allocator is in vm_kern.c, used from uma_small_alloc() and page_alloc() in UMA, and from pmap_growkernel(). It'd effectively act like a bump allocator that grabs a large contiguous run of pages in order to refill.

2146 ↗(On Diff #140531)

Why not allocate a contiguous 2MB chunk using vm_page_alloc_noobj_contig_domain()? I can't see any reason to use the low-level vm_phys allocator directly.
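
For context, the refill call being suggested would look roughly like the fragment below; the request flags, physical bounds, and the use of NBPDR for the 2MB size are my illustration, not the patch's actual code.

/*
 * Sketch: refill the NOFREE pool with a 2MB-aligned, physically
 * contiguous run via the generic allocator instead of calling into
 * vm_phys directly.
 */
m = vm_page_alloc_noobj_contig_domain(domain, VM_ALLOC_WIRED,
    NBPDR / PAGE_SIZE,          /* 512 pages == one amd64 superpage */
    0, ~(vm_paddr_t)0,          /* no low/high physical restriction */
    NBPDR, 0,                   /* 2MB alignment, no boundary */
    VM_MEMATTR_DEFAULT);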

2170 ↗(On Diff #140531)

This handles segregation for NOFREE zones with a 4KB slab size, but not for NOFREE zones with a larger slab size. For instance, proc_zone is NOFREE, and we have vm.uma.PROC.keg.ppera: 4. In particular, we may allocate more than one page per slab even when the item size is smaller than PAGE_SIZE, in order to decrease internal fragmentation.

sys/vm/uma_core.c
2110–2111 ↗(On Diff #140531)

My kneejerk reaction was: (1) introduce a VM_ALLOC_NOFREE flag and (2) move this code into vm_page_alloc_noobj_domain().

2121 ↗(On Diff #140531)
2170 ↗(On Diff #140531)

I could see having page_alloc() call a similarly modified vm_page_alloc_noobj_contig_domain() that supports VM_ALLOC_NOFREE. I would be fine if that vm_page_alloc_noobj_contig_domain() rejected requests that demanded particular low and high addresses.

sys/vm/uma_core.c
2131 ↗(On Diff #140531)
In D45863#1045744, @mjg wrote:

did fragmentation drop though?

Sorry, I forgot to link the benchmark results from a previous iteration of this patch.
The metrics I've gathered show that this approach does reduce NOFREE fragmentation.

sys/vm/uma_core.c
2146 ↗(On Diff #140531)

Ah, force of habit :')
I'll switch to the vm_page routine in the next iteration, thanks!

2170 ↗(On Diff #140531)

Right, I'd like to go with @alc 's approach of adding VM_ALLOC_NOFREE and plugging the NOFREE allocator into vm_page_alloc_noobj(_contig). That way, M_NEVERFREED would get translated to VM_ALLOC_NOFREE, which would then be handled automatically by the vm_page layer without the need to explicitly modify UMA or any other potential user.
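
To make the intended flow concrete, the translation described above would amount to something like the following; the helper name and the exact spot where the M_* flags get converted to VM_ALLOC_* flags are assumptions on my part.

/*
 * Sketch of the malloc-flag translation described above; the helper
 * name is a stand-in for wherever M_* flags are mapped to VM_ALLOC_*.
 */
static inline int
nofree_malloc2vm_flags(int malloc_flags)
{
        int pflags = 0;

        if ((malloc_flags & M_NEVERFREED) != 0)
                pflags |= VM_ALLOC_NOFREE;
        return (pflags);
}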

However, there's one thing I'd like to clear up before proceeding - is the choice of not using DMAP for kegs with ppera > 1 intentional? From what I can tell, page_alloc always calls kmem_malloc_domainset, which will always reach for a KVA arena.
If this choice is not intentional, we could add a call to vm_page_alloc_noobj_contig and fall back to kmem_malloc_domainset if it fails.

sys/vm/uma_core.c
2170 ↗(On Diff #140531)

Sorry, I missed your question here.

It's intentional - we don't want the allocation to fail if sufficient contiguous memory is not available. kmem_malloc_domainset() allocates from kernel_object, which has reservations enabled, so if a reservation is available, it'll return physically contiguous memory.

We could try to allocate physically contiguous memory and fall back, but IMO that should be done in the kmem layer rather than UMA. That is, rather than having kmem_back_domain() allocate 4KB pages in a loop, we should have a vm_page_alloc* interface which lets the caller ask for an array of pages, and the implementation would try to return a physically contiguous run if possible.
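
Purely to make the shape of that proposal concrete, such an interface might be declared along these lines; the name and signature are hypothetical, not an existing or committed API.

/*
 * Hypothetical batched allocator: fill ma[] with npages pages for the
 * given object starting at pindex, preferring a physically contiguous
 * run (e.g. from a reservation) and falling back to individual pages.
 * Returns the number of pages actually allocated.
 */
int vm_page_alloc_pages(vm_object_t object, vm_pindex_t pindex, int req,
    vm_page_t *ma, int npages);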

sys/vm/uma_core.c
2170 ↗(On Diff #140531)

Would you be okay with a temporary solution in UMA?
I have a patch for a batched page allocation interface that needs a bit more work, but I don't want to drag the NOFREE patches out for too long.
The batched page allocation patch has been sitting in my backlog for some time now, so this seems like a good opportunity to finish it and put it up for review.

sys/vm/uma_core.c
2170 ↗(On Diff #140531)

Once upon a time, I had expected vm_page_get_pages() to grow this capability, even when the pages were not coming from a reservation.

In my opinion, @bnovkov, you don't need to address both the ppera > 1 and the ppera == 1 cases in the same patch. Myself, I would approve a patch for the simpler ppera == 1 case while you are still working on the more complicated ppera > 1 case.

bnovkov retitled this revision from uma: Add reservation-based NOFREE slab segregation to vm: Introduce reservation-aware NOFREE page allocation routine.
bnovkov edited the summary of this revision. (Show Details)

Address @alc 's and @markj 's comments.

The NOFREE page allocator was moved to the vm_page layer and will now be invoked whenever VM_ALLOC_NOFREE is passed to vm_page_alloc_noobj or vm_page_alloc.
This should cover UMA kegs with both ppera == 1 and ppera > 1 since kmem_back_domain and uma_small_alloc use the previously listed interfaces to allocate pages.

Sorry, I forgot to link the benchmark results from a previous iteration of this patch.
The metrics I've gathered show that this approach does reduce NOFREE fragmentation.

A few years back I ran buildkernel in a loop; several runs later, fragmentation had increased significantly, to the point where the kernel was not able to use huge pages.

While technically not a blocker for this patch, something is definitely going wrong here -- the same workload run in a loop should have stabilized its NOFREE usage after maybe 2-3 runs, not kept increasing it toward some unknown bound. Someone(tm) should look into it, but admittedly this patch may happen to dodge the impact.

sys/vm/vm_page.c
176

This lacks padding -- __aligned(CACHE_LINE_SIZE).

Preferably this would be allocated from pages backed by the proper NUMA domain, but to my understanding there is no machinery present to make that painless.
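
The padding being asked for is the usual pattern of giving each per-domain slot its own cache line so that updates from one domain do not bounce the line another domain is reading; a generic sketch follows, with illustrative field names rather than the ones in the patch.

/* Sketch only: per-domain NOFREE allocator state, one slot per domain,
 * padded out to a cache line to avoid false sharing between domains. */
struct nofree_queue {
        vm_page_t       nq_page;        /* current preallocated chunk */
        int             nq_offset;      /* pages already handed out */
} __aligned(CACHE_LINE_SIZE);

static struct nofree_queue nofree_queues[MAXMEMDOM];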

Here is the amd64 pmap change:

diff --git a/sys/amd64/amd64/pmap.c b/sys/amd64/amd64/pmap.c
index 9f85e903cd74..841957db3b3b 100644
--- a/sys/amd64/amd64/pmap.c
+++ b/sys/amd64/amd64/pmap.c
@@ -5154,8 +5154,8 @@ pmap_growkernel(vm_offset_t addr)
                pdpe = pmap_pdpe(kernel_pmap, end);
                if ((*pdpe & X86_PG_V) == 0) {
                        nkpg = pmap_alloc_pt_page(kernel_pmap,
-                           pmap_pdpe_pindex(end), VM_ALLOC_WIRED |
-                           VM_ALLOC_INTERRUPT | VM_ALLOC_ZERO);
+                           pmap_pdpe_pindex(end), VM_ALLOC_INTERRUPT |
+                               VM_ALLOC_NOFREE | VM_ALLOC_WIRED | VM_ALLOC_ZERO);
                        if (nkpg == NULL)
                                panic("pmap_growkernel: no memory to grow kernel");
                        paddr = VM_PAGE_TO_PHYS(nkpg);
@@ -5174,7 +5174,8 @@ pmap_growkernel(vm_offset_t addr)
                }
 
                nkpg = pmap_alloc_pt_page(kernel_pmap, pmap_pde_pindex(end),
-                   VM_ALLOC_WIRED | VM_ALLOC_INTERRUPT | VM_ALLOC_ZERO);
+                   VM_ALLOC_INTERRUPT | VM_ALLOC_NOFREE | VM_ALLOC_WIRED |
+                   VM_ALLOC_ZERO);
                if (nkpg == NULL)
                        panic("pmap_growkernel: no memory to grow kernel");
                paddr = VM_PAGE_TO_PHYS(nkpg);
sys/vm/vm_page.c
2512–2513

As an aside, I created vm_page_alloc_freelist{,_domain}() to support faster allocation of pages that were mapped by the partial direct map on 32-bit MIPS. So, they have not been used for some time. I expected that they might find other uses too, but those other uses have never materialized. Instead, people use the more general vm_page_alloc_contig().

Do we want to retire vm_page_alloc_freelist{,_domain}()?

sys/vm/vm_page.c
2120–2125

I'm making a comment here for lack of a better place. I assume that we get here via uma_core.c's page_alloc(). In that case, we really want kmem_malloc_domain() to get the virtual address for M_NEVERFREED requests from an arena other than vm_dom[domain].vmd_kernel_{rwx_}arena. Otherwise, we are likely to end up with a 2MB region that is mostly backed by a reservation but will now be forever unpromotable.

sys/vm/vm_page.h
624 ↗(On Diff #140873)

Does this really need to be public? Also, unlike the other functions here, it does not return a fully initialized page.

In D45863#1047954, @mjg wrote:

Sorry, I forgot to link the benchmark results from a previous iteration of this patch.
The metrics I've gathered show that this approach does reduce NOFREE fragmentation.

A few years back I ran buildkernel in a loop; several runs later, fragmentation had increased significantly, to the point where the kernel was not able to use huge pages.

While technically not a blocker for this patch, something is definitely going wrong here -- the same workload run in a loop should have stabilized its NOFREE usage after maybe 2-3 runs, not kept increasing it toward some unknown bound. Someone(tm) should look into it, but admittedly this patch may happen to dodge the impact.

I've been cross-compiling a lot of arm64 kernels lately, and have observed a surprising number of broken reservations, a lot more during a single buildkernel than during an entire buildworld. I haven't determined the cause, but it is not because the machine is short of memory. If the unused pages from the broken reservation get allocated before the used ones are eventually freed, then we've likely lost that contiguous chunk until/unless compaction is performed or we switch to reservation breaking as described in the Quicksilver paper.

In D45863#1048057, @alc wrote:

Here is the amd64 pmap change:

Thank you!
I'll do the same for the other pmaps and bundle all changes in a separate revision.

sys/vm/vm_page.c
2120–2125

Right, adding a 'nofree' KVA arena should do the trick.
I'll land this in a separate revision.

sys/vm/vm_page.c
166

Extra newline.

176

I'm not sure why such optimizations are important for a structure that is, by definition, going to be referenced only a finite number of times during a system's uptime, however long it is.

2120

This kind of allocation will be rare, so it should be annotated with __predict_false.
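
For reference, the requested annotation would look like the fragment below; the helper name is a stand-in for whatever the patch's NOFREE entry point ends up being called.

/* Sketch: NOFREE requests are rare, so hint the branch accordingly. */
if (__predict_false((req & VM_ALLOC_NOFREE) != 0))
        m = vm_page_alloc_nofree_domain(domain, req);   /* stand-in name */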

2452

Same comment here about the branch hint.

2512–2513

I'm in support of removing those functions.

2549

I don't really see why this functionality should be dependent on VM_NRESERVLEVEL > 0. It seems to me that we could do something similar to the definition of KVA_QUANTUM.

2557

Do we really need a separate lock here? I would have just used the per-domain vm_phys lock to protect the bump allocator's state.

sys/vm/vm_page.c
2549

Arguably, even when VM_NRESERVLEVEL == 0, we still want to segregate nofree allocations for the sake of vm_page_alloc_contig()/contigmalloc(). Also, when VM_NRESERVLEVEL == 0, I believe the compiler is going to issue a warning that this function is unused.

2558

By default, VM_LEVEL_0_ORDER is now 64KB on arm64. I agree with @markj 's suggestion to define a KVA_QUANTUM-like constant that would be defined appropriately for each of VM_LEVEL_0_ORDER == 0, 1, or 2.

Address @alc 's and @markj 's comments:

  • nofree queues are now guarded by per-domain vm_phys locks
  • NOFREE pages will now get allocated through the new routine on all systems. I've added a VM_NOFREE_IMPORT_ORDER macro that follows the same logic as the KVA_QUANTUM calculation, so the import size aligns with the KVA import sizes (see the sketch below).
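
Mirroring KVA_QUANTUM in vm_kern.c, the macro would take roughly the following shape; the exact fallback order used when reservations are disabled is an assumption here, not necessarily the value in the patch.

/*
 * Sketch: order (in pages) of the chunk imported by the NOFREE
 * allocator, sized to the level-0 reservation where reservations are
 * enabled and to a fixed order otherwise, as KVA_QUANTUM does.
 */
#if VM_NRESERVLEVEL > 0
#define VM_NOFREE_IMPORT_ORDER  VM_LEVEL_0_ORDER
#else
#define VM_NOFREE_IMPORT_ORDER  8       /* illustrative fallback */
#endif
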
sys/vm/vm_page.c
171–174

I would consider including this struct in the vm_domain, rather than being a (padded) stand-alone array. @markj What do you think?

198

Could you please move this up after vm_page_alloc_check() to maintain the mostly sorted order.

2568

Need to deindent by 4 positions.

sys/vm/vm_page.c
171–174

Indeed, I think that makes sense. Maybe we want a way to add global per-NUMA domain variables, like we do with PCPU/DPCPU, but until that day comes it's better to keep everything VM-related together.

Address @alc 's comments.
The bump allocator state was moved to struct vm_domain.
I was not entirely sure where to place the nofreeq struct within struct vm_domain, so please let me know if you think there's a better position w.r.t. cache usage.

I was not entirely sure where to place the nofreeq struct within struct vm_domain, so please let me know if you think there's a better position w.r.t. cache usage.

The nearby fields are constants. However, accesses to nofreeq become increasingly rare as time goes on, so placing it in otherwise unused space created by __aligned(CACHE_LINE_SIZE) makes sense to me.

I've tested the entire collection of patches this weekend, and everything seemed fine. As far as I'm concerned, the patches are ready for committing.

Later, I'd like to see some counters added to track the number of nofree allocations. Also, the one downside that I see is that the pages allocated for kmem_malloc() will never be promotable to a superpage.

This revision is now accepted and ready to land. Jul 28 2024, 6:47 PM
In D45863#1052338, @alc wrote:

I've tested the entire collection of patches this weekend, and everything seemed fine. As far as I'm concerned, the patches are ready for committing.

Later, I'd like to see some counters added to track the number of nofree allocations. Also, the one downside that I see is that the pages allocated for kmem_malloc() will never be promotable to a superpage.

Thank you for testing the changes!
I'll land the counters in a separate revision.

This revision was automatically updated to reflect the committed changes.