
vm: Introduce reservation-aware NOFREE page allocation routine
ClosedPublic

Authored by bnovkov on Jul 3 2024, 7:10 PM.
Subscribers

Details

Summary

This patch adds a reservation-based bump allocator intended for NOFREE pages.
The main goal of this change is to reduce the long-term fragmentation issues caused by NOFREE slabs.

The new routine will hand out NOFREE slabs from a preallocated superpage.
Once an active NOFREE superpage fills up, the routine will try to allocate a new one and discard the old one.
This routine will get invoked whenever VM_ALLOC_NOFREE is passed to vm_page_alloc_noobj or vm_page_alloc.
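
For readers unfamiliar with the pattern, the core mechanism is just a bump allocator over a superpage-sized chunk. Below is a minimal user-space C sketch of the idea; the names (struct nofree_state, nofree_bump_alloc) are illustrative rather than the symbols used in the patch, and aligned_alloc() merely stands in for obtaining a 2MB physical run.

#include <stdio.h>
#include <stdlib.h>

#define SUPERPAGE_SIZE  (2UL * 1024 * 1024)     /* 2MB superpage on amd64 */
#define PAGE_SIZE       4096UL

/* Illustrative per-domain bump-allocator state. */
struct nofree_state {
        char    *base;          /* current preallocated superpage */
        size_t   offset;        /* bytes already handed out */
};

/*
 * Hand out one page from the current superpage; grab a fresh superpage
 * when the current one is exhausted.  The exhausted superpage is simply
 * left behind, which is harmless because NOFREE memory never comes back.
 */
static void *
nofree_bump_alloc(struct nofree_state *st)
{
        void *p;

        if (st->base == NULL || st->offset == SUPERPAGE_SIZE) {
                st->base = aligned_alloc(SUPERPAGE_SIZE, SUPERPAGE_SIZE);
                if (st->base == NULL)
                        return (NULL);
                st->offset = 0;
        }
        p = st->base + st->offset;
        st->offset += PAGE_SIZE;
        return (p);
}

int
main(void)
{
        struct nofree_state st = { NULL, 0 };

        for (int i = 0; i < 4; i++)
                printf("NOFREE page %d at %p\n", i, nofree_bump_alloc(&st));
        return (0);
}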

Test Plan

Tested on amd64 using a bhyve vm.
No errors or panics were encountered while running vm-related stress2 tests for several hours.

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

sys/vm/uma_core.c
2110–2111 ↗(On Diff #140531)

I would love to be able to call this from pmap_growkernel() for page table pages that are never freed, not all of which will be allocated early after boot.

did fragmentation drop though?

sys/vm/uma_core.c
2110–2111 ↗(On Diff #140531)

This suggests that the right place for this allocator is in vm_kern.c, used from uma_small_alloc() and page_alloc() in UMA, and from pmap_growkernel(). It'd effectively act like a bump allocator that grabs a large contiguous run of pages in order to refill.

2146 ↗(On Diff #140531)

Why not allocate a contiguous 2MB chunk using vm_page_alloc_noobj_contig_domain()? I can't see any reason to use the low-level vm_phys allocator directly.
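
For context, the refill call being suggested would look roughly like the fragment below; the request flags, physical bounds, and the use of NBPDR for the 2MB size are my illustration, not the patch's actual code.

/*
 * Sketch: refill the NOFREE pool with a 2MB-aligned, physically
 * contiguous run via the generic allocator instead of calling into
 * vm_phys directly.
 */
m = vm_page_alloc_noobj_contig_domain(domain, VM_ALLOC_WIRED,
    NBPDR / PAGE_SIZE,          /* 512 pages == one amd64 superpage */
    0, ~(vm_paddr_t)0,          /* no low/high physical restriction */
    NBPDR, 0,                   /* 2MB alignment, no boundary */
    VM_MEMATTR_DEFAULT);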

2170 ↗(On Diff #140531)

This handles segregation for NOFREE zones with a 4KB slab size, but not for NOFREE zones with a larger slab size. For instance, proc_zone is NOFREE, and we have vm.uma.PROC.keg.ppera: 4. In particular, we may allocate more than one page per slab even when the item size is smaller than PAGE_SIZE, in order to decrease internal fragmentation.

sys/vm/uma_core.c
2110–2111 ↗(On Diff #140531)

My kneejerk reaction was: (1) introduce a VM_ALLOC_NOFREE flag and (2) move this code into vm_page_alloc_noobj_domain().

2121 ↗(On Diff #140531)
2170 ↗(On Diff #140531)

I could see having page_alloc() call a similarly modified vm_page_alloc_noobj_contig_domain() that supports VM_ALLOC_NOFREE. I would be fine if that vm_page_alloc_noobj_contig_domain() rejected requests that demanded particular low and high addresses.

sys/vm/uma_core.c
2131 ↗(On Diff #140531)
In D45863#1045744, @mjg wrote:

did fragmentation drop though?

Sorry, I forgot to link the benchmark results from a previous iteration of this patch.
The metrics I've gathered show that this approach does reduce NOFREE fragmentation.

sys/vm/uma_core.c
2146 ↗(On Diff #140531)

Ah, force of habit :')
I'll switch to the vm_page routine in the next iteration, thanks!

2170 ↗(On Diff #140531)

Right, I'd like to go with @alc 's approach of adding VM_ALLOC_NOFREE and plugging the NOFREE allocator into vm_page_alloc_noobj(_contig). That way, M_NEVERFREED would get translated to VM_ALLOC_NOFREE, which would then be handled automatically by the vm_page layer without the need to explicitly modify UMA or any other potential user.
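
To make the intended flow concrete, the translation described above would amount to something like the following; the helper name and the exact spot where the M_* flags get converted to VM_ALLOC_* flags are assumptions on my part.

/*
 * Sketch of the malloc-flag translation described above; the helper
 * name is a stand-in for wherever M_* flags are mapped to VM_ALLOC_*.
 */
static inline int
nofree_malloc2vm_flags(int malloc_flags)
{
        int pflags = 0;

        if ((malloc_flags & M_NEVERFREED) != 0)
                pflags |= VM_ALLOC_NOFREE;
        return (pflags);
}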

However, there's one thing I'd like to clear up before proceeding - is the choice of not using DMAP for kegs with ppera > 1 intentional? From what I can tell, page_alloc always calls kmem_malloc_domainset, which will always reach for a KVA arena.
If this choice is not intentional, we could add a call to vm_page_alloc_noobj_contig and fall back to kmem_malloc_domainset if it fails.

sys/vm/uma_core.c
2170 ↗(On Diff #140531)

Sorry, I missed your question here.

It's intentional - we don't want the allocation to fail if sufficient contiguous memory is not available. kmem_malloc_domainset() allocates from kernel_object, which has reservations enabled, so if a reservation is available, it'll return physically contiguous memory.

We could try to allocate physically contiguous memory and fall back, but IMO that should be done in the kmem layer rather than UMA. That is, rather than having kmem_back_domain() allocate 4KB pages in a loop, we should have a vm_page_alloc* interface which lets the caller ask for an array of pages, and the implementation would try to return a physically contiguous run if possible.
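
Purely to make the shape of that proposal concrete, such an interface might be declared along these lines; the name and signature are hypothetical, not an existing or committed API.

/*
 * Hypothetical batched allocator: fill ma[] with npages pages for the
 * given object starting at pindex, preferring a physically contiguous
 * run (e.g. from a reservation) and falling back to individual pages.
 * Returns the number of pages actually allocated.
 */
int vm_page_alloc_pages(vm_object_t object, vm_pindex_t pindex, int req,
    vm_page_t *ma, int npages);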

sys/vm/uma_core.c
2170 ↗(On Diff #140531)

Would you be okay with a temporary solution in UMA?
I have a patch for a batched page allocation interface that needs a bit more work, but I don't want to drag the NOFREE patches out for too long.
The batched page allocation patch has been sitting in my backlog for some time now, so this seems like a good opportunity to finish it and put it up for review.

sys/vm/uma_core.c
2170 ↗(On Diff #140531)

Once upon a time, I had expected vm_page_get_pages() to grow this capability, even when the pages were not coming from a reservation.

In my opinion, @bnovkov, you don't need to address both the ppera > 1 and the ppera == 1 cases in the same patch. Myself, I would approve a patch for the simpler ppera == 1 case while you are still working on the more complicated ppera > 1 case.

bnovkov retitled this revision from uma: Add reservation-based NOFREE slab segregation to vm: Introduce reservation-aware NOFREE page allocation routine.
bnovkov edited the summary of this revision. (Show Details)

Address @alc 's and @markj 's comments.

The NOFREE page allocator was moved to the vm_page layer and will now be invoked whenever VM_ALLOC_NOFREE is passed to vm_page_alloc_noobj or vm_page_alloc.
This should cover UMA kegs with both ppera == 1 and ppera > 1 since kmem_back_domain and uma_small_alloc use the previously listed interfaces to allocate pages.

Sorry, I forgot to link the benchmark results from a previous iteration of this patch.
The metrics I've gathered show that this approach does reduce NOFREE fragmentation.

A few years back I ran buildkernel in a loop; several runs later, fragmentation had increased significantly, to the point where the kernel was not able to use huge pages.

While technically not a blocker for this patch, something is definitely going wrong here -- the same workload run in a loop should have stabilized its NOFREE usage after maybe 2-3 runs, not kept increasing it toward some unknown bound. Someone(tm) should look into it, but admittedly this patch may happen to dodge the impact.

sys/vm/vm_page.c
176

This lacks padding -- __aligned(CACHE_LINE_SIZE).

Preferably this would be allocated from pages backed by the proper NUMA domain, but to my understanding there is no machinery present to make that painless.
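
The padding being asked for is the usual pattern of giving each per-domain slot its own cache line so that updates from one domain do not bounce the line another domain is reading; a generic sketch follows, with illustrative field names rather than the ones in the patch.

/* Sketch only: per-domain NOFREE allocator state, one slot per domain,
 * padded out to a cache line to avoid false sharing between domains. */
struct nofree_queue {
        vm_page_t       nq_page;        /* current preallocated chunk */
        int             nq_offset;      /* pages already handed out */
} __aligned(CACHE_LINE_SIZE);

static struct nofree_queue nofree_queues[MAXMEMDOM];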

Here is the amd64 pmap change:

diff --git a/sys/amd64/amd64/pmap.c b/sys/amd64/amd64/pmap.c
index 9f85e903cd74..841957db3b3b 100644
--- a/sys/amd64/amd64/pmap.c
+++ b/sys/amd64/amd64/pmap.c
@@ -5154,8 +5154,8 @@ pmap_growkernel(vm_offset_t addr)
                pdpe = pmap_pdpe(kernel_pmap, end);
                if ((*pdpe & X86_PG_V) == 0) {
                        nkpg = pmap_alloc_pt_page(kernel_pmap,
-                           pmap_pdpe_pindex(end), VM_ALLOC_WIRED |
-                           VM_ALLOC_INTERRUPT | VM_ALLOC_ZERO);
+                           pmap_pdpe_pindex(end), VM_ALLOC_INTERRUPT |
+                               VM_ALLOC_NOFREE | VM_ALLOC_WIRED | VM_ALLOC_ZERO);
                        if (nkpg == NULL)
                                panic("pmap_growkernel: no memory to grow kernel");
                        paddr = VM_PAGE_TO_PHYS(nkpg);
@@ -5174,7 +5174,8 @@ pmap_growkernel(vm_offset_t addr)
                }
 
                nkpg = pmap_alloc_pt_page(kernel_pmap, pmap_pde_pindex(end),
-                   VM_ALLOC_WIRED | VM_ALLOC_INTERRUPT | VM_ALLOC_ZERO);
+                   VM_ALLOC_INTERRUPT | VM_ALLOC_NOFREE | VM_ALLOC_WIRED |
+                   VM_ALLOC_ZERO);
                if (nkpg == NULL)
                        panic("pmap_growkernel: no memory to grow kernel");
                paddr = VM_PAGE_TO_PHYS(nkpg);
sys/vm/vm_page.c
2512–2513

As an aside, I created vm_page_alloc_freelist{,_domain}() to support faster allocation of pages that were mapped by the partial direct map on 32-bit MIPS. So, they have not been used for some time. I expected that they might find other uses too, but those other uses have never materialized. Instead, people use the more general vm_page_alloc_contig().

Do we want to retire vm_page_alloc_freelist{,_domain}()?

sys/vm/vm_page.c
2120–2125

I'm making a comment here for lack of a better place. I assume that we get here via uma_core.c's page_alloc(). In that case, we really want kmem_malloc_domain() to get the virtual address for M_NEVERFREED requests from an arena other than vm_dom[domain].vmd_kernel_{rwx_}arena. Otherwise, we are likely to end up with a 2MB region that is mostly backed by a reservation but will now be forever unpromotable.

sys/vm/vm_page.h
624 ↗(On Diff #140873)

Does this really need to be public? Also, unlike the other functions here, it does not return a fully initialized page.

In D45863#1047954, @mjg wrote:

Sorry, I forgot to link the benchmark results from a previous iteration of this patch.
The metrics I've gathered show that this approach does reduce NOFREE fragmentation.

A few years back I ran buildkernel in a loop; several runs later, fragmentation had increased significantly, to the point where the kernel was not able to use huge pages.

While technically not a blocker for this patch, something is definitely going wrong here -- the same workload run in a loop should have stabilized its NOFREE usage after maybe 2-3 runs, not kept increasing it toward some unknown bound. Someone(tm) should look into it, but admittedly this patch may happen to dodge the impact.

I've been cross-compiling a lot of arm64 kernels lately, and have observed a surprising number of broken reservations, a lot more during a single buildkernel than during an entire buildworld. I haven't determined the cause, but it is not because the machine is short of memory. If the unused pages from the broken reservation get allocated before the used ones are eventually freed, then we've likely lost that contiguous chunk until/unless compaction is performed or we switch to reservation breaking as described in the Quicksilver paper.

In D45863#1048057, @alc wrote:

Here is the amd64 pmap change:

Thank you!
I'll do the same for the other pmaps and bundle all changes in a separate revision.

sys/vm/vm_page.c
2120–2125

Right, adding a 'nofree' KVA arena should do the trick.
I'll land this in a separate revision.

sys/vm/vm_page.c
166

Extra newline.

176

I'm not sure why such optimizations are important for a structure that is, by definition, going to be referenced only a finite number of times during a system's uptime, however long it is.

2120

This kind of allocation will be rare, so it should be annotated with __predict_false.
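
For reference, the requested annotation would look like the fragment below; the helper name is a stand-in for whatever the patch's NOFREE entry point ends up being called.

/* Sketch: NOFREE requests are rare, so hint the branch accordingly. */
if (__predict_false((req & VM_ALLOC_NOFREE) != 0))
        m = vm_page_alloc_nofree_domain(domain, req);   /* stand-in name */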

2452

Same comment here about the branch hint.

2512–2513

I'm in support of removing those functions.

2549

I don't really see why this functionality should be dependent on VM_NRESERVLEVEL > 0. It seems to me that we could do something similar to the definition of KVA_QUANTUM.

2557

Do we really need a separate lock here? I would have just used the per-domain vm_phys lock to protect the bump allocator's state.

sys/vm/vm_page.c
2549

Arguably, even when VM_NRESERVLEVEL == 0, we still want to segregate nofree allocations for the sake of vm_page_alloc_contig()/contigmalloc(). Also, when VM_NRESERVLEVEL == 0, I believe the compiler is going to issue a warning that this function is unused.

2558

By default, VM_LEVEL_0_ORDER is now 64KB on arm64. I agree with @markj 's suggestion to define a KVA_QUANTUM-like constant that would be defined appropriately for each of VM_LEVEL_0_ORDER == 0, 1, or 2.

Address @alc 's and @markj 's comments:

  • nofree queues are now guarded by per-domain vm_phys locks
  • NOFREE pages will now get allocated through the new routine on all systems. I've added a VM_NOFREE_IMPORT_ORDER macro that follows the same logic as the KVA_QUANTUM calculation, so the import size aligns with the KVA import sizes (see the sketch below).
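
Mirroring KVA_QUANTUM in vm_kern.c, the macro would take roughly the following shape; the exact fallback order used when reservations are disabled is an assumption here, not necessarily the value in the patch.

/*
 * Sketch: order (in pages) of the chunk imported by the NOFREE
 * allocator, sized to the level-0 reservation where reservations are
 * enabled and to a fixed order otherwise, as KVA_QUANTUM does.
 */
#if VM_NRESERVLEVEL > 0
#define VM_NOFREE_IMPORT_ORDER  VM_LEVEL_0_ORDER
#else
#define VM_NOFREE_IMPORT_ORDER  8       /* illustrative fallback */
#endif
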
sys/vm/vm_page.c
171–174

I would consider including this struct in the vm_domain, rather than being a (padded) stand-alone array. @markj What do you think?

198

Could you please move this up after vm_page_alloc_check() to maintain the mostly sorted order.

2568

Need to deindent by 4 positions.

sys/vm/vm_page.c
171–174

Indeed, I think that makes sense. Maybe we want a way to add global per-NUMA domain variables, like we do with PCPU/DPCPU, but until that day comes it's better to keep everything VM-related together.

Address @alc 's comments.
The bump allocator state was moved to struct vm_domain.
I was not entirely sure where to place the nofreeq struct within struct vm_domain, so please let me know if you think there's a better position w.r.t. cache usage.

I was not entirely sure where to place the nofreeq struct within struct vm_domain, so please let me know if you think there's a better position w.r.t. cache usage.

The nearby fields are constants. However, accesses to nofreeq become increasingly rare as time goes on, so placing it in otherwise unused space created by __aligned(CACHE_LINE_SIZE) makes sense to me.

I've tested the entire collection of patches this weekend, and everything seemed fine. As far as I'm concerned, the patches are ready for committing.

Later, I'd like to see some counters added to track the number of nofree allocations. Also, the one downside that I see is that the pages allocated for kmem_malloc() will never be promotable to a superpage.

This revision is now accepted and ready to land. Jul 28 2024, 6:47 PM
In D45863#1052338, @alc wrote:

I've tested the entire collection of patches this weekend, and everything seemed fine. As far as I'm concerned, the patches are ready for committing.

Later, I'd like to see some counters added to track the number of nofree allocations. Also, the one downside that I see is that the pages allocated for kmem_malloc() will never be promotable to a superpage.

Thank you for testing the changes!
I'll land the counters in a separate revision.

This revision was automatically updated to reflect the committed changes.