Currently we allocate a single 4KB page for each per-CPU structure and
place the pages in an array. UMA's dynamic per-CPU allocator returns arrays
of 4KB pages that mirror the base array, so per-CPU data can be
referenced relative to the location of the base pcpu structure.
However, there is no requirement that the base pcpu structures be
contiguous. This commit adds code to dynamically lay out
the base pcpu structures with the aim of using 2MB mappings for all
per-CPU data (base pcpu, DPCPU, and UMA slabs). This has the following
benefits:
1. Improved TLB efficiency. During boot a GENERIC kernel allocates
roughly 64KB of data per CPU, mostly for counter(9) and malloc(9)
stats. Currently all of that data is mapped using 4KB pages. The
amount of per-CPU data allocated during boot is only going to grow
over time, and some subsystems (e.g., pf and VFS) perform many
dynamic allocations of per-CPU data.
2. Better control over kernel memory fragmentation. Previously,
dynamically allocated per-CPU slabs were allocated directly from the
page allocator, so their placement was effectively random; since
per-CPU data structures tend to be long-lived, those randomly placed
pages contribute to long-term fragmentation of kernel memory.
3. The DPCPU indirection is removed. Previously, accessing a DPCPU
field involved an extra memory access. With this patch the DPCPU
region immediately follows the base pcpu structure, so its offset
relative to the base is known at compile time (see the sketch after
this list).
4. Proper NUMA affinity for per-CPU structures allocated early during
boot. The new code includes a bootstrap allocator that always returns
domain-correct memory. Previously, anything allocated during
SI_SUB_CPU or earlier was placed entirely in domain 0.
5. It allows the UMA per-CPU slab size limit of 4KB to be increased, if
that ever becomes useful.
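
To make benefit 3 concrete, here is a minimal sketch, assuming a
hypothetical simplified pcpu structure (the names and sizes below are
illustrative, not the kernel's actual definitions): once the DPCPU
region is known to sit directly after the pcpu structure, a DPCPU field
can be reached by adding a constant to the pcpu base instead of first
loading a per-CPU pointer.

    #include <stddef.h>
    #include <stdio.h>

    struct pcpu_sketch {
            void *pc_dynamic;   /* old scheme: pointer to a separate DPCPU region */
            char  pc_other[56]; /* stand-in for the remaining pcpu fields */
    };

    /* Old scheme: load the per-CPU pointer, then add the field offset. */
    void *
    dpcpu_field_indirect(struct pcpu_sketch *pc, size_t off)
    {
            return ((char *)pc->pc_dynamic + off);  /* extra memory access */
    }

    /*
     * New scheme: the DPCPU region starts right after the pcpu structure,
     * so a field's offset from the pcpu base is a compile-time constant.
     */
    void *
    dpcpu_field_direct(struct pcpu_sketch *pc, size_t off)
    {
            return ((char *)pc + sizeof(struct pcpu_sketch) + off);
    }

    int
    main(void)
    {
            static char backing[4096];
            struct pcpu_sketch *pc = (struct pcpu_sketch *)backing;

            pc->pc_dynamic = backing + sizeof(*pc); /* both schemes agree here */
            printf("indirect %p, direct %p\n",
                dpcpu_field_indirect(pc, 8), dpcpu_field_direct(pc, 8));
            return (0);
    }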
CPUs are grouped by domain into 2MB pages. Initially, MAXCPU * NBPDR
bytes of KVA are reserved for per-CPU data, but this is a worst case and
in practice most of it is not used. Per-CPU data is always mapped into
ranges of 2MB pages. For example, with 4 CPUs and 2 domains, the
allocator always allocates 4MB of KVA at once. Usually the KVA quantum
will be 2MB * vm_ndomains, but it may be larger if there are many CPUs
in a domain. The initial segment for a given CPU looks like this:
| pcpu 0 | dpcpu 0 | UMA per-CPU slabs ... | pcpu 1 | dpcpu 1 | ...
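
As an illustration of this layout, the following sketch computes where
each region would land within a 2MB page; the sizes are assumptions
picked for the example (in reality they are derived at boot from the
actual pcpu, DPCPU, and UMA slab requirements):

    #include <stdint.h>
    #include <stdio.h>

    #define NBPDR_SKETCH  (2ul << 20)     /* 2MB superpage size */
    #define PCPU_SIZE     4096ul          /* assumed base pcpu size */
    #define DPCPU_SIZE    (16ul * 1024)   /* assumed DPCPU region size */
    #define SLABS_SIZE    (44ul * 1024)   /* assumed room for UMA per-CPU slabs */
    #define PCPU_STRIDE   (PCPU_SIZE + DPCPU_SIZE + SLABS_SIZE)

    int
    main(void)
    {
            uintptr_t page = 0xffffffff80200000ul;  /* stand-in KVA, 2MB aligned */

            printf("CPUs per 2MB page: %lu\n", NBPDR_SKETCH / PCPU_STRIDE);
            for (unsigned cpu = 0; cpu < 2; cpu++) {
                    uintptr_t seg = page + cpu * PCPU_STRIDE;
                    printf("cpu %u: pcpu %#lx dpcpu %#lx slabs %#lx\n", cpu,
                        (unsigned long)seg,
                        (unsigned long)(seg + PCPU_SIZE),
                        (unsigned long)(seg + PCPU_SIZE + DPCPU_SIZE));
            }
            return (0);
    }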
Early during boot we allocate 2MB of physical memory for the BSP, mapped
at VM_MIN_KERNEL_ADDRESS. This is used to allocate the BSP's pcpu and
dpcpu regions. The rest is given to the UMA per-CPU bootstrap
allocator, which returns memory from this region.
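
A bump allocator over that region is enough to convey the idea; the
following is a sketch under that assumption, with made-up names (the
real bootstrap path lives in the UMA and pcpu code and is more
involved):

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    struct bootstrap_region {
            uintptr_t br_base;  /* start of the BSP's preallocated 2MB region */
            size_t    br_size;  /* total size of the region */
            size_t    br_off;   /* bytes handed out so far */
    };

    /*
     * Hand out "size" bytes at the requested power-of-two alignment, or
     * return 0 once the region is exhausted.
     */
    uintptr_t
    bootstrap_alloc(struct bootstrap_region *br, size_t size, size_t align)
    {
            uintptr_t addr;

            addr = (br->br_base + br->br_off + align - 1) &
                ~(uintptr_t)(align - 1);
            if (addr + size > br->br_base + br->br_size)
                    return (0);
            br->br_off = addr + size - br->br_base;
            return (addr);
    }

    int
    main(void)
    {
            struct bootstrap_region br = { 0xffffffff80200000ul, 2ul << 20, 0 };

            printf("pcpu  %#lx\n", (unsigned long)bootstrap_alloc(&br, 4096, 4096));
            printf("dpcpu %#lx\n", (unsigned long)bootstrap_alloc(&br, 16384, 64));
            printf("slabs %#lx\n", (unsigned long)bootstrap_alloc(&br, 4096, 4096));
            return (0);
    }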
During SI_SUB_CPU, after CPU IDs are fixed, pcpu_layout() computes the
addresses of the remaining pcpu structures, packing as many CPUs as
possible into each 2MB page. It also backs the rest of the bootstrap
region with 2MB physical pages if needed. After this point UMA's pcpu
allocator exits bootstrap mode and starts using a vmem arena to manage
KVA for CPU 0.
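
The packing itself can be pictured roughly as follows; the per-CPU
stride, domain table, and starting address are all made up for the
example, and this is not the actual pcpu_layout() implementation:

    #include <stdint.h>
    #include <stdio.h>

    #define NBPDR_SKETCH (2ul << 20)     /* 2MB superpage */
    #define PCPU_STRIDE  (512ul * 1024)  /* assumed (inflated) per-CPU segment size */
    #define NCPU         6
    #define NDOMAIN      2

    int
    main(void)
    {
            int cpu_domain[NCPU] = { 0, 0, 0, 0, 0, 1 };    /* assumed topology */
            uintptr_t cpu_addr[NCPU];
            uintptr_t page = 0xffffffff80200000ul;  /* stand-in for reserved KVA */

            for (int dom = 0; dom < NDOMAIN; dom++) {
                    uintptr_t off = 0;
                    for (int cpu = 0; cpu < NCPU; cpu++) {
                            if (cpu_domain[cpu] != dom)
                                    continue;
                            if (off + PCPU_STRIDE > NBPDR_SKETCH) {
                                    /* This 2MB page is full; start a new one. */
                                    page += NBPDR_SKETCH;
                                    off = 0;
                            }
                            cpu_addr[cpu] = page + off;
                            off += PCPU_STRIDE;
                    }
                    /* Each domain's CPUs begin on a fresh 2MB page. */
                    page += NBPDR_SKETCH;
            }

            for (int cpu = 0; cpu < NCPU; cpu++)
                    printf("cpu %d (domain %d): %#lx\n", cpu, cpu_domain[cpu],
                        (unsigned long)cpu_addr[cpu]);
            return (0);
    }

With the stride inflated like this, the fifth CPU of domain 0 spills
into a second 2MB page and domain 1 starts on its own page, mirroring
the packing behaviour described above.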