Currently we allocate a single 4KB page for each per-CPU structure and
place the pages in an array. UMA's dynamic per-CPU allocator returns arrays
of 4KB pages that mirror the base array, so per-CPU data can be
referenced relative to the location of the base pcpu structure.
However, there is no requirement that the base pcpu structures be
contiguous. This commit adds code to dynamically lay out
the base pcpu structures with the aim of using 2MB mappings for all
per-CPU data (base pcpu, DPCPU, and UMA slabs). This has the following
benefits:
1. Improved TLB efficiency. During boot a GENERIC kernel allocates
roughly 64KB of data per CPU, mostly for counter(9) and malloc(9)
stats. Currently all of that data is mapped using 4KB pages. The
amount of per-CPU data allocated during boot is only going to grow
over time, and some subsystems (e.g., pf and VFS) perform many
dynamic allocations of per-CPU data.
2. Better control over kernel memory fragmentation. Previously,
dynamically allocated per-CPU slabs were allocated directly from the
page allocator, so their placement was effectively random; since
per-CPU data structures tend to be long-lived, those randomly placed
pages contribute to long-term fragmentation of kernel memory.
3. The DPCPU indirection is removed. Previously, accessing a DPCPU
field involved an extra memory access. With this patch the DPCPU
region immediately follows the base pcpu structure, so its offset
relative to the base is known at compile time (see the sketch after
this list).
4. Proper NUMA affinity for per-CPU structures allocated early during
boot. The new code includes a bootstrap allocator that always returns
domain-correct memory. Previously, anything allocated during
SI_SUB_CPU or earlier was placed entirely in domain 0.
5. It allows the UMA per-CPU slab size limit of 4KB to be increased, if
that ever becomes useful.
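
To make benefit 3 concrete, here is a minimal sketch, assuming a
hypothetical simplified pcpu structure (the names and sizes below are
illustrative, not the kernel's actual definitions): once the DPCPU
region is known to sit directly after the pcpu structure, a DPCPU field
can be reached by adding a constant to the pcpu base instead of first
loading a per-CPU pointer.

    #include <stddef.h>
    #include <stdio.h>

    struct pcpu_sketch {
            void *pc_dynamic;   /* old scheme: pointer to a separate DPCPU region */
            char  pc_other[56]; /* stand-in for the remaining pcpu fields */
    };

    /* Old scheme: load the per-CPU pointer, then add the field offset. */
    void *
    dpcpu_field_indirect(struct pcpu_sketch *pc, size_t off)
    {
            return ((char *)pc->pc_dynamic + off);  /* extra memory access */
    }

    /*
     * New scheme: the DPCPU region starts right after the pcpu structure,
     * so a field's offset from the pcpu base is a compile-time constant.
     */
    void *
    dpcpu_field_direct(struct pcpu_sketch *pc, size_t off)
    {
            return ((char *)pc + sizeof(struct pcpu_sketch) + off);
    }

    int
    main(void)
    {
            static char backing[4096];
            struct pcpu_sketch *pc = (struct pcpu_sketch *)backing;

            pc->pc_dynamic = backing + sizeof(*pc); /* both schemes agree here */
            printf("indirect %p, direct %p\n",
                dpcpu_field_indirect(pc, 8), dpcpu_field_direct(pc, 8));
            return (0);
    }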
CPUs are grouped by domain into 2MB pages. Initially, MAXCPU * NBPDR
bytes of KVA are reserved for per-CPU data, but this is a worst case and
in practice most of it is not used. Per-CPU data is always mapped into
ranges of 2MB pages. For example, with 4 CPUs and 2 domains, the
allocator always allocates 4MB of KVA at once. Usually the KVA quantum
will be 2MB * vm_ndomains, but it may be larger if there are many CPUs
in a domain. The initial segment for a given CPU looks like this:
| pcpu 0 | dpcpu 0 | UMA per-CPU slabs ... | pcpu 1 | dpcpu 1 | ...
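
As an illustration of this layout, the following sketch computes where
each region would land within a 2MB page; the sizes are assumptions
picked for the example (in reality they are derived at boot from the
actual pcpu, DPCPU, and UMA slab requirements):

    #include <stdint.h>
    #include <stdio.h>

    #define NBPDR_SKETCH  (2ul << 20)     /* 2MB superpage size */
    #define PCPU_SIZE     4096ul          /* assumed base pcpu size */
    #define DPCPU_SIZE    (16ul * 1024)   /* assumed DPCPU region size */
    #define SLABS_SIZE    (44ul * 1024)   /* assumed room for UMA per-CPU slabs */
    #define PCPU_STRIDE   (PCPU_SIZE + DPCPU_SIZE + SLABS_SIZE)

    int
    main(void)
    {
            uintptr_t page = 0xffffffff80200000ul;  /* stand-in KVA, 2MB aligned */

            printf("CPUs per 2MB page: %lu\n", NBPDR_SKETCH / PCPU_STRIDE);
            for (unsigned cpu = 0; cpu < 2; cpu++) {
                    uintptr_t seg = page + cpu * PCPU_STRIDE;
                    printf("cpu %u: pcpu %#lx dpcpu %#lx slabs %#lx\n", cpu,
                        (unsigned long)seg,
                        (unsigned long)(seg + PCPU_SIZE),
                        (unsigned long)(seg + PCPU_SIZE + DPCPU_SIZE));
            }
            return (0);
    }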
Early during boot we allocate 2MB of physical memory for the BSP, mapped
at VM_MIN_KERNEL_ADDRESS. This is used to allocate the BSP's pcpu and
dpcpu regions. The rest is given to the UMA per-CPU bootstrap
allocator, which returns memory from this region.
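
A bump allocator over that region is enough to convey the idea; the
following is a sketch under that assumption, with made-up names (the
real bootstrap path lives in the UMA and pcpu code and is more
involved):

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    struct bootstrap_region {
            uintptr_t br_base;  /* start of the BSP's preallocated 2MB region */
            size_t    br_size;  /* total size of the region */
            size_t    br_off;   /* bytes handed out so far */
    };

    /*
     * Hand out "size" bytes at the requested power-of-two alignment, or
     * return 0 once the region is exhausted.
     */
    uintptr_t
    bootstrap_alloc(struct bootstrap_region *br, size_t size, size_t align)
    {
            uintptr_t addr;

            addr = (br->br_base + br->br_off + align - 1) &
                ~(uintptr_t)(align - 1);
            if (addr + size > br->br_base + br->br_size)
                    return (0);
            br->br_off = addr + size - br->br_base;
            return (addr);
    }

    int
    main(void)
    {
            struct bootstrap_region br = { 0xffffffff80200000ul, 2ul << 20, 0 };

            printf("pcpu  %#lx\n", (unsigned long)bootstrap_alloc(&br, 4096, 4096));
            printf("dpcpu %#lx\n", (unsigned long)bootstrap_alloc(&br, 16384, 64));
            printf("slabs %#lx\n", (unsigned long)bootstrap_alloc(&br, 4096, 4096));
            return (0);
    }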
During SI_SUB_CPU, after CPU IDs are fixed, pcpu_layout() computes the
addresses of the remaining pcpu structures, packing as many CPUs as
possible into each 2MB page. It also backs the rest of the bootstrap
region with 2MB physical pages if needed. After this point UMA's pcpu
allocator exits bootstrap mode and starts using a vmem arena to manage
KVA for CPU 0.
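
The packing itself can be pictured roughly as follows; the per-CPU
stride, domain table, and starting address are all made up for the
example, and this is not the actual pcpu_layout() implementation:

    #include <stdint.h>
    #include <stdio.h>

    #define NBPDR_SKETCH (2ul << 20)     /* 2MB superpage */
    #define PCPU_STRIDE  (512ul * 1024)  /* assumed (inflated) per-CPU segment size */
    #define NCPU         6
    #define NDOMAIN      2

    int
    main(void)
    {
            int cpu_domain[NCPU] = { 0, 0, 0, 0, 0, 1 };    /* assumed topology */
            uintptr_t cpu_addr[NCPU];
            uintptr_t page = 0xffffffff80200000ul;  /* stand-in for reserved KVA */

            for (int dom = 0; dom < NDOMAIN; dom++) {
                    uintptr_t off = 0;
                    for (int cpu = 0; cpu < NCPU; cpu++) {
                            if (cpu_domain[cpu] != dom)
                                    continue;
                            if (off + PCPU_STRIDE > NBPDR_SKETCH) {
                                    /* This 2MB page is full; start a new one. */
                                    page += NBPDR_SKETCH;
                                    off = 0;
                            }
                            cpu_addr[cpu] = page + off;
                            off += PCPU_STRIDE;
                    }
                    /* Each domain's CPUs begin on a fresh 2MB page. */
                    page += NBPDR_SKETCH;
            }

            for (int cpu = 0; cpu < NCPU; cpu++)
                    printf("cpu %d (domain %d): %#lx\n", cpu, cpu_domain[cpu],
                        (unsigned long)cpu_addr[cpu]);
            return (0);
    }

With the stride inflated like this, the fifth CPU of domain 0 spills
into a second 2MB page and domain 1 starts on its own page, mirroring
the packing behaviour described above.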