Hardware with more than 256 cores is now available and will become increasingly common.
Details
Diff Detail
- Repository
- rG FreeBSD src repository
- Lint
Lint Not Applicable - Unit
Tests Not Applicable
Event Timeline
We have #define MAX_APIC_ID 0x200 and _Static_assert(MAXCPU <= MAX_APIC_ID "MAXCPU cannot be larger that MAX_APIC_ID");
There are many MAXCPU-sized arrays that we should investigate along with this work.
Examples from the first third or so from grep -w MAXCPU:
#define NPV_LIST_LOCKS MAXCPU struct pmap_pcids pm_pcids[MAXCPU]; static struct asid asid[MAXCPU]; static uint8_t hsave[MAXCPU][PAGE_SIZE] __aligned(PAGE_SIZE); int vmxon_enabled[MAXCPU]; static char vmxon_region[MAXCPU][PAGE_SIZE] __aligned(PAGE_SIZE); long eptgen[MAXCPU]; /* cached pmap->pm_eptgen */ static struct cam_doneq cam_doneqs[MAXCPU]; cpu_core_t cpu_core[MAXCPU]; solaris_cpu_t solaris_cpu[MAXCPU]; #define NCPU MAXCPU static int64_t tsc_skew[MAXCPU]; struct dtrace_debug_data { ... } dtrace_debug_data[MAXCPU]; #define WORK_CPU_UNBOUND MAXCPU /* Structure to store portals' properties */ struct XX_PortalInfo { vm_paddr_t portal_ce_pa[2][MAXCPU]; vm_paddr_t portal_ci_pa[2][MAXCPU]; uint32_t portal_ce_size[2][MAXCPU]; uint32_t portal_ci_size[2][MAXCPU]; vm_offset_t portal_ce_va[2]; vm_offset_t portal_ci_va[2]; uintptr_t portal_intr[2][MAXCPU]; }; struct hpet_softc { ... int pcpu_slaves[MAXCPU]; ... } #define ACPI_MAX_TASKS MAX(32, MAXCPU * 4) struct sysctl_oid *sc_sysctl_cpu[MAXCPU]; bool sc_regs_mapped[MAXCPU]; /* register mapping status */ t_Handle sc_bph[MAXCPU]; /* BMAN portal handles */ struct dpaa_portals_softc { ... dpaa_portal_t sc_dp[MAXCPU]; ... }; struct qman_softc { ... bool sc_regs_mapped[MAXCPU]; ... t_Handle sc_qph[MAXCPU]; /* QMAN portal handles */ ... }; struct hv_storvsc_sysctl { ... u_long chan_send_cnt[MAXCPU]; ... }; struct storvsc_softc { ... struct vmbus_channel *hs_sel_chan[MAXCPU]; ... };
sys/netpfil/pf/pf.c:257:1: error: static_assert failed due to requirement '(1 << 8) >= 512' "compile-time assertion failed"
PFID_CPUBITS needs to be increased
Boot smoke test was successful with this additional change:
diff --git a/sys/netpfil/pf/pf.c b/sys/netpfil/pf/pf.c index 33ef5119ee3c..64d8bb92d5f0 100644 --- a/sys/netpfil/pf/pf.c +++ b/sys/netpfil/pf/pf.c @@ -250,7 +250,7 @@ VNET_DEFINE(uma_zone_t, pf_state_z); VNET_DEFINE(uma_zone_t, pf_state_key_z); VNET_DEFINE(uint64_t, pf_stateid[MAXCPU]); -#define PFID_CPUBITS 8 +#define PFID_CPUBITS 16 #define PFID_CPUSHIFT (sizeof(uint64_t) * NBBY - PFID_CPUBITS) #define PFID_CPUMASK ((uint64_t)((1 << PFID_CPUBITS) - 1) << PFID_CPUSHIFT) #define PFID_MAXID (~PFID_CPUMASK)
https://cirrus-ci.com/task/6065229367869440
I built with PFID_CPUBITS=16 before discussing with @kp; testing with PFID_CPUBITS=10 now.
I have a potential patch to remove that reliance on MAX_CPU, but go ahead and make the =10 change already if that unblocks things.
Yes, it should be counter(9) based. The existing code predates counter(9). Something pretty close to what we have to generate IPv4 header ID. Please add me to reviewers. Thanks!
I have a potential patch to remove that reliance on MAX_CPU
I'm not quite ready to commit the MAXCPU change since we should sort out the affinity syscalls first and may want to go to right to 1024, so perhaps your pf change will be in first. I have changed PFID_CPUBITS to 10 in my WIP branch, and will commit it with other MAXCPU-related changes if your work is not in the tree at that time.
See also: CPU_MAXSIZE in <sys/sys/_cpuset.h>
For MAX_APIC_ID, that may be the max non-x2APIC ID which needs to stay at 255 if so. However, the assertion about MAXCPU being <= MAX_APIC_ID is no longer relevant with x2APIC and can be removed I think. I'm not sure if we support x2APIC on i386, so maybe the assertion has to stay for i386 still? (More of a @kib question)
I don't think that really works. It'd be basically the same as the current approach, and we'd have to keep adding in CPU bits to avoid conflicts.
My current plan is to use atomic_fetchadd_64(), but that breaks on i386 (although that does have atomic_fetchadd_long(), which may suffice).
This is what I have in mind: https://reviews.freebsd.org/D36915
This does work on i386, because I was misreading the compiler error at first.
But you need to also bump MAX_APIC_ID, or else the MADT parse code will start dropping CPUs that have IDs > 512. Note that APIC IDs don't need to be continuous, so it's possible to have a system only with even APIC IDs (ie: 0, 2, 4..) at which point in order to support 512 CPUs you might need to be able to parse APIC IDs up to 1024 or more.
It seems like the MAX_APIC_ID is a safeguard, because we have some data structures that are allocated based on the max APIC ID found on the system (lapics struct in madt.c). You need to also bump MAX_APIC_ID to twice the value of MAXCPU I would think, and ideally we would like to remove the dependency on allocating structures based on the max APIC ID and only allocate based on the number of CPUs on the system, or else we are likely to waste memory.
MAX_APIC_ID increase in D37006. I propose doing that as a separate change in advance of increasing MAXCPU, as MAXCPU may be set via the kernel config file and it is currently not possible to increase it.
This review was flagged in https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=264840 as well...
lgtm, we should get this in now so we have a few months to see what breaks before 14.0.
There are at least 2 things to do before this can land imo.
- 'size' the kernel binary before/after (preferably LINT)
- check total memory use from booting single user on a small box before/after, preferably single-core. i guess diffing sysctl -a would provide enough info?
Before:
> size kernel text data bss dec hex filename 23076646 1870505 4415872 29363023 0x1c00b4f kernel > size vmm.ko text data bss dec hex filename 177636 30744 5279816 5488196 0x53be44 vmm.ko
after:
> size kernel text data bss dec hex filename 23086310 2657033 5726464 31469807 0x1e030ef kernel > size vmm.ko text data bss dec hex filename 181732 30744 11587656 11800132 0xb40e44 vmm.ko
This is GENERIC-NODEBUG.
I grabbed the kernel symbol sizes, sorted by size, and stashed them here:
https://people.freebsd.org/~markj/cpuset/sizes.cpuset.txt
https://people.freebsd.org/~markj/cpuset/sizes.main.txt
At a glance, the largest increases are bdomain (267776 -> 1054208), stoppcbs (81920 -> 327680), cc_cpu (81920 -> 327680). The size increase isn't terrible as-is, though.
vmm has a couple of static arrays (hsave and vmxon_region) which appear to be responsible for most of the vmm.ko size increase.
data 1870505 -> 2657033 gives +42%
bss 4415872 -> 5726464 gives + 29.5%
this is pretty atrocious and imo a showstopper for 1024.
what probably would make it tolerable:
- bump only to 512
- cc_cpu and stoppcbs are candidates for immediate removal -- they can land in struct pcpu and if need be can be reached through the cpuid_to_pcpu array. probably would be nice to abstract it with a macro so that one can CPUID_TO_PCPU(cpu) instead of directly derefing it.
vmm has a couple of static arrays (hsave and vmxon_region) which appear to be responsible for most of the vmm.ko size increase.
Also note technically today is the start of KBI freeze. But there is not enough time to get this in, I'm guessing it would be best to postponed said freeze. It prodded gjb about it.
The only important number is the sum, which increases by 7%, and this increase can be reduced substantially by addressing a small number of cases.
what probably would make it tolerable:
- bump only to 512
- cc_cpu and stoppcbs are candidates for immediate removal -- they can land in struct pcpu and if need be can be reached through the cpuid_to_pcpu array. probably would be nice to abstract it with a macro so that one can CPUID_TO_PCPU(cpu) instead of directly derefing it.
Moving stoppcbs this way will break kgdb. Better to simply dynamically allocate the array once we know how many APs there are.
Objects in Mark's list that grew by over 1K (symbol, old size, new size, difference):
bdomain 267776 1054208 786432
group 49216 491680 442464
stoppcbs 81920 327680 245760
cc_cpu 81920 327680 245760
pmc_tf 49152 196608 147456
static_single_cpu_mask 8192 131072 122880
cam_doneqs 32768 131072 98304
vm_reserv_object_mtx 16384 65536 49152
ktls_domains 8512 33088 24576
vmspace0 2560 8800 6240
kernel_pmap_store 2256 8496 6240
sc_kts 2048 8192 6144
dpcpu_off 2048 8192 6144
cpuid_to_pcpu 2048 8192 6144
bootstacks 2048 8192 6144
__cpu_data 1536 6144 4608
nws_array 1024 4096 3072
cpu_apic_ids 1024 4096 3072
ktls_cpuid_lookup 512 2048 1536
urgh
kernel_pmap_store is of type pmap
struct pmap { struct mtx pm_mtx; pml4_entry_t *pm_pmltop; /* KVA of top level page table */ pml4_entry_t *pm_pmltopu; /* KVA of user top page table */ uint64_t pm_cr3; uint64_t pm_ucr3; TAILQ_HEAD(,pv_chunk) pm_pvchunk; /* list of mappings in pmap */ cpuset_t pm_active; /* active on cpus */ enum pmap_type pm_type; /* regular or nested tables */ struct pmap_statistics pm_stats; /* pmap statistics */ struct vm_radix pm_root; /* spare page table pages */ long pm_eptgen; /* EPT pmap generation id */ smr_t pm_eptsmr; int pm_flags; struct pmap_pcids pm_pcids[MAXCPU]; struct rangeset pm_pkru; };
as in +6K for every running process. this needs to be patched (move the array to the end and allocate as needed? EDIT: after cursory look this would be best served with real per-cpu but i don't know if that's available at that stage)
which prompts another note: instead of just checking obj sizes this clearly needs to check all structs for size changes, but I don't know a handy one-liner to do it. maybe bloaty can?
Thus I found struct pmap is embedded in struct vmspace and that has its own zone, making this cumbersome to fix. It also turns out the zone is NOFREE. what's up with that
sys/sys/_cpuset.h | ||
---|---|---|
43 ↗ | (On Diff #112038) | why is this separate, and even if it has to be, it should CTASSERT being not less than MAXCPU |
Maybe ABI Compliance Checker (ABICC)? It's not really the intended use but it might serve the purpose.
here are size changes as found with ctfdump -t:
-_cpuset 32
+_cpuset 128
-bufdomain 33472
+bufdomain 131776
-cpu_group 64
+cpu_group 160
-cpu_offset 48
+cpu_offset 144
-cpuset 104
+cpuset 200
-hhook_head 176
+hhook_head 272
-hn_softc 1440
+hn_softc 1536
-hpet_softc 39112
-hpet_timer 1216
+hpet_softc 137416
+hpet_timer 4288
-hv_storvsc_sysctl 2072
+hv_storvsc_sysctl 8216
-iflib_ctx 1008
+iflib_ctx 1104
-ktls_domain_info 1064
+ktls_domain_info 4136
-nl_control 216
+nl_control 312
-osd_master 192
+osd_master 288
-pde_action 72
+pde_action 168
-pmap 2256
+pmap 8496
-radix_node_head 312
+radix_node_head 408
-rib_head 616
+rib_head 712
-rmlock 96
+rmlock 192
-rmslock_ipi 40
+rmslock_ipi 136
-smp_rendezvous_cpus_retry_arg 32
+smp_rendezvous_cpus_retry_arg 128
-storvsc_softc 5384
+storvsc_softc 17672
-taskqgroup 6192
+taskqgroup 24624
-topo_node 104
+topo_node 200
-unhop_ctl 112
+unhop_ctl 208
-vfs_op_barrier_ipi 40
+vfs_op_barrier_ipi 136
-vmbus_softc 33152
+vmbus_softc 131456
-vmspace 2560
+vmspace 8800
I ran into this handsome fella in sched_4bsd:
#ifdef SMP /* * Per-CPU run queues */ static struct runq runq_pcpu[MAXCPU]; long runq_length[MAXCPU]; static cpuset_t idle_cpus_mask; #endif
this should probably error out and suggest reducing maxcpu
sys/sys/_cpuset.h | ||
---|---|---|
43 ↗ | (On Diff #112038) | This is the size used for cpuset_t in userspace. The kernel always uses MAXCPU via the above. |
As it happens there is one vm stat emitted by Cirrus-CI on each build:
main | https://cirrus-ci.com/task/5111357406707712?logs=test#L779 | vm.stats.vm.v_wire_count: 6433 |
MAXCPU=1024 | https://cirrus-ci.com/task/6123897498632192?logs=test#L779 | vm.stats.vm.v_wire_count: 6571 |
That's over 2% bump for nothing.
sysctl vm dump from these would be prudent to find out who is eating that memory.
that aside, @markj:
#define VM_RESERV_OBJ_LOCK_COUNT MAXCPU struct mtx_padalign vm_reserv_object_mtx[VM_RESERV_OBJ_LOCK_COUNT];
sounds like this should be allocated on boot instead? bonus points if it becomes numa-aware.
rebase after increasing userland cpuset size independently (76887e84be975698b14699d7d0dfb157d73e9990)
Had another look at v_wire_count in QEMU smoke boot just now:
1723e7f33ae | https://cirrus-ci.com/task/6175459847700480?logs=test#L777 | vm.stats.vm.v_wire_count: 5400 |
MAXCPU=1024 | https://cirrus-ci.com/task/5800807099006976?logs=test#L1017 | vm.stats.vm.v_wire_count: 5551 |
Some more low-hanging fruit: pmc_tf, static_single_cpu_mask (linuxkpi), cam_doneqs, ktls_domain_info, pending_locks (witness), vm_reserv_object_mtx. The differences account for 108 pages.