The bhyve acpi MADT table was given a static space of 256 (0x100) bytes, this is enough space to allow VM_MAXCPU to be 21, this patch fixes that so VM_MAXCPU can be of arbitrary value and not overflow the space by actually calculating the space needed for the table.
This is a fix for PR212782
Details
Compile vmm.ko, and related code (libvmmapi, bhyve and bhyveload) with VM_MAXCPU=24 and boot a VM, note that the panic in virtualization maillist no longer occurs
Diff Detail
- Repository
- rS FreeBSD src repository - subversion
- Lint
Lint Not Applicable - Unit
Tests Not Applicable
Event Timeline
It doesn't looks quite right! I need more time to articulate better arguments, but this patch as it is, must not be committed in.
usr.sbin/bhyve/acpi.c | ||
---|---|---|
80 ↗ | (On Diff #52754) | Gratuitous change that doesn't add anything. Mind let it be like was before? |
usr.sbin/bhyve/acpi.c | ||
---|---|---|
87 ↗ | (On Diff #52754) | Document this magic formula, that information shall be needed later when this whole table is built at run time. |
I have applied these related VM_MAXCPU patches to 12.0R and rebuilt the related components.
https://reviews.freebsd.org/D18815
https://reviews.freebsd.org/D18816
https://reviews.freebsd.org/D18846
https://reviews.freebsd.org/D18755
It appears we have an ACPI limit of 65 sockets that is unrelated to the VM_MAXCPU limit.
Using a mixture of sockets, cores, and threads such as:
"-c sockets=1,cores=255,threads=1" or "-c sockets=8,cores=8,threads=2"
I could reach 255 vCPUS with FreeBSD 10.4, 11.1R, and 12.0R:
ACPI APIC Table: <BHYVE BVMADT >
FreeBSD/SMP: Multiprocessor System Detected: 255 CPUs
FreeBSD/SMP: 1 package(s) x 255 core(s)
sysctl hw.ncpu
hw.ncpu: 255
I could reach 192 vCPUs with NetBSD, CentOS 7, and Debian 9.4.0, and 64 vCPUs with OpenBSD as hard-defined:
/arch/amd64/include/cpu.h:#define MAXCPUS 64 /* bitmask */
NetBSD and Linux would drop to 1 CPU when their limit is exceeded while FreeBSD will throw errors: (Not sure which behavior is preferred.)
With "sockets=1,cores=257,threads=1"
Booting /tmp/bhyve.BHl9LfJ 1812: [0001] Processor ID : 100
Error 6303 - Integer too large for target ^ (0000000000000100 - max 1 bytes) /tmp/bhyve.BHl9LfJ 1813: [0001] Local Apic ID : 100 Error 6303 - Integer too large for target ^ (0000000000000100 - max 1 bytes) Assertion failed: (error == 0), function main, file /usr/src/usr.sbin/bhyve/bhyverun.c, line 1115. Abort trap (core dumped)
This reminds us that we have a wrapping issue on our error messages that are unrelated to these patches.
I will continue testing but thus far, the 255 vCPU limit appears from bhyve's construction of the ACPI table, which is external to these patches.
Please send your testing ideas, especially related to the concerns raised.
To use these patches, first increase VM_MAXCPU from 16 in:
/usr/src/sys/amd64/include/vmm.h
/usr/include/machine/vmm.h
I used 1024 in my testing.
My my Phabricator has its quirks. Sorry for the extra message.
I did a quick MAXCPU increase test on 12.0R based on comments by cpersiva@ but the results were neither positive nor directly related to these patches.
usr.sbin/bhyve/acpi.c | ||
---|---|---|
42 ↗ | (On Diff #52754) | We need to update this table. In particular, you cannot overflow beyond the 1MB boundary. Legacy BIOS boot loaders assume that RAM starting at 1MB is free and will overwrite it. FreeBSD's boot loader does this (though bhyveload works around this by accident). FreeBSD's kernel is always located at 2MB (PAE/amd64) or 4MB (i386), though older FreeBSD kernels ran at 1MB (early 4.x). Other OS's may make similar assumptions (e.g. Linux). EFI doesn't suffer from this as EFI provides an explicitly memory map and manages all of RAM. However, for the non-EFI case (which is what these tables are for), we can't overflow the 1MB boundary. We should probably just pick a vCPU limit for non-EFI booting that keeps us under 1MB and tell folks who want more CPUs to use EFI. |
80 ↗ | (On Diff #52754) | Well, size constants are being added, but the new comment does need to be wrapped properly. (I use M-q in Emacs to rewrap comments) |
usr.sbin/bhyve/acpi.c | ||
---|---|---|
42 ↗ | (On Diff #52754) | This is a static table for now at compile time, I shall add a CTASSERT that makes sure we do not over flow the space reserved for it. When this table is built at runtime that well need to become a runtime check. |
You can't usefully go above 255 CPUs without a lot more work. The MADT table requires separate entry types for APIC IDs > 254 (the field is 8 bits). However, there is a much larger issue beyond that which is that interrupts are routed with 8 bit APIC IDs (0 - 254 as well) in I/O APIC interrupt pin registers and MSI interrupt data fields. To deal with APIC IDs > 254, an I/O MMU is required that includes an interrupt remapping table that maps (8-bit APIC ID, IDT vector) -> (32-bit x2APIC ID, IDT vector). Any OS that can support more than 255 CPUs such as FreeBSD or Linux will require this. This means we would need to provide an emulated I/O MMU as part of bhyve for guest OS's to use. This represents a fairly large bit of work and is just the way x86 works. In addition, the work is more complicated than just remapping cookies as I'm not sure how it interacts with hardware virtualization exceptions like interrupt posting that try to bypass the VMM and inject pass-through interrupts directly into a guest. It's probably doable, but it's a very nontrivial amount of work, and is probably not worth doing until there's a compelling use case for more than 255 CPUs. Especially since bhyve doesn't handle cases where the number of vCPUs is > number of cores very well.
usr.sbin/bhyve/acpi.c | ||
---|---|---|
42 ↗ | (On Diff #52754) | The total size of this table is on the order of a few kbytes, it has room for 54kbytes, I have added a max of (254 * 8)==2032 bytes, the table is no more in danger of overflowing now than it was before. The potential overflow bug existed before, though near impossible to occur. However I shall in the future when this ACPI table generation gets properly rewritten include a check for overflow of the 1MB limit as I plan to remove all these hard coded There is no reason to just pick a maxcpu limit for the ACPI case, the code has already been tested to 4 x the limit that we are going to have to impose because of the APIC id being 8 bit. |
Please comment and provide a reason for blocking this, or remove your blocking status.
Thanks,
Rod
usr.sbin/bhyve/acpi.c | ||
---|---|---|
42 ↗ | (On Diff #52754) | Ok, that sounds good. I think we just want bhyve to exit with a suitable error (you have too many CPUs or something else) if we generate a DSDT that overflows 1MB. |
usr.sbin/bhyve/acpi.c | ||
---|---|---|
87 ↗ | (On Diff #52754) | Much appreciated reference material |