
vmm: Add support for specifying NUMA configuration
Needs Review · Public

Authored by bnovkov on Mar 30 2024, 4:33 PM.
Details

Reviewers
jhb
corvink
markj
Group Reviewers
bhyve
Summary

This patch adds the kernelspace bits required to support NUMA domains in bhyve VMs.

The way system memory segments are laid out and created has been reworked.
Each guest NUMA domain now gets its own memory segment.
Furthermore, memory for a given guest domain can now be allocated from a specific physical NUMA domain on the host.
Only the DOMAINSET_PREF() allocation policy is supported for now.
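
As a purely illustrative sketch (not code from this revision), this is roughly how the backing VM object of a guest domain's memseg could be given a DOMAINSET_PREF() policy so that its pages come from a chosen host NUMA domain; the helper name is made up for illustration.

```c
/*
 * Illustrative sketch only -- not code from this revision.  The helper
 * name is hypothetical.
 */
#include <sys/param.h>
#include <sys/lock.h>
#include <sys/rwlock.h>
#include <sys/domainset.h>

#include <vm/vm.h>
#include <vm/vm_object.h>

static void
memseg_set_host_domain(vm_object_t obj, int host_domain)
{
	/*
	 * DOMAINSET_PREF() yields a "prefer this domain" policy: the page
	 * allocator tries 'host_domain' first and falls back to other
	 * domains if it is exhausted.  Ideally this is done right after
	 * the object is created, before any pages are allocated.
	 */
	VM_OBJECT_WLOCK(obj);
	obj->domain.dr_policy = DOMAINSET_PREF(host_domain);
	VM_OBJECT_WUNLOCK(obj);
}
```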

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

It's not clear to me why we don't extend the vm_memmap structure instead.

Stepping back for a second, the goal of this patch is not really clear to me. I can see two possibilities:

  • We want to create a fake NUMA topology, e.g., to make it easier to use bhyve to test NUMA-specific features in guest kernels.
  • We want some way to have bhyve/vmm allocate memory from multiple physical NUMA domains on the host, and pass memory affinity information to the guest. In that case, vmm itself needs to ensure, for example, that the VM object for a given memseg has the correct NUMA allocation policy.

I think this patch ignores the second goal and makes it harder to implement in the future. It also appears to assume that each domain can be described with a single PA range, and I don't really understand why vmm needs to know the CPU affinity of each domain.

IMO a better approach would be to start by finding a way to assign a domain ID to each memory segment. This might require extending some existing interfaces in libvmmapi, particularly vm_setup_memory().
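
Purely as an illustration of that suggestion (none of these names exist in libvmmapi today), an extended userspace interface might look roughly like this: a per-domain descriptor handed to a hypothetical setup routine that would create one memseg per guest domain and record its preferred host domain.

```c
/* Hypothetical illustration only -- not part of libvmmapi. */
#include <sys/types.h>

#include <vmmapi.h>

struct guest_domain_desc {
	size_t	gdd_size;		/* guest memory in this domain, bytes */
	int	gdd_host_domain;	/* preferred host NUMA domain, or -1 */
};

/* Hypothetical extension of vm_setup_memory(). */
int	vm_setup_memory_domains(struct vmctx *ctx,
	    const struct guest_domain_desc *doms, int ndoms,
	    enum vm_mmap_style vms);
```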

> It's not clear to me why we don't extend the vm_memmap structure instead.
>
> Stepping back for a second, the goal of this patch is not really clear to me. I can see two possibilities:
>
>   • We want to create a fake NUMA topology, e.g., to make it easier to use bhyve to test NUMA-specific features in guest kernels.
>   • We want some way to have bhyve/vmm allocate memory from multiple physical NUMA domains on the host, and pass memory affinity information to the guest. In that case, vmm itself needs to ensure, for example, that the VM object for a given memseg has the correct NUMA allocation policy.
>
> I think this patch ignores the second goal and makes it harder to implement in the future.

You're right: the primary goal was to have a way of faking NUMA topologies in a guest for kernel-testing purposes. I did consider the second goal, but ultimately decided to focus on the "fake" bits first and implement the rest in a separate patch.
I'll rework the patch so that it covers both goals.

> It also appears to assume that each domain can be described with a single PA range, and I don't really understand why vmm needs to know the CPU affinity of each domain.

I'm not that happy about specifying PA ranges directly. The only alternative I could think of is to let the user specify the amount of memory per domain and have bhyve work out the PA ranges; do you think that would be a saner approach? A sketch of that alternative follows below.
As for the CPU affinities, these are needed to build the SRAT, but that can be done purely from userspace. I kept them in vmm in case we ever want to fetch NUMA topology information through bhyvectl, but I suppose that information can be obtained from the guest itself. I'll remove the cpusets.
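
To make the "size per domain" alternative concrete, here is a hypothetical userspace sketch that derives guest PA ranges from per-domain sizes; a real implementation would also have to work around the hole below 4 GiB that bhyve reserves for devices, which this deliberately ignores.

```c
/* Hypothetical sketch of deriving guest PA ranges from per-domain sizes. */
#include <stdint.h>

struct domain_range {
	uint64_t	dr_gpa_base;	/* guest physical base address */
	uint64_t	dr_len;		/* length in bytes */
};

static void
layout_guest_domains(const uint64_t *sizes, int ndoms,
    struct domain_range *out)
{
	uint64_t gpa = 0;

	/* Lay the domains out back to back, starting at guest PA 0. */
	for (int i = 0; i < ndoms; i++) {
		out[i].dr_gpa_base = gpa;
		out[i].dr_len = sizes[i];
		gpa += sizes[i];
	}
}
```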

bnovkov edited the summary of this revision.

Reworked patch and updated summary.