
vmm: Add support for specifying NUMA configuration
Needs Review · Public

Authored by bnovkov on Mar 30 2024, 4:33 PM.
Details

Reviewers
jhb
corvink
markj
Group Reviewers
bhyve
Summary

This patch adds the kernelspace bits required to support NUMA domains in bhyve VMs.

The way system memory segments are laid out and created has been reworked.
Each guest NUMA domain now gets its own memory segment.
Furthermore, memory for a given guest domain can now be allocated from a specific physical NUMA domain on the host.
Only the DOMAINSET_PREF() allocation policy is supported for now.
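
As a purely illustrative sketch (not code from this revision), this is roughly how the backing VM object of a guest domain's memseg could be given a DOMAINSET_PREF() policy so that its pages come from a chosen host NUMA domain; the helper name is made up for illustration.

```c
/*
 * Illustrative sketch only -- not code from this revision.  The helper
 * name is hypothetical.
 */
#include <sys/param.h>
#include <sys/lock.h>
#include <sys/rwlock.h>
#include <sys/domainset.h>

#include <vm/vm.h>
#include <vm/vm_object.h>

static void
memseg_set_host_domain(vm_object_t obj, int host_domain)
{
	/*
	 * DOMAINSET_PREF() yields a "prefer this domain" policy: the page
	 * allocator tries 'host_domain' first and falls back to other
	 * domains if it is exhausted.  Ideally this is done right after
	 * the object is created, before any pages are allocated.
	 */
	VM_OBJECT_WLOCK(obj);
	obj->domain.dr_policy = DOMAINSET_PREF(host_domain);
	VM_OBJECT_WUNLOCK(obj);
}
```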

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

It's not clear to me why we don't extend the vm_memmap structure instead.

Stepping back for a second, the goal of this patch is not really clear to me. I can see two possibilities:

  • We want to create a fake NUMA topology, e.g., to make it easier to use bhyve to test NUMA-specific features in guest kernels.
  • We want some way to have bhyve/vmm allocate memory from multiple physical NUMA domains on the host, and pass memory affinity information to the guest. In that case, vmm itself needs to ensure, for example, that the VM object for a given memseg has the correct NUMA allocation policy.

I think this patch ignores the second goal and makes it harder to implement in the future. It also appears to assume that each domain can be described with a single PA range, and I don't really understand why vmm needs to know the CPU affinity of each domain.

IMO a better approach would be to start by finding a way to assign a domain ID to each memory segment. This might require extending some existing interfaces in libvmmapi, particularly vm_setup_memory().
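
Purely as an illustration of that suggestion (none of these names exist in libvmmapi today), an extended userspace interface might look roughly like this: a per-domain descriptor handed to a hypothetical setup routine that would create one memseg per guest domain and record its preferred host domain.

```c
/* Hypothetical illustration only -- not part of libvmmapi. */
#include <sys/types.h>

#include <vmmapi.h>

struct guest_domain_desc {
	size_t	gdd_size;		/* guest memory in this domain, bytes */
	int	gdd_host_domain;	/* preferred host NUMA domain, or -1 */
};

/* Hypothetical extension of vm_setup_memory(). */
int	vm_setup_memory_domains(struct vmctx *ctx,
	    const struct guest_domain_desc *doms, int ndoms,
	    enum vm_mmap_style vms);
```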

> It's not clear to me why we don't extend the vm_memmap structure instead.
>
> Stepping back for a second, the goal of this patch is not really clear to me. I can see two possibilities:
>
>   • We want to create a fake NUMA topology, e.g., to make it easier to use bhyve to test NUMA-specific features in guest kernels.
>   • We want some way to have bhyve/vmm allocate memory from multiple physical NUMA domains on the host, and pass memory affinity information to the guest. In that case, vmm itself needs to ensure, for example, that the VM object for a given memseg has the correct NUMA allocation policy.
>
> I think this patch ignores the second goal and makes it harder to implement in the future.

You're right: the primary goal was to have a way of faking NUMA topologies in a guest for kernel-testing purposes. I did consider the second goal, but ultimately decided to focus on the "fake" bits first and implement the rest in a separate patch.
I'll rework the patch so that it covers both goals.

> It also appears to assume that each domain can be described with a single PA range, and I don't really understand why vmm needs to know the CPU affinity of each domain.

I'm not that happy about specifying PA ranges directly. The only alternative I could think of is to let the user specify the amount of memory per domain and have bhyve work out the PA ranges; do you think that would be a saner approach? A sketch of that alternative follows below.
As for the CPU affinities, these are needed to build the SRAT, but that can be done purely from userspace. I kept them in vmm in case we ever want to fetch NUMA topology information through bhyvectl, but I suppose that information can be obtained from the guest itself. I'll remove the cpusets.
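
To make the "size per domain" alternative concrete, here is a hypothetical userspace sketch that derives guest PA ranges from per-domain sizes; a real implementation would also have to work around the hole below 4 GiB that bhyve reserves for devices, which this deliberately ignores.

```c
/* Hypothetical sketch of deriving guest PA ranges from per-domain sizes. */
#include <stdint.h>

struct domain_range {
	uint64_t	dr_gpa_base;	/* guest physical base address */
	uint64_t	dr_len;		/* length in bytes */
};

static void
layout_guest_domains(const uint64_t *sizes, int ndoms,
    struct domain_range *out)
{
	uint64_t gpa = 0;

	/* Lay the domains out back to back, starting at guest PA 0. */
	for (int i = 0; i < ndoms; i++) {
		out[i].dr_gpa_base = gpa;
		out[i].dr_len = sizes[i];
		gpa += sizes[i];
	}
}
```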

bnovkov edited the summary of this revision.

Reworked patch and updated summary.