Page MenuHomeFreeBSD

bhyve, DMAR: integrate
Needs ReviewPublic

Authored by kib on Jul 14 2020, 8:13 PM.
Tags
None
Referenced Files
F108480145: D25672.diff
Sat, Jan 25, 8:59 AM
Unknown Object (File)
Mon, Jan 20, 9:00 AM
Unknown Object (File)
Sat, Jan 18, 12:58 AM
Unknown Object (File)
Sat, Jan 18, 12:07 AM
Unknown Object (File)
Fri, Jan 10, 2:13 PM
Unknown Object (File)
Fri, Jan 10, 1:37 PM
Unknown Object (File)
Tue, Jan 7, 10:08 AM
Unknown Object (File)
Sun, Jan 5, 6:51 AM

Details

Reviewers
br
Group Reviewers
bhyve
Summary

This is WIP branch, I published it for some discussions related to DMAR code refactoring.

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

kib requested review of this revision.Jul 14 2020, 8:14 PM
sys/x86/iommu/intel_dmar.h
296

Some of these flags are used by generic IOMMU busdma backend:
IOMMU_MF_CANWAIT/CANSPLIT passed to iommu_map()

should we move the flags to sys/iommu.h ?

sys/x86/iommu/intel_dmar.h
296

I think either all should be moved, or none. I am fine with the move.

That said, after this merge I do not like sys/iommu.h idea. I think now that it should be sys/dev/iommu where MI files are put.

sys/amd64/vmm/vmm.c
941–943

I'm a bit surprised at how this code obtains the hpa. If not for the fact that the memory is wired, what follows would be racy. And more to my real point, this code already knows about vm objects, so why doesn't it just iterate over the vm object's resident page list? Then, it could look for m->psind > 0 to increase the granularity of the mapping changes.

sys/amd64/vmm/vmm.c
941–943

But the pages must be wired for pass-through to work, at least in the current state of the driver and the level of supported hardware. In principle, newer VT-d spec allows hardware to report non-fatal faults which can be handled by the DMAR driver. In fact, Mellanox/Nvidia ConnectX networking cards has similar facilities as well, but our RDMA/IB stack does not use it.

Looking into objects would make an assumption about the structure of the vmspace, for instance we must not have shadowing. This is true right now, but I suspect that stuff like snapshots invalidates the assumption (not sure at what state the snapshot code is).

Do you want to resume work on this, and finish some pieces and commit them?

sys/amd64/vmm/vmm.c
941–943

Elaborating on my comment, I was surprised that the code wires the pages a second time to get the hpa. Note how the second wiring is immediately released before the hpa is calculated.

In regards to Mellanox/Nvidia and linuxkpi, I fear that we'll have to support MMU notifiers at some point, which will require changes sprinkled throughout the virtual memory system.

kib marked 3 inline comments as done.Jul 29 2022, 11:16 AM
In D25672#816160, @alc wrote:

Do you want to resume work on this, and finish some pieces and commit them?

I hope so, but I am not sure I can do anything that large ATM. Do you have specific proposals about already useful pieces?

sys/amd64/vmm/vmm.c
941–943

You mean the mere call to vm_fault_quick_hold_pages(), which is several lines later ends up in vm_page_unhold_pages(), after iommu_create_mapping()? And the first wire is due to vm being used for pass-through?

I think this is a minor overhead, in fact.

WRT to MMU notifiers, do you mean some kind of callback when the page becomes invalid or reclaimed?

In D25672#817019, @kib wrote:
In D25672#816160, @alc wrote:

Do you want to resume work on this, and finish some pieces and commit them?

I hope so, but I am not sure I can do anything that large ATM. Do you have specific proposals about already useful pieces?

While nothing will use it at the instant that you commit it, I am going to suggest that you peel off iommu_gas_remove() and commit it. Then, as I make further changes to iommu_gas, it will be easier for me to take into account the parts for eventually supporting bhyve.

kib marked an inline comment as done.

Rebase after iommu_gas_remove() commit.

Yes it is WIP now I hope, instead of just abandonware.

There is no OS driver for AMD IOMMU so the referenced bugs are about AMD 'hack' driver currently existing in bhyve.

jhb added inline comments.
sys/amd64/vmm/vmm.c
1016

Now that individual mapping operations can fail, don't you need to keep track of pending changes and undo them (e.g. remove new mappings if the Nth call to iommu_create_mapping() fails)?

1066

This is no longer paired? That is, the old code does this only when the first device is assigned, but now you do it each time a device is assigned, yet vm_unassign_pptdev only does the unmap when the last ppt device is unassigned?

sys/dev/iommu/iommu.h
140

Can we perhaps not make these names DMAR-specific? Presumably we will need a similar notion of separate domains for pass through on arm64, and also when we have an AMD IOMMU driver.

sys/x86/iommu/intel_dmar.h
282

Can we provide a wrapper for this in the iommu.h interface? Ideally the new iommu backend for vmm.ko would just use APIs from iommu.h and eventually once we have an AMD IOMMU driver we would retire the iommu abstraction in vmm.ko directly and just use iommu.h directly in vmm.ko.

usr.sbin/bhyve/pci_passthru.c
560

Do you have a device with a BAR smaller than a page that you are using with pass through?

kib marked 2 inline comments as done.Dec 19 2023, 10:55 AM
kib added inline comments.
sys/dev/iommu/iommu.h
140

Already done, thank you for noting.

sys/x86/iommu/intel_dmar.h
282

I am not sure I follow your proposal there.

ATM domains+contexts are DMAR-specific right now, and in the scope of this patch. The plan is to develop it into the common substrate useful both for DMAR and AMD IOMMU. But I am not there yet, I need to correct the current patch to provide something reasonable and working for Intel first (I believe).

usr.sbin/bhyve/pci_passthru.c
560

When I tested this many years ago I did it with passthru of ahci, on something like Skylake. Even today, looking at my w/s Kabylake, I do see

ahci0@pci0:0:23:0:   class=0x010601 rev=0x21 hdr=0x00 vendor=0x8086 device=0x9d03 subvendor=0x1458 subdevice=0x1000
    vendor     = 'Intel Corporation'
    device     = 'Sunrise Point-LP SATA Controller [AHCI mode]'
    class      = mass storage
    subclass   = SATA
    bar   [10] = type Memory, range 32, base 0xdf348000, size 8192, enabled
    bar   [14] = type Memory, range 32, base 0xdf34c000, size 256, enabled
    bar   [18] = type I/O Port, range 32, base 0xf090, size 8, enabled
    bar   [1c] = type I/O Port, range 32, base 0xf080, size 4, enabled
    bar   [20] = type I/O Port, range 32, base 0xf060, size 32, enabled
    bar   [24] = type Memory, range 32, base 0xdf34b000, size 2048, enabled
    cap 05[80] = MSI supports 1 message enabled with 1 message

And these are obviously not MSI(X) bars.

usr.sbin/bhyve/pci_passthru.c
560

Hummm. I wonder if we really think it's ok to pass through the entire page to the guest in this case as it means exposing BARs of other devices to the guest (probably a bad idea) or if instead we want to treat this case like the MSI-X table where we intercept reads and writes to this BAR instead. This change should probably be a separate change in its own right though and not related to DMAR IMO.

kib marked 2 inline comments as done.Dec 19 2023, 6:51 PM
kib added inline comments.
usr.sbin/bhyve/pci_passthru.c
560