This is WIP branch, I published it for some discussions related to DMAR code refactoring.
Diff Detail
- Repository
- rG FreeBSD src repository
- Lint
Lint Skipped - Unit
Tests Skipped
Event Timeline
sys/x86/iommu/intel_dmar.h | ||
---|---|---|
296 | Some of these flags are used by generic IOMMU busdma backend: should we move the flags to sys/iommu.h ? |
sys/x86/iommu/intel_dmar.h | ||
---|---|---|
296 | I think either all should be moved, or none. I am fine with the move. That said, after this merge I do not like sys/iommu.h idea. I think now that it should be sys/dev/iommu where MI files are put. |
sys/amd64/vmm/vmm.c | ||
---|---|---|
941–943 | I'm a bit surprised at how this code obtains the hpa. If not for the fact that the memory is wired, what follows would be racy. And more to my real point, this code already knows about vm objects, so why doesn't it just iterate over the vm object's resident page list? Then, it could look for m->psind > 0 to increase the granularity of the mapping changes. |
sys/amd64/vmm/vmm.c | ||
---|---|---|
941–943 | But the pages must be wired for pass-through to work, at least in the current state of the driver and the level of supported hardware. In principle, newer VT-d spec allows hardware to report non-fatal faults which can be handled by the DMAR driver. In fact, Mellanox/Nvidia ConnectX networking cards has similar facilities as well, but our RDMA/IB stack does not use it. Looking into objects would make an assumption about the structure of the vmspace, for instance we must not have shadowing. This is true right now, but I suspect that stuff like snapshots invalidates the assumption (not sure at what state the snapshot code is). |
Do you want to resume work on this, and finish some pieces and commit them?
sys/amd64/vmm/vmm.c | ||
---|---|---|
941–943 | Elaborating on my comment, I was surprised that the code wires the pages a second time to get the hpa. Note how the second wiring is immediately released before the hpa is calculated. In regards to Mellanox/Nvidia and linuxkpi, I fear that we'll have to support MMU notifiers at some point, which will require changes sprinkled throughout the virtual memory system. |
I hope so, but I am not sure I can do anything that large ATM. Do you have specific proposals about already useful pieces?
sys/amd64/vmm/vmm.c | ||
---|---|---|
941–943 | You mean the mere call to vm_fault_quick_hold_pages(), which is several lines later ends up in vm_page_unhold_pages(), after iommu_create_mapping()? And the first wire is due to vm being used for pass-through? I think this is a minor overhead, in fact. WRT to MMU notifiers, do you mean some kind of callback when the page becomes invalid or reclaimed? |
While nothing will use it at the instant that you commit it, I am going to suggest that you peel off iommu_gas_remove() and commit it. Then, as I make further changes to iommu_gas, it will be easier for me to take into account the parts for eventually supporting bhyve.
Is this still WIP? And could it be related to https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=270966 and https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=266325?
Yes it is WIP now I hope, instead of just abandonware.
There is no OS driver for AMD IOMMU so the referenced bugs are about AMD 'hack' driver currently existing in bhyve.
sys/amd64/vmm/vmm.c | ||
---|---|---|
1016 | Now that individual mapping operations can fail, don't you need to keep track of pending changes and undo them (e.g. remove new mappings if the Nth call to iommu_create_mapping() fails)? | |
1066 | This is no longer paired? That is, the old code does this only when the first device is assigned, but now you do it each time a device is assigned, yet vm_unassign_pptdev only does the unmap when the last ppt device is unassigned? | |
sys/dev/iommu/iommu.h | ||
140 | Can we perhaps not make these names DMAR-specific? Presumably we will need a similar notion of separate domains for pass through on arm64, and also when we have an AMD IOMMU driver. | |
sys/x86/iommu/intel_dmar.h | ||
282 | Can we provide a wrapper for this in the iommu.h interface? Ideally the new iommu backend for vmm.ko would just use APIs from iommu.h and eventually once we have an AMD IOMMU driver we would retire the iommu abstraction in vmm.ko directly and just use iommu.h directly in vmm.ko. | |
usr.sbin/bhyve/pci_passthru.c | ||
560 | Do you have a device with a BAR smaller than a page that you are using with pass through? |
sys/dev/iommu/iommu.h | ||
---|---|---|
140 | Already done, thank you for noting. | |
sys/x86/iommu/intel_dmar.h | ||
282 | I am not sure I follow your proposal there. ATM domains+contexts are DMAR-specific right now, and in the scope of this patch. The plan is to develop it into the common substrate useful both for DMAR and AMD IOMMU. But I am not there yet, I need to correct the current patch to provide something reasonable and working for Intel first (I believe). | |
usr.sbin/bhyve/pci_passthru.c | ||
560 | When I tested this many years ago I did it with passthru of ahci, on something like Skylake. Even today, looking at my w/s Kabylake, I do see ahci0@pci0:0:23:0: class=0x010601 rev=0x21 hdr=0x00 vendor=0x8086 device=0x9d03 subvendor=0x1458 subdevice=0x1000 vendor = 'Intel Corporation' device = 'Sunrise Point-LP SATA Controller [AHCI mode]' class = mass storage subclass = SATA bar [10] = type Memory, range 32, base 0xdf348000, size 8192, enabled bar [14] = type Memory, range 32, base 0xdf34c000, size 256, enabled bar [18] = type I/O Port, range 32, base 0xf090, size 8, enabled bar [1c] = type I/O Port, range 32, base 0xf080, size 4, enabled bar [20] = type I/O Port, range 32, base 0xf060, size 32, enabled bar [24] = type Memory, range 32, base 0xdf34b000, size 2048, enabled cap 05[80] = MSI supports 1 message enabled with 1 message And these are obviously not MSI(X) bars. |
usr.sbin/bhyve/pci_passthru.c | ||
---|---|---|
560 | Hummm. I wonder if we really think it's ok to pass through the entire page to the guest in this case as it means exposing BARs of other devices to the guest (probably a bad idea) or if instead we want to treat this case like the MSI-X table where we intercept reads and writes to this BAR instead. This change should probably be a separate change in its own right though and not related to DMAR IMO. |
usr.sbin/bhyve/pci_passthru.c | ||
---|---|---|
560 | The per-commit split of this review can be seen at https://kib.kiev.ua/git/gitweb.cgi?p=deviant3.git;a=shortlog;h=refs/heads/dmar_bhyve_integ |