Remove a wmb() that's not necessary.

bus_dmamap_sync() is supposed to guarantee that memory prepared for PREWRITE can be DMA'd immediately after it returns.

For non-x86 platforms, bus_dmamap_sync() ensures that all writes to the command buffer have been posted well enough for the device to initiate DMA from that memory and see its contents. They all have a memory fence of the appropriate strength.

For x86 platforms, the memory ordering is already strong enough. Once memory is written, the device DMA triggered by the subsequent write to the uncached BAR will see its contents. As such, we don't need the wmb() here. It translates to an sfence, which is only needed for writes to regions that have the write-combining attribute set. The nvme driver does no such writes.

Now that x86's bus_dmamap_sync() includes a __compiler_membar(), we can be assured the optimizer won't reorder the bus_dmamap_sync() and bus_space_write operations.
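
The ordering argument above can be modeled in user space. This is only a sketch under stated assumptions, not the nvme driver's code: model_queue, model_sync_prewrite() and model_submit() are hypothetical names, a C11 release fence stands in for bus_dmamap_sync(..., BUS_DMASYNC_PREWRITE), and a volatile store stands in for the bus_space_write to the doorbell.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

#define QUEUE_DEPTH 4

/* Hypothetical command layout, for illustration only. */
struct cmd {
	uint32_t opcode;
	uint32_t cid;
};

struct model_queue {
	struct cmd ring[QUEUE_DEPTH];	/* stands in for DMA-able command memory */
	volatile uint32_t doorbell;	/* stands in for the uncached BAR register */
	uint32_t tail;
};

/* Assumed stand-in for bus_dmamap_sync(..., BUS_DMASYNC_PREWRITE). */
static void
model_sync_prewrite(void)
{
	atomic_thread_fence(memory_order_release);
}

/* Copy a command into the ring, sync, then ring the doorbell. */
static uint32_t
model_submit(struct model_queue *q, const struct cmd *c)
{
	assert(q->tail < QUEUE_DEPTH);
	memcpy(&q->ring[q->tail], c, sizeof(*c));
	q->tail = (q->tail + 1) % QUEUE_DEPTH;

	model_sync_prewrite();
	/* No extra wmb() here: the sync above already orders the stores. */
	q->doorbell = q->tail;
	return (q->doorbell);
}
```

The point of the sketch is the order of operations: ordinary stores to the command memory, then the sync, then the doorbell write, with nothing in between.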
and
Annotate bus_dmamap_sync() with fence

Add an explicit release thread fence before returning from bus_dmamap_sync(). This should be a no-op in practice, but makes explicit that all ordinary stores will be completed before subsequent reads/writes to ordinary device memory.

There is one exception: if you've mapped memory as write combining, then you will need to add an sfence or similar.
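
A rough user-space illustration of the exception, assuming a caller that knows whether its buffer is mapped write combining. All names here (model_dmamap_sync(), model_store_fence(), model_ring_doorbell()) are hypothetical, the C11 release fence only approximates the kernel's fence, and _mm_sfence() is used when the compiler reports SSE support, with a full fence as the fallback.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Assumed stand-in for the tail of bus_dmamap_sync(): release fence only. */
static void
model_dmamap_sync(void)
{
	/* Bounce-buffer handling elided in this sketch. */
	atomic_thread_fence(memory_order_release);
}

/* Stronger store fence, needed only for write-combining mappings. */
static void
model_store_fence(void)
{
#if defined(__SSE__)
	_mm_sfence();
#else
	atomic_thread_fence(memory_order_seq_cst);
#endif
}

/* Sync, add the extra fence only for write-combining memory, then notify. */
static void
model_ring_doorbell(volatile uint32_t *db, uint32_t value, bool buffer_is_wc)
{
	assert(db != 0);
	model_dmamap_sync();
	if (buffer_is_wc)
		model_store_fence();
	*db = value;
}
```

The extra fence lives in the caller, since only the caller knows the mapping attribute; the sync routine itself stays cheap for ordinary memory.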