
Two commits to cleanup needless wmb() in nvme driver
ClosedPublic

Authored by imp on Dec 2 2020, 7:02 PM.

Details

Summary
Remove a wmb() that's not necessary.

bus_dmamap_sync() is supposed to guarantee that memory prepared with
PREWRITE can be DMA'd immediately after it returns.

For non-x86 platforms, bus_dmamap_sync() takes care of ensuring that
all writes to the command buffer have been posted well enough for the
device to initiate DMA from that memory and observe the new contents.
They all issue a memory fence of the appropriate strength.

For x86 platforms, the memory ordering is already strong enough. Once
the command memory is written, the subsequent write to the uncached BAR
that triggers the DMA is guaranteed to observe those contents. As such,
we don't need the wmb() here. It translates to an sfence, which is only
needed for writes to regions with the write-combining attribute set,
and the nvme driver uses no such mappings. Now that x86's
bus_dmamap_sync() includes a __compiler_membar, we can be assured the
optimizer won't reorder the bus_dmamap_sync and the bus_space_write
operations.
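
As a minimal sketch of the submission path the first commit touches (the function and field names here only approximate the driver and are illustrative, not the verbatim code):

```c
/*
 * Sketch only: names approximate nvme_qpair_submit_tracker() in
 * sys/dev/nvme/nvme_qpair.c; this is not the verbatim driver code.
 */
static void
nvme_submit_sketch(struct nvme_qpair *qpair, struct nvme_command *cmd)
{
	/* Copy the command into the host-memory submission queue slot. */
	memcpy(&qpair->cmd[qpair->sq_tail], cmd, sizeof(*cmd));
	if (++qpair->sq_tail == qpair->num_entries)
		qpair->sq_tail = 0;

	/*
	 * PREWRITE guarantees the command is visible to the device before
	 * any DMA it initiates.  No wmb() follows: non-x86 busdma
	 * implementations already issue the required fence, and x86 orders
	 * write-back stores ahead of the uncached doorbell write below.
	 */
	bus_dmamap_sync(qpair->dma_tag, qpair->queuemem_map,
	    BUS_DMASYNC_PREWRITE);

	/* Ring the doorbell: an uncached BAR write that kicks off the DMA. */
	bus_space_write_4(qpair->mmio_tag, qpair->mmio_handle,
	    qpair->sq_tdbl_off, qpair->sq_tail);
}
```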

and

Annotate bus_dmamap_sync() with fence

Add an explicit thread fence release before returning from
bus_dmamap_sync(). This should be a no-op in practice, but it makes
explicit that all ordinary stores will have completed before subsequent
reads/writes to ordinary device memory.

There is one exception: if you've mapped memory as write-combining,
then you will still need to add an sfence or similar.
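
A sketch of where that fence could sit in the x86 bus_dmamap_sync() path (the bounce-buffer logic is elided and the exact placement here is illustrative, not the actual patch):

```c
/*
 * Sketch only: the real x86 implementation does its bounce-buffer work
 * first; this just shows the release fence added before returning.
 */
void
bus_dmamap_sync_sketch(bus_dma_tag_t dmat, bus_dmamap_t map,
    bus_dmasync_op_t op)
{
	/* ... bounce-buffer copying for PREWRITE / POSTREAD elided ... */

	/*
	 * On x86 this is only a compiler barrier (write-back and
	 * uncacheable stores are already ordered by the hardware), but it
	 * documents the guarantee: all ordinary stores complete before
	 * subsequent accesses to ordinary device memory.  Write-combining
	 * mappings still need an explicit sfence.
	 */
	atomic_thread_fence_rel();
}
```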
Test Plan

Smoke tested with a buildworld on an nda-backed ZFS pool.

Diff Detail

Repository
rS FreeBSD src repository - subversion

Event Timeline

imp requested review of this revision. Dec 2 2020, 7:02 PM
imp added reviewers: mav, chuck, jimharris, mmel, jhibbits.
imp added inline comments.
sys/dev/nvme/nvme_qpair.c
985 ↗(On Diff #80237)

I'm half thinking this should just be #ifdef x86, but I didn't want to change that just yet.

sys/dev/nvme/nvme_qpair.c
994 ↗(On Diff #80237)

The above comment is only correct if the sfence here is actually needed. kib@ makes a good case that it's not needed.

sys/dev/nvme/nvme_qpair.c
994 ↗(On Diff #80237)

And talking to Drew Gallatin, he's convinced me that the DMA the device does will be consistent with what was written, even if the physical DRAM doesn't yet contain the new contents. The volatile pointer dereference in bus_space_* will act as a memory barrier to keep the compiler from reordering that call with the memcpy above.

I want to be stronger :) - to the best of my knowledge, wmb() is not needed at all, or bus_dmamap_sync() is broken for x86. Nothing in between.
In the current revision, the memory barrier should already be ensured by the call to bus_dmamap_sync() - it's an external function and the buffer is not a local variable, so the compiler must assume the call accesses that data.
I think the x86 architecture also guarantees correct store ordering for external observers, so the use of wmb() (or any other explicit synchronization) in any driver is nothing but a bug.

The only possible exception is a situation where write combining may break the required ordering (like setting the OWN bit on a multiword descriptor).
In that case an explicit sync is necessary to ensure that the full descriptor becomes visible to DMA before the OWN bit is set. That would need another flavor of bus_dmamap_sync(), but we don't have one, just as we don't have a bus_dmamap_sync() over a specific range or a 'cannot be bounced' flag for bus_dmamap_create().
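
A hypothetical illustration of that write-combining hazard (the descriptor layout and OWN bit here are invented for the example, not taken from any driver):

```c
/* Hypothetical multiword DMA descriptor living in a write-combining mapping. */
struct wc_desc {
	uint64_t	addr;
	uint32_t	len;
	uint32_t	flags;		/* bit 31 = OWN: device owns the descriptor */
};
#define	WC_DESC_OWN	(1u << 31)

static void
wc_post_desc(volatile struct wc_desc *d, uint64_t addr, uint32_t len)
{
	d->addr = addr;
	d->len = len;
	/*
	 * Write-combining stores may be flushed out of program order, so
	 * without a store fence the device could observe OWN set while
	 * addr/len are still stale.  wmb() (an sfence on x86, from
	 * machine/atomic.h) drains the WC buffers before ownership is
	 * handed over.
	 */
	wmb();
	d->flags = WC_DESC_OWN;
}
```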

sys/dev/nvme/nvme_qpair.c
994 ↗(On Diff #80237)

Volatile access does not provide a barrier.

I think that x86 needs atomic_thread_fence_rel() there. I suspect that most arches needing fence semantics not already ensured by bus_dmamap_sync() would be served by the fence_rel too.

In D27448#613489, @mmel wrote:

I want to be stronger :) - to the best of my knowledge, wmb() is not needed at all, or bus_dmamap_sync() is broken for x86. Nothing in between.
In the current revision, the memory barrier should already be ensured by the call to bus_dmamap_sync() - it's an external function and the buffer is not a local variable, so the compiler must assume the call accesses that data.
I think the x86 architecture also guarantees correct store ordering for external observers, so the use of wmb() (or any other explicit synchronization) in any driver is nothing but a bug.

The only possible exception is a situation where write combining may break the required ordering (like setting the OWN bit on a multiword descriptor).
In that case an explicit sync is necessary to ensure that the full descriptor becomes visible to DMA before the OWN bit is set. That would need another flavor of bus_dmamap_sync(), but we don't have one, just as we don't have a bus_dmamap_sync() over a specific range or a 'cannot be bounced' flag for bus_dmamap_create().

After talking to Drew, I think that you're right....

Based on that, I think that the right thing to do is to just remove the wmb() entirely.

sys/dev/nvme/nvme_qpair.c
994 ↗(On Diff #80237)

When talking to Drew, we used "memory bar" to mean "compiler memory bar", which the volatile in bus_space will afford.

So that raises the question of whether we should put that into bus_dmamap_sync or not on x86.... The equivalent is in the arm64, arm and powerpc dmamap routines, if I'm reading them right.

arm64 defines atomic_thread_fence_rel() as dsb(sy), and has a dsb(sy) for the PREWRITE case.
arm does similar things with dmb.
powerpc64 does powerpc_sync() at the end of bus_dmamap_sync(), which is stronger than the powerpc_lwsync() that fence_rel() uses.

Since bus_dmamap_sync() on x86 is not inlined, a __compiler_membar() seems redundant, since one is already implied by the bus_space() functions.

So in light of all that, what do you recommend, kib@? I'm sympathetic to mmel's notion that needing extra barriers after the sync is a bug in the busdma implementation.... It's my belief that the C standard already requires an implicit compiler_membar() for the function call, but you're in a much better position to know that than I am.

sys/dev/nvme/nvme_qpair.c
994 ↗(On Diff #80237)

Calling a function does not provide the guarantees that __compiler_membar() does. A function call only ensures a sequence point, which is only meaningful for the C abstract machine; it does not prevent the compiler from moving memory accesses around the call.

So I do believe that x86 needs __compiler_membar(). I do think that placing fence_rel() into i386/amd64 bus_dmamap_sync() is the right resolution.
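
For reference, the relevant amd64 primitives are roughly the following (paraphrased from machine/atomic.h as of this review, not quoted exactly):

```c
/* Paraphrased from amd64 <machine/atomic.h>; not an exact quote. */
#define	__compiler_membar()	__asm __volatile(" " : : : "memory")

static __inline void
atomic_thread_fence_rel(void)
{
	/*
	 * Stores to write-back/uncacheable memory are already ordered on
	 * x86, so a release fence only has to stop compiler reordering.
	 */
	__compiler_membar();
}

/* wmb(), by contrast, emits a real sfence; only write-combining memory needs it. */
#define	wmb()	__asm __volatile("sfence;" : : : "memory")
```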

sys/dev/nvme/nvme_qpair.c
994 ↗(On Diff #80237)

I don't agree with the first paragraph. For all non-local data, calling an external function forces the compiler not to move memory accesses across the call and not to reuse cached values across the call.
The called external function can change anything in memory, and the compiler cannot predict whether any given data are changed or not.
Only non-exported local variables are the exception.

About the second paragraph, and only for my education - does this mean that the x86 architecture doesn't guarantee store ordering and/or visibility of the store buffers to external observers? Or is it necessary as a 'barrier' for possible write combining?

sys/dev/nvme/nvme_qpair.c
994 ↗(On Diff #80237)

I don't agree with the first paragraph. For all non-local data, calling an external function forces the compiler not to move memory accesses across the call and not to reuse cached values across the call.
The called external function can change anything in memory, and the compiler cannot predict whether any given data are changed or not.
Only non-exported local variables are the exception.

This is mostly fine when you consider this guarantee a feature of the current build.
The moment LTO is enabled, the guarantee disappears.

About the second paragraph, and only for my education - does this mean that the x86 architecture doesn't guarantee store ordering and/or visibility of the store buffers to external observers? Or is it necessary as a 'barrier' for possible write combining?

I do not understand this question. The second paragraph you reference talked about __compiler_membar(), which ensures that the compiler cannot reorder memory accesses around it.

x86 guarantees that writes to write-back (and normal uncacheable) memory are observed in program order. This is not true for write-combining memory, or in some other situations where special instructions are used.

sys/dev/nvme/nvme_qpair.c
994 ↗(On Diff #80237)

This is mostly fine when you consider this guarantee a feature of the current build.
The moment LTO is enabled, the guarantee disappears.

Well, LTO is a very heavy/strong argument :) On the other side, putting __compiler_membar() at the top of bus_dmamap_sync() is almost a no-op in the current situation (I think), so why not.

x86 guarantees that writes to write-back (and normal uncacheable) memory are observed in program order. This is not true for write-combining memory, or in some other situations where special instructions are used.

Thank you very much, that's exactly the explanation I needed.

imp edited the test plan for this revision.

Redo

  • Annotate bus_dmamap_sync() with fence
  • Remove a wmb() that's not necessary.

It is fine for x86, but is it OK for mips/risc-v? Or is nvme(4) not considered feasible on those arches?

In D27448#613989, @kib wrote:

It is fine for x86, but is it OK for mips/risc-v? Or is nvme(4) not considered feasible on those arches?

nvme isn't supported on mips or risc-v. risc-v is a possibility; mips realistically is not.

However, looking at the mips busdma implementation, it wouldn't need a wmb() here as it does the right cache dances... though it might need a __compiler_membar() for LTO, should we ever support that (which seems unlikely).
riscv has a fence for the PREWRITE operation and is likely fine (especially since wmb() maps to fence() on riscv)...

I'll update the commit message(s) here in a bit, since they are radically different from the current ones.

imp retitled this revision from nvme: Explain why we have a wmb() when updating the sq tailq pointer to Two commits to cleanup needless wmb() in nvme driver. Dec 4 2020, 5:57 PM
imp edited the summary of this revision.
imp edited the test plan for this revision.
This revision is now accepted and ready to land. Dec 4 2020, 6:15 PM
gallatin added a reviewer: gallatin.

This looks correct to me.

Thanks kib@ and mmel@ for helping me understand!