Details

Reviewers

mav
jrtc27
kib
chuck

Commits

rGf76c34659f9d: nvme: coherently read status of completion records
rGffb294bd3157: nvme: coherently read status of completion records
rGaa0ab681ae75: nvme: coherently read status of completion records

Summary

When reading a completion record, avoid a race with the device. If the
host starts to read the completion record and then the device updates it
while we're reading it, we can have the early part of the record be old
and the later part of the record be new. This leads us to mistakenly
think that the record is in phase and we use the old values and look
at an already completed entry, which has no current tracker.

To work around this problem, we atomically read the status with acquire
semantics. If it's in phase, we then re-read the entire completion
record. In addition we resync the dmatag to reflect changes since the
prior loop for the bouncing dma case.
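
A minimal sketch of the status-first read described above, assuming C11 atomics in place of the kernel's atomic(9) primitives. The names struct cpl_slot, struct cpl_snap and cpl_fetch are hypothetical and the record is simplified; this is not the driver's struct nvme_completion or the code in the diff.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified record; not the driver's struct nvme_completion. */
struct cpl_slot {
	uint32_t		cdw0;
	uint16_t		sqhd;
	uint16_t		cid;
	_Atomic uint16_t	status;		/* bit 0 is the phase tag */
};

/* Plain snapshot handed back to the caller. */
struct cpl_snap {
	uint32_t	cdw0;
	uint16_t	sqhd;
	uint16_t	cid;
	uint16_t	status;
};

#define	CPL_PHASE(s)	((unsigned)((s) & 0x1u))

/*
 * Load the status first with acquire semantics. Only if its phase tag
 * matches do we read the other fields, so they cannot be older than the
 * status that passed the check.
 */
static bool
cpl_fetch(struct cpl_slot *slot, unsigned phase, struct cpl_snap *out)
{
	uint16_t status;

	assert(phase <= 1);
	status = atomic_load_explicit(&slot->status, memory_order_acquire);
	if (CPL_PHASE(status) != phase)
		return (false);		/* not ours yet; leave *out alone */

	out->cdw0 = slot->cdw0;
	out->sqhd = slot->sqhd;
	out->cid = slot->cid;
	out->status = status;
	return (true);
}
```

The sketch only models the ordering of the reads. It says nothing about the bounced DMA case, where the host reads a copy of the record and the acquire load alone is not enough.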

Found by: jrtc27 (this fix is based in part on her D30995 fix)
Sponsored by: Netflix

Test Plan

A slightly different approach to fixing this race than the one in D30995. We should consider both.

Diff Detail

Repository

rS FreeBSD src repository - subversion

Lint

Lint Passed

Unit

No Test Coverage

Build Status

Buildable 40240
Build 37129: arc lint + arc unit

Event Timeline

imp created this revision. Jul 2 2021, 2:16 PM

Herald added a subscriber: dab. Jul 2 2021, 2:16 PM

imp requested review of this revision. Jul 2 2021, 2:16 PM

Harbormaster completed remote builds in B40236: Diff 91669. Jul 2 2021, 2:16 PM

imp mentioned this in D31001: nvme: Fix alignment on nvme structures. Jul 2 2021, 2:17 PM

imp added reviewers: mav, jrtc27, kib, chuck.

imp added a parent revision: D31001: nvme: Fix alignment on nvme structures.

imp edited the summary of this revision. Jul 2 2021, 2:19 PM

imp edited the test plan for this revision.

imp mentioned this in D30995: nvme: Fix race condition in nvme_qpair_process_completions.

I don't think you can skip it on the first iteration:

Host                  Drive

                   Write A.CID
                        |
                        V
                  Write A.STATUS
                        |
                        V
               +--- Send IRQ
               |        |
Receive IRQ <--+        |
     |                  V
     |             Write B.CID
     |                  |
     |                  V
     |            Write B.STATUS
     |                  |
     V                  |
   Sync                 |
     |                  |
     V                  |
 Read A.CID             |
     |                  |
     V                  |
Read A.STATUS           |
     |                  |
     V                  |
  Process               |
     |                  |
     V                  |
Read B.CID              |
     |                  |
     V                  |
Read B.STATUS           |
     |                  |
     V                  |
  Process               |
     |                  V
     |         +--- Send IRQ
     V         |        |
Receive IRQ <--+        |
     |                  |
     V                  |
Read C.CID              |
     |                  V
     |             Write C.CID
     |                  |
     |                  V
     |            Write C.STATUS
     V
Read C.STATUS

Well, at least in the bouncing case, since the load acquire doesn't actually help you. I think in the coherent non-bouncing DMA case the load acquire does address the issue, but also the sync isn't needed at all there.
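
To make the bouncing case concrete, here is a toy model, under the assumption that a bounced mapping behaves like two separate buffers and that a POSTREAD sync (bus_dmamap_sync() in the real driver) is what copies the drive's writes into the buffer the host reads. struct bounced_cq, model_sync and in_phase are hypothetical names for illustration only.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define	NSLOTS	2

/* Hypothetical model of a bounced mapping: drive and CPU see different memory. */
struct bounced_cq {
	uint16_t dev_status[NSLOTS];	/* what the drive wrote */
	uint16_t cpu_status[NSLOTS];	/* what the host actually reads */
};

/* Models a POSTREAD sync: copy the drive's view into the CPU's view. */
static void
model_sync(struct bounced_cq *q)
{
	memcpy(q->cpu_status, q->dev_status, sizeof(q->cpu_status));
}

/* Phase check against the CPU's copy only, as the host would see it. */
static int
in_phase(const struct bounced_cq *q, unsigned slot, unsigned phase)
{
	assert(slot < NSLOTS);
	return ((q->cpu_status[slot] & 0x1u) == phase);
}
```

In this model an entry the drive writes after the last sync stays invisible to the host until the next sync, whatever ordering the host's loads have, which is why the sync cannot be skipped on any iteration.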

FWIW, the completion handler in Linux does something similar, but it orders the memory synchronization differently. Effectively, it does

while ( phase_matches() ) {
    DMA_read_memory_barrier();
    process_completion();
}

where the phase_matches() function reads status and masks the phase bit.
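
The loop shape above can be sketched in userland C as follows. This is a simplified model, not the Linux or FreeBSD code: struct cq and cq_drain are hypothetical, the ring is an ordinary array, and atomic_thread_fence(memory_order_acquire) stands in for the DMA read barrier.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define	CQ_ENTRIES	4

/* Hypothetical, simplified queue state; names are for illustration only. */
struct cq {
	struct { uint16_t cid; uint16_t status; } ring[CQ_ENTRIES];
	unsigned	head;		/* next slot to inspect */
	unsigned	phase;		/* phase tag expected in that slot */
	unsigned	done;		/* entries processed so far */
	uint16_t	last_cid;	/* cid of the most recent entry */
};

/* Reads status and masks the phase bit (bit 0). */
static int
phase_matches(const struct cq *q)
{
	assert(q->head < CQ_ENTRIES);
	return ((q->ring[q->head].status & 0x1u) == q->phase);
}

static void
process_completion(struct cq *q)
{
	q->last_cid = q->ring[q->head].cid;
	q->done++;
	if (++q->head == CQ_ENTRIES) {
		q->head = 0;
		q->phase ^= 1;	/* expected phase flips on every wrap */
	}
}

/* Drain every in-phase entry; returns how many were processed. */
static unsigned
cq_drain(struct cq *q)
{
	unsigned before = q->done;

	while (phase_matches(q)) {
		/* Stand-in for the DMA read barrier: status load before later loads. */
		atomic_thread_fence(memory_order_acquire);
		process_completion(q);
	}
	return (q->done - before);
}
```

The point of the ordering is that the fence sits between the phase test and any read of the rest of the entry, on every iteration rather than only the first.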

In D31002#697523, @chuck wrote:
FWIW, the completion handler in Linux does something similar, but it orders the memory synchronization differently. Effectively, it does
while ( phase_matches() ) {
    DMA_read_memory_barrier();
    process_completion();
}
where the phase_matches() function reads status and masks the phase bit.

Yes. That's effectively the order that Jessica's patch has. And I agree with her there is no 'first' here, so I'll update this patch with that.

fold in Jessica's observation that sync is needed, as well as chuck's feedback.

Harbormaster completed remote builds in B40239: Diff 91673. Jul 2 2021, 3:48 PM

silly compiler error due to too many retypings of the same code :(.

Harbormaster completed remote builds in B40240: Diff 91674. Jul 2 2021, 3:52 PM

imp added inline comments. Jul 2 2021, 3:55 PM

sys/dev/nvme/nvme_qpair.c
598	with the sync below (and above), do I need the atomic here?

imp edited the summary of this revision. Jul 2 2021, 4:01 PM

jrtc27 added inline comments. Jul 2 2021, 4:03 PM

sys/dev/nvme/nvme_qpair.c
592	cid not sqid, we don't look at the latter.
598	No, it can then just be a direct read of the field. I was hoping my version would be optimised by the compiler to be entirely equivalent, but it seems not... so I think reading just the field rather than copying the struct like I did is a good idea. I don't know why it didn't optimise it though, it really should have been able to (and the pass that does so is one I'm far too familiar with...).
607	I'd only check the phase. status is 2 bytes so the initial read can technically tear if you're not careful; at least it will currently on riscv due to the packed struct, i.e. the byte of status that doesn't include the phase bit could have been read before and inconsistent with the byte that does include the phase bit.
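
A small model of the tearing concern in the 607 comment, assuming a little-endian host that fetches the 16-bit status one byte at a time and a phase tag in bit 0 of status. torn_load and phase_is are hypothetical helpers; the values used with them are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define	STATUS_PHASE(s)	((unsigned)((s) & 0x1u))

/*
 * Model of a torn 16-bit load: the high byte was fetched before the
 * device's update and the low byte (which holds the phase bit) after it.
 */
static uint16_t
torn_load(uint16_t old_status, uint16_t new_status)
{
	return ((uint16_t)((old_status & 0xff00u) | (new_status & 0x00ffu)));
}

/* A single bit cannot tear, so testing only the phase bit stays reliable. */
static int
phase_is(uint16_t status, unsigned phase)
{
	assert(phase <= 1);
	return (STATUS_PHASE(status) == phase);
}
```

For example, with an old status of 0x0200 and a new status of 0x0001, the torn value is 0x0201: the phase bit reads as new, but the upper byte is stale. That is the argument for using the first read only for the phase test and taking the rest of status from the later re-read.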

chuck added inline comments. Jul 2 2021, 4:08 PM

sys/dev/nvme/nvme_qpair.c
598	For a compliant device, the NVMe specification says: "If a completion queue entry is constructed via multiple writes, the Phase Tag bit shall be updated in the last write of that completion queue entry." I think this implies that the syncs are sufficient and the atomic isn't needed.

imp marked 4 inline comments as done. Jul 2 2021, 4:51 PM

imp added inline comments.

sys/dev/nvme/nvme_qpair.c
592	tweaked.
598	thanks guys. removed.
607	Gotcha. I'd prefer to over-test, but if it doesn't matter, then it's better to not, even though the result is a bit awkward.

review comments: remove atomic, tweak wording, test phase.

Harbormaster completed remote builds in B40243: Diff 91677. Jul 2 2021, 4:52 PM

jrtc27 added inline comments. Jul 2 2021, 4:54 PM

sys/dev/nvme/nvme_qpair.c
607	Somehow I forgot to mention in my previous comment (though I thought it, and in fact it was what made me think about the assertion itself) that the message for this assertion doesn't make sense to me.

fix assert message.

Harbormaster completed remote builds in B40244: Diff 91678. Jul 2 2021, 5:02 PM

imp marked an inline comment as done. Jul 2 2021, 5:02 PM

imp added inline comments.

sys/dev/nvme/nvme_qpair.c
607	Looks like I'm so busted for copying and pasting, eh? Fixed.

I nearly wrote this exact diff rather than the one I ended up with; I didn't, for simplicity, but I do think it's better than my less efficient version.

This revision is now accepted and ready to land. Jul 2 2021, 5:14 PM

Subject and description will need updating to reflect the final patch before committing, though.

In D31002#697580, @jrtc27 wrote:

Subject and description will need updating to reflect the final patch before committing though

Yes. I'll update, but changing that in mid review messes up git-arc unless done very carefully :(
