
nvme: Greatly improve error recovery
ClosedPublic

Authored by imp on Oct 10 2022, 2:15 AM.

Details

Summary

Next phase of error recovery: Eliminate the RECOVERY_START phase, since
we don't need to wait to start recovery. Eliminate the RECOVERY_RESET
phase since it is transient; we now transition directly from
RECOVERY_NORMAL into RECOVERY_WAITING.
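
The resulting state machine is simple enough to sketch. A minimal
rendering, using the state names this summary mentions (the enum layout
in the driver itself may differ):

enum nvme_recovery {
	RECOVERY_NORMAL,	/* normal operation: submit and complete I/O */
	RECOVERY_WAITING,	/* reset in flight: queue I/O, don't submit */
};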

In normal mode, read the status of the controller. If it is in a failed
state, or appears to have been hot-plugged, jump directly to reset, which
will sort out the proper things to do. This causes all pending I/O to
complete with an abort status before the reset.
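
A minimal sketch of that status check, assuming the driver's usual
register helpers (nvme_mmio_read_4, NVME_CSTS_GET_CFS) and the NVME_GONE
define discussed in the inline comments below; the function name
nvme_ctrlr_needs_reset is hypothetical:

static bool
nvme_ctrlr_needs_reset(struct nvme_controller *ctrlr)
{
	uint32_t csts = nvme_mmio_read_4(ctrlr, csts);

	/* Reads from a hot-unplugged device return all-1s. */
	if (csts == NVME_GONE)
		return (true);

	/* CFS set: the controller has declared a fatal status. */
	if (NVME_CSTS_GET_CFS(csts))
		return (true);

	return (false);
}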

When in the NORMAL state, call the interrupt handler. This will complete
all pending transactions when interrupts are broken or temporarily
misbehaving. We then check all the pending completions for timeouts. If
we have aborts enabled, we'll send an abort; otherwise we'll assume the
controller is wedged and needs a reset. By calling the interrupt handler
here, we avoid an issue with the current code, where the transition to
RECOVERY_START prevented any completions from happening. Now completions
happen. In addition, any follow-on I/O scheduled in the completion
routines will be submitted, rather than queued, because the recovery
state is correct. This also fixes a problem where I/O would time out but
never complete, leading to hung I/O.
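
A sketch of that NORMAL-state pass, with a hypothetical scan function;
the entry points it calls (nvme_qpair_process_completions,
nvme_ctrlr_cmd_abort, nvme_ctrlr_reset) and the field names are meant to
follow the driver's existing layout, but treat the details as
illustrative:

static void
nvme_qpair_timeout_scan(struct nvme_qpair *qpair)
{
	struct nvme_tracker *tr;
	sbintime_t now = getsbinuptime();

	/* Complete whatever a broken interrupt path missed. */
	nvme_qpair_process_completions(qpair);

	/* Anything still outstanding past its deadline has timed out. */
	TAILQ_FOREACH(tr, &qpair->outstanding_tr, tailq) {
		if (tr->deadline == SBT_MAX || now <= tr->deadline)
			continue;
		if (qpair->ctrlr->enable_aborts) {
			/* Ask the controller to abort the stuck command. */
			nvme_ctrlr_cmd_abort(qpair->ctrlr, tr->cid,
			    qpair->id, nvme_abort_complete, tr);
		} else {
			/* No aborts: assume the controller is wedged. */
			nvme_ctrlr_reset(qpair->ctrlr);
			return;
		}
	}
}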

Resetting remains the same as before; only when we choose to reset has
changed.

A nice side effect of these changes is that we now do I/O even when
interrupts to the card are totally broken. Follow-on commits will improve
the error reporting and logging when this happens. Performance will be
awful, but the driver will at least be minimally functional.

MFC After: 3 days
Sponsored by: Netflix

Diff Detail

Repository
rG FreeBSD src repository

Event Timeline

imp requested review of this revision. Oct 10 2022, 2:15 AM
sys/dev/nvme/nvme_qpair.c
996

There is an NVME_GONE define for this; actually, all-1s also means cfs is set, so this check for NVME_GONE does not add much.

1008

Is it safe to call this in parallel with active interrupts? I don't see any locking there.

imp marked 2 inline comments as done. Aug 12 2023, 4:56 PM
imp added inline comments.
sys/dev/nvme/nvme_qpair.c
996

Will use NVME_GONE.

I'll agree that the csts check doesn't add much. It's more about intent, though that can be covered adequately by the comments, so I'll remove it.

1008

I keep going round and round about how safe it is. Although it is safer than it used to be, this is a great question, because after looking closely I don't think it's safe enough. The qpair lock is used to protect qpair state, so it's safe from that perspective... but it doesn't protect all of the qpair state (even though I fixed a couple of unprotected increments), and it doesn't protect the hardware against parallel access, which would be a problem.

We already make this call, so it's not a new problem... but it's a problem that should be fixed nonetheless. I have an idea I'll explore; I'll post a review on it and a pointer here. IIRC, the one issue I had last time I tried to do locking was the case of a panic while we happened to be in the completion routine: we can't restrict the completion routine to a single caller if we need to poll to write crash dumps... but I think that's surmountable.

For the case I'm trying to fix there are no races, since there are no
interrupts; for the more typical case,
https://reviews.freebsd.org/D36924 is what's needed, though there may be
better, more performant ways to accomplish this.

sys/dev/nvme/nvme_qpair.c
984

I'm surprised this is locking manually rather than using callout_init_mtx().
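
For reference, the pattern this comment suggests, as a sketch; the timer
and lock field names are assumed to match the qpair structure:

	/* Bind the callout to the qpair lock once, at init time... */
	callout_init_mtx(&qpair->timer, &qpair->lock, 0);

	/*
	 * ...after which the handler always runs with qpair->lock held,
	 * and callout_stop()/callout_reset() under that lock cannot race
	 * the handler.
	 */
	callout_reset(&qpair->timer, hz, nvme_qpair_timeout, qpair);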

1007

Dropping the lock here means you can't depend on callout_stop() under the lock not having races. If nvme_qpair_process_completions just locks it again, maybe add a _locked variant of nvme_qpair_process_completions this can call? (Or make all the callers acquire the lock?)
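
The _locked split being suggested, sketched with the completion-draining
body elided (and ignoring the panic-time polling questions discussed
above):

static bool
nvme_qpair_process_completions_locked(struct nvme_qpair *qpair)
{
	mtx_assert(&qpair->lock, MA_OWNED);
	/* ... drain the completion queue and run callbacks ... */
	return (true);
}

bool
nvme_qpair_process_completions(struct nvme_qpair *qpair)
{
	bool done;

	mtx_lock(&qpair->lock);
	done = nvme_qpair_process_completions_locked(qpair);
	mtx_unlock(&qpair->lock);
	return (done);
}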

imp marked 2 inline comments as done. Aug 14 2023, 11:41 PM
imp added inline comments.
sys/dev/nvme/nvme_qpair.c
984

tl;dr: this comment really exposes that I need a recovery lock plus a tracker (and other state) lock. Will update for that and the other comments.

This revision was not accepted when it landed; it landed in state Needs Review. Aug 25 2023, 4:20 PM
This revision was automatically updated to reflect the committed changes.