Keep track of the approximate time commands are 'due' and the next
deadline for a command. Twice a second, wake up to see if any commands
have timed out. If so, quiesce and then enter a recovery mode
half the timeout further in the future to allow the ISR to
complete. Once we exit recovery mode, send a command with no negative
effect to the drive to ensure that any commands that completed after
we last polled are handled in a timely manner.
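Sketched in C, the loop described above might look like the following. This is a minimal sketch, not the patch itself: the field names (recovery_state, deadline, timeout_sbt, timer), the RECOVERY_* states, and the escalation to nvme_ctrlr_reset() when the recovery window expires are all assumptions made for illustration.

```c
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/callout.h>

enum recovery { RECOVERY_NONE, RECOVERY_WAITING };	/* assumed states */

static void
nvme_qpair_timeout_sketch(void *arg)
{
	struct nvme_qpair *qpair = arg;		/* fields assumed, see lead-in */
	sbintime_t now = getsbinuptime();

	switch (qpair->recovery_state) {
	case RECOVERY_NONE:
		/* Has the oldest outstanding command passed its deadline? */
		if (qpair->deadline != SBT_MAX && now > qpair->deadline) {
			/*
			 * Quiesce, then give the ISR half a timeout period
			 * to finish before escalating.
			 */
			qpair->recovery_state = RECOVERY_WAITING;
			qpair->deadline = now + qpair->timeout_sbt / 2;
		}
		break;
	case RECOVERY_WAITING:
		/* Recovery window expired; assume we escalate to a reset. */
		if (now > qpair->deadline) {
			nvme_ctrlr_reset(qpair->ctrlr);
			qpair->recovery_state = RECOVERY_NONE;
		}
		break;
	}

	/* Wake up twice a second, per the summary. */
	callout_reset_sbt(&qpair->timer, SBT_1S / 2, 0,
	    nvme_qpair_timeout_sketch, qpair, 0);
}
```

The summary's final step, issuing a side-effect-free command after recovery ends so that completions which raced the last poll are reaped promptly, is omitted from the sketch.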
Repository: rG FreeBSD src repository
sys/dev/nvme/nvme_private.h

| Line | Comment |
|---|---|
| 317 | "dying" makes sense, but perhaps the intent would be clearer if the variable name more closely matched the function name. is_destruct would be the most on point, but is_destructing or is_destroying might work as well. |
sys/dev/nvme/nvme_qpair.c

| Line | Comment |
|---|---|
| 542 | Perhaps update comment to "qpair is recovering, likely because a controller reset is in" |
| 1035 | Is there a scenario in which deadline isn't initialized? For example, if the timer is armed and the current request is an AER (i.e. req->timeout == false). Will a second AER set qpair->deadline to whatever value is in the previous AER's timeout? If so, does it matter? |
sys/dev/nvme/nvme_private.h

| Line | Comment |
|---|---|
| 317 | Dying is used elsewhere in the kernel for this purpose, but I'd be interested to hear what others have to say. |
sys/dev/nvme/nvme_qpair.c

| Line | Comment |
|---|---|
| 542 | Good point. |
| 1035 | In practice this hasn't been an issue, but the intent here is that we want to set the deadline to the oldest item still in the outstanding request list with a timeout. In debug output I've seen this scenario printed but didn't understand it: I'd see the timeout flip between SBT_MAX and sane values on a time scale of a few hours, and this scenario (or a similar one) would explain it. We recover fairly quickly, but that's in a Netflix workload where we have ~1crapton of outstanding I/O at any time and we seem to rotate through them quickly... So this is a very good question. |
I like the general direction. I was also thinking about something like this to avoid callout operations on every I/O.
I haven't looked very closely, but have some comments inline. Plus, I am not sure we really need the deadline variable. I think we may just traverse the outstanding_tr queue and stop when the timeout of some command is still ahead. In most cases that would mean looking only at the first tracker. I don't think we should really look deeper, since I am not sure it is specified anywhere that the controller should fetch and start processing all commands in parallel; and if it doesn't, and commands are dependent, then we probably should not time out a following command before the earlier one(s). Plus it makes the code simpler. ;)
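A minimal sketch of that traversal, assuming outstanding_tr is kept in submission order and each tracker carries a deadline field, with SBT_MAX marking commands (such as AERs) that never time out; names are illustrative:

```c
#include <sys/param.h>
#include <sys/queue.h>

static void
nvme_qpair_scan_deadlines(struct nvme_qpair *qpair, sbintime_t now)
{
	struct nvme_tracker *tr;

	TAILQ_FOREACH(tr, &qpair->outstanding_tr, tailq) {
		if (tr->deadline == SBT_MAX)
			continue;	/* e.g. an AER; never times out */
		if (tr->deadline > now)
			break;		/* oldest real deadline still ahead */
		/* tr is overdue: begin recovery for this qpair. */
	}
}
```

In the common case this looks only at the first tracker, and it sidesteps the stale cached-deadline question raised above for AERs.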
sys/dev/nvme/nvme_qpair.c

| Line | Comment |
|---|---|
| 1014 | There is callout_schedule() for rescheduling with fewer arguments (see the snippet after this table). |
| 1036 | Why have a variable if it is always the same? |
| 1042 | Should we really postulate that we have no different timeouts? We actually already have two: the normal one and 1s for polling. |
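For reference, callout_schedule(9) re-arms a callout with the handler and argument remembered from the most recent callout_reset(), so a self-rescheduling handler only has to pass the interval (timer and handler names as in the sketches above):

```c
/* First arm, e.g. when the qpair is constructed: remembers handler and arg. */
callout_reset(&qpair->timer, hz / 2, nvme_qpair_timeout, qpair);

/* Inside nvme_qpair_timeout() itself, rescheduling needs only this. */
callout_schedule(&qpair->timer, hz / 2);
```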
Update with comments:
- remove the attempt to cache the right deadline (just walk the list instead)
- allow reset to confirm whether there's really a hotplug event
sys/dev/nvme/nvme_qpair.c

| Line | Comment |
|---|---|
| 1042 | Originally I'd wanted to have a single timeout that was half of the longest scheduled timeout, but having it fixed at 1/2s was easier to code. I'm not entirely sure of the benefit of longer polling periods, but I'll keep that in mind if I start to see contention introduced by the 1/2s timeout. |
| 1255 | Now that we are manipulating the tr entries here without a lock, when we arrive here via timeout we could be racing both the submission process (which adjusts the tr without the qpair lock) and other threads completing other trs, which remove things from the list. While we won't crash, we may stop prematurely and/or try to concurrently fail an item that's otherwise completing. Solving these races requires some careful thought; see the sketch after this table. I've had one crash in the three weeks we've been running these patches fleet-wide at Netflix that points to this issue. |
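One possible shape of a fix, sketched on the assumption that the qpair mutex can be taken in the timeout path. Note this alone only closes the completion-side race: since the submission path is said above to adjust trackers without the qpair lock, it would have to start taking it as well. The helper nvme_qpair_timeout_tracker() is hypothetical.

```c
#include <sys/param.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/queue.h>

static void
nvme_qpair_fail_overdue(struct nvme_qpair *qpair, sbintime_t now)
{
	struct nvme_tracker *tr, *tr_tmp;

	mtx_lock(&qpair->lock);
	TAILQ_FOREACH_SAFE(tr, &qpair->outstanding_tr, tailq, tr_tmp) {
		if (tr->deadline > now)
			break;	/* submission order: the rest are newer */
		/*
		 * With the lock held, completion can't unlink entries
		 * mid-walk, so we neither stop prematurely nor race a
		 * tracker that is completing on another CPU.
		 */
		nvme_qpair_timeout_tracker(qpair, tr);	/* hypothetical */
	}
	mtx_unlock(&qpair->lock);
}
```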
sys/dev/nvme/nvme_private.h

| Line | Comment |
|---|---|
| 180 | This seems to be unused now. |
sys/dev/nvme/nvme_qpair.c

| Line | Comment |
|---|---|
| 978 | If nothing is complete, why is it a missed interrupt? |
| 1144 | What's the point of this condition? I suppose the goal of the assignment below is to avoid a false reaction to a tracker that is still inside busdma, which may take some time, but then the assignment should also be done. |
sys/dev/nvme/nvme_private.h

| Line | Comment |
|---|---|
| 180 | It is. |
sys/dev/nvme/nvme_qpair.c

| Line | Comment |
|---|---|
| 978 | If we find work done in the timeout that we should have found in the ISR, that's a classic symptom of a missed interrupt: we should have had an interrupt to notify us, but instead we timed out and found work. Normally we'd expect to find things stuck and have to cancel/reset to fix them. Though to be pedantic, it's only a missed interrupt if the completions we find are past their deadline and not some other work. (See the sketch after this table.) |
| 1144 | I think you may be right that this is unwise here. I need to walk through the race I found in this code to see if this even makes sense anymore. |
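The heuristic in that reply, sketched under the assumption that the completion-processing routine reports whether it retired anything (the boolean return, and the recovery_state field from the earlier sketch, are assumptions):

```c
static void
nvme_qpair_poll_from_timeout(struct nvme_qpair *qpair)
{
	bool found = nvme_qpair_process_completions(qpair);

	if (found && qpair->recovery_state == RECOVERY_NONE) {
		/*
		 * The ISR should have reaped this work; discovering it
		 * from the timer instead is the classic signature of a
		 * missed interrupt (pedantically, only if the completions
		 * were past their deadline rather than other work).
		 */
		nvme_printf(qpair->ctrlr, "Missed interrupt\n");
	}
}
```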
Update printfs to be better (per mav@'s critique)
Remove a comment that I don't think is true anymore (but was true in earlier code)