Keep track of the approximate time commands are 'due' and the next
deadline for a command. Twice a second, wake up to see if any commands
have timed out. If so, quiesce and then enter a recovery mode
half the timeout further in the future to allow the ISR to
complete. Once we exit recovery mode, send a command with no negative
effect to the drive to ensure that any commands that completed after
we last polled are handled in a timely manner.
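Sketched in C, the loop described above might look like the following. This is a minimal sketch, not the patch itself: the field names (recovery_state, deadline, timeout_sbt, timer), the RECOVERY_* states, and the escalation to nvme_ctrlr_reset() when the recovery window expires are all assumptions made for illustration.

```c
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/callout.h>

enum recovery { RECOVERY_NONE, RECOVERY_WAITING };	/* assumed states */

static void
nvme_qpair_timeout_sketch(void *arg)
{
	struct nvme_qpair *qpair = arg;		/* fields assumed, see lead-in */
	sbintime_t now = getsbinuptime();

	switch (qpair->recovery_state) {
	case RECOVERY_NONE:
		/* Has the oldest outstanding command passed its deadline? */
		if (qpair->deadline != SBT_MAX && now > qpair->deadline) {
			/*
			 * Quiesce, then give the ISR half a timeout period
			 * to finish before escalating.
			 */
			qpair->recovery_state = RECOVERY_WAITING;
			qpair->deadline = now + qpair->timeout_sbt / 2;
		}
		break;
	case RECOVERY_WAITING:
		/* Recovery window expired; assume we escalate to a reset. */
		if (now > qpair->deadline) {
			nvme_ctrlr_reset(qpair->ctrlr);
			qpair->recovery_state = RECOVERY_NONE;
		}
		break;
	}

	/* Wake up twice a second, per the summary. */
	callout_reset_sbt(&qpair->timer, SBT_1S / 2, 0,
	    nvme_qpair_timeout_sketch, qpair, 0);
}
```

The summary's final step, issuing a side-effect-free command after recovery ends so that completions which raced the last poll are reaped promptly, is omitted from the sketch.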
Repository: rG FreeBSD src repository
sys/dev/nvme/nvme_private.h

| Line | Comment |
|---|---|
| 317 | "dying" makes sense, but perhaps the intent would be clearer if the variable name more closely matched the function name. is_destruct would be the most on point, but is_destructing or is_destroying might work as well. |
sys/dev/nvme/nvme_qpair.c

| Line | Comment |
|---|---|
| 542 | Perhaps update comment to "qpair is recovering, likely because a controller reset is in" |
| 1035 | Is there a scenario in which deadline isn't initialized? For example, if the timer is armed and the current request is an AER (i.e. req->timeout == false). Will a second AER set qpair->deadline to whatever value is in the previous AER's timeout? If so, does it matter? |
sys/dev/nvme/nvme_private.h

| Line | Comment |
|---|---|
| 317 | Dying is used elsewhere in the kernel for this purpose, but I'd be interested to hear what others have to say. |
sys/dev/nvme/nvme_qpair.c

| Line | Comment |
|---|---|
| 542 | Good point. |
| 1035 | In practice this hasn't been an issue, but the intent here is that we want to set the deadline to the oldest item still in the outstanding request list with a timeout. In debug output I've seen this scenario printed but didn't understand it: I'd see the timeout flip between SBT_MAX and sane values on a time scale of a few hours, and this scenario (or a similar one) would explain it. We recover fairly quickly, but that's in a Netflix workload where we have ~1crapton of outstanding I/O at any time and we seem to rotate through them quickly... So this is a very good question. |
I like the general direction. I was also thinking about something like this to avoid callout operations on every I/O.
I haven't looked very closely, but have some comments inline. Plus, I am not sure we really need the deadline variable. I think we may just traverse the outstanding_tr queue and stop when the timeout of some command is still ahead. In most cases that would mean looking only at the first tracker. I don't think we should really look deeper, since I am not sure it is specified anywhere that the controller should fetch and start processing all commands in parallel; and if it doesn't, and commands are dependent, then we probably should not time out a following command before the earlier one(s). Plus it makes the code simpler. ;)
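A minimal sketch of that traversal, assuming outstanding_tr is kept in submission order and each tracker carries a deadline field, with SBT_MAX marking commands (such as AERs) that never time out; names are illustrative:

```c
#include <sys/param.h>
#include <sys/queue.h>

static void
nvme_qpair_scan_deadlines(struct nvme_qpair *qpair, sbintime_t now)
{
	struct nvme_tracker *tr;

	TAILQ_FOREACH(tr, &qpair->outstanding_tr, tailq) {
		if (tr->deadline == SBT_MAX)
			continue;	/* e.g. an AER; never times out */
		if (tr->deadline > now)
			break;		/* oldest real deadline still ahead */
		/* tr is overdue: begin recovery for this qpair. */
	}
}
```

In the common case this looks only at the first tracker, and it sidesteps the stale cached-deadline question raised above for AERs.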
sys/dev/nvme/nvme_qpair.c

| Line | Comment |
|---|---|
| 1014 | There is callout_schedule() for rescheduling with fewer arguments (see the snippet after this table). |
| 1036 | Why have a variable if it is always the same? |
| 1042 | Should we really postulate that we have no different timeouts? We actually already have two: the normal one and 1s for polling. |
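For reference, callout_schedule(9) re-arms a callout with the handler and argument remembered from the most recent callout_reset(), so a self-rescheduling handler only has to pass the interval (timer and handler names as in the sketches above):

```c
/* First arm, e.g. when the qpair is constructed: remembers handler and arg. */
callout_reset(&qpair->timer, hz / 2, nvme_qpair_timeout, qpair);

/* Inside nvme_qpair_timeout() itself, rescheduling needs only this. */
callout_schedule(&qpair->timer, hz / 2);
```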
Update with comments:
- remove the attempt to cache the right deadline (just walk the list instead)
- allow reset to confirm whether there's really a hotplug event
sys/dev/nvme/nvme_qpair.c

| Line | Comment |
|---|---|
| 1042 | Originally I'd wanted to have a single timeout that was half of the longest scheduled timeout, but having it fixed at 1/2s was easier to code. I'm not entirely sure of the benefit of longer polling periods, but I'll keep that in mind if I start to see contention introduced by the 1/2s timeout. |
| 1255 | Now that we are manipulating the tr entries here without a lock, when we arrive here via timeout we could be racing both the submission process (which adjusts the tr without the qpair lock) and other threads completing other trs, which remove things from the list. While we won't crash, we may stop prematurely and/or try to concurrently fail an item that's otherwise completing. Solving these races requires some careful thought; see the sketch after this table. I've had one crash in the three weeks we've been running these patches fleet-wide at Netflix that points to this issue. |
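One possible shape of a fix, sketched on the assumption that the qpair mutex can be taken in the timeout path. Note this alone only closes the completion-side race: since the submission path is said above to adjust trackers without the qpair lock, it would have to start taking it as well. The helper nvme_qpair_timeout_tracker() is hypothetical.

```c
#include <sys/param.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/queue.h>

static void
nvme_qpair_fail_overdue(struct nvme_qpair *qpair, sbintime_t now)
{
	struct nvme_tracker *tr, *tr_tmp;

	mtx_lock(&qpair->lock);
	TAILQ_FOREACH_SAFE(tr, &qpair->outstanding_tr, tailq, tr_tmp) {
		if (tr->deadline > now)
			break;	/* submission order: the rest are newer */
		/*
		 * With the lock held, completion can't unlink entries
		 * mid-walk, so we neither stop prematurely nor race a
		 * tracker that is completing on another CPU.
		 */
		nvme_qpair_timeout_tracker(qpair, tr);	/* hypothetical */
	}
	mtx_unlock(&qpair->lock);
}
```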
sys/dev/nvme/nvme_private.h

| Line | Comment |
|---|---|
| 180 | This seems to be unused now. |
sys/dev/nvme/nvme_qpair.c

| Line | Comment |
|---|---|
| 978 | If nothing is complete, why is it a missed interrupt? |
| 1144 | What's the point of this condition? I suppose the goal of the assignment below is to avoid a false reaction to a tracker that is still inside busdma, which may take some time, but then the assignment should also be done. |
sys/dev/nvme/nvme_private.h

| Line | Comment |
|---|---|
| 180 | It is. |
sys/dev/nvme/nvme_qpair.c

| Line | Comment |
|---|---|
| 978 | If we find work done in the timeout that we should have found in the ISR, that's a classic symptom of a missed interrupt: we should have had an interrupt to notify us, but instead we timed out and found work. Normally we'd expect to find things stuck and have to cancel/reset to fix them. Though to be pedantic, it's only a missed interrupt if the completions we find are past their deadline and not some other work. (See the sketch after this table.) |
| 1144 | I think you may be right that this is unwise here. I need to walk through the race I found in this code to see if this even makes sense anymore. |
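The heuristic in that reply, sketched under the assumption that the completion-processing routine reports whether it retired anything (the boolean return, and the recovery_state field from the earlier sketch, are assumptions):

```c
static void
nvme_qpair_poll_from_timeout(struct nvme_qpair *qpair)
{
	bool found = nvme_qpair_process_completions(qpair);

	if (found && qpair->recovery_state == RECOVERY_NONE) {
		/*
		 * The ISR should have reaped this work; discovering it
		 * from the timer instead is the classic signature of a
		 * missed interrupt (pedantically, only if the completions
		 * were past their deadline rather than other work).
		 */
		nvme_printf(qpair->ctrlr, "Missed interrupt\n");
	}
}
```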
Update printfs to be better (per mav@'s critique)
Remove a comment that I don't think is true anymore (but was true in earlier code)