cdcheckmedia() has an asynchronous mode. As far as I can see, nothing
holds a reference to the periph while the asynchonous state machine is
executing. See
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=276251#c14 for more
details.
Details
- Reviewers
ken imp - Group Reviewers
cam - Commits
- rG979e15bbf0cb: scsi_cd: Maintain a periph reference during media checks
rG8b620483bbd6: scsi_cd: Use a bool for the second parameter of cdcheckmedia()
rG212af7b6133a: scsi_cd: Maintain a periph reference during media checks
rGfe44b0cae643: scsi_cd: Use a bool for the second parameter of cdcheckmedia()
rGc961afe82596: scsi_cd: Maintain a periph reference during media checks
rGb1710124ff14: scsi_cd: Use a bool for the second parameter of cdcheckmedia()
I did some testing with a (populated) USB CD reader, unplugging it while it was probing the disc etc.
Diff Detail
- Repository
- rG FreeBSD src repository
- Lint
Lint Skipped - Unit
Tests Skipped - Build Status
Buildable 55514 Build 52403: arc lint + arc unit
Event Timeline
The int to bool thing likely should be a separate commit though.
We don't call the strategy routine w/o calling open. geom calls it to taste the drive, and it doesn't do that until it acqures access which calls d_open.
I don't think that we can be withered until tasting is complete.
And I don't think we fail the I/O in the devices... those have to complete before the withering proceeds....
But... cdoninvalidate will flush the queue of I/O requests with an error and call disk_gone... that would mean the I/O that's queued up waiting for the media check would be completed while the media check is running...
So doing an aquire I think is needed to keep the cdoninvalidate from running until this is done. While we had an acquire that we got from cdopen when we entered, the I/O that followed that open is queued, so that cdclose might run once geom gets that I/O failed. the cddiskgonecb doesn't release the acquire from open, but a special one we take just before creating the disk.
About the normal state: we enter that once we've 'probed' the CD in our startup state machine. So we're able to do enough I/O to the drive to get its device particulars.... So we're completing the single stepping probe, so that's not the issue I saw: the probe would single step and fail, but an async call would then try to do I/O on the now defunct device.
tl'dr: while I may quibble where you do the acquire, I think I've talked myself into agreeing it's necessary because the I/O from the open isn't enough to keep close from running.
sys/cam/scsi/scsi_cd.c | ||
---|---|---|
2699 | So if we get an error from acquire for the chstategy case, then there's nothing to clock the CD I/o that triggered this cdcheckmedia. However, that's if periph is NULL (not here) or if the flags say it is invalid. That means we're racing a destroy and should get out, so that's OK I think. The oninvalidate callback will take care of failing the I/O that caused this cdcheckmedia. Though I don't think this is the right place for it... Or maybe the conditional is wrong.. I'm unsure... |
Slight change of mind after explaining in the bug and thinking what I wrote wasn't quite right. I don't think sync vs async is a useful disctinction: we're sending I/O to the device to do a thing that the cd driver wants independent of what others want, and that needs an acquire even if other things we're doing might happen to hold a ref when we do it sync.
sys/cam/scsi/scsi_cd.c | ||
---|---|---|
2699 | That is to say, maybe we should always acquire here even in the sync cases. The sync cases have a ref that means they won't go away, true. But we're doing something that needs a ref and making it conditional of the 'need' likely is asking for trouble. If we always do it, then we'll be safe even if the sync cases do something weird like release the ref they started the mediacheck with out of band so that the device can go away while they are sleeping (that's likely also a bug, but the extra layer here may help us fail safe on that bug). So I don't think we should check do_wait here, but always take it out. I don't think we should move this like I suggested a few minutes ago to cdstrategy. Then in the done routine, we always release it. That would be more robust and more in keeping with a 'I need a reference to do a thing' rather than 'Oh, it might be unsafe, I'll take a reference to do that thing'. Acquire is cheap enough and this is infrequent. the xpt bus lock isn't contested. |
Ok.
We don't call the strategy routine w/o calling open. geom calls it to taste the drive, and it doesn't do that until it acqures access which calls d_open.
I don't think that we can be withered until tasting is complete.
And I don't think we fail the I/O in the devices... those have to complete before the withering proceeds....But... cdoninvalidate will flush the queue of I/O requests with an error and call disk_gone... that would mean the I/O that's queued up waiting for the media check would be completed while the media check is running...
So doing an aquire I think is needed to keep the cdoninvalidate from running until this is done. While we had an acquire that we got from cdopen when we entered, the I/O that followed that open is queued, so that cdclose might run once geom gets that I/O failed. the cddiskgonecb doesn't release the acquire from open, but a special one we take just before creating the disk.
Right.
About the normal state: we enter that once we've 'probed' the CD in our startup state machine. So we're able to do enough I/O to the drive to get its device particulars.... So we're completing the single stepping probe, so that's not the issue I saw: the probe would single step and fail, but an async call would then try to do I/O on the now defunct device.
Note that the first attachment in PR 276251 contains this line:
Jan 5 09:57:53 master kernel: cd0: 0MB (1 0 byte sectors)
Presumably we hadn't yet successfully queried the device, hence the async query from cdstrategy().
Note that the first attachment in PR 276251 contains this line:
Jan 5 09:57:53 master kernel: cd0: 0MB (1 0 byte sectors)Presumably we hadn't yet successfully queried the device, hence the async query from cdstrategy().
Kinda. If we print the above line, we have successfully queried the device. We went through our startup sequence, found there was no media in the device and made a note of that. So we're in the normal state now. However, since we noted we didn't have media cdstrategy now will do the mediachange query before any reads or writes are allowed through.
The only reason I parse carefully what you said is because entering normal mode is an important detail for this driver.
Ok, I agree.
sys/cam/scsi/scsi_cd.c | ||
---|---|---|
2699 | Note that cdstrategy() explicitly checks for an invalid periph before calling cdcheckmedia(), so I'm pretty sure an error can't actually occur here in the async case. So the idea behind always acquiring the ref is that it ensures that synchronous probing is protected even if the sleeping thread wakes up and releases its ref before that state machine has fully run? Ok, that seems reasonable. |