Page MenuHomeFreeBSD

cam/da: Call cam_periph_invalidate on ENXIO in dadone
Needs ReviewPublic

Authored by imp on Sat, Jan 25, 8:10 PM.

Details

Reviewers
mav
chs
ken
Group Reviewers
cam
Summary

Use cam_periph_invalidate() instead of just setting the PACK_INVALID
flag in the da softc. It's a more appropriate and bigger hammer for this
case. PACK_INVALID is set as part of that, so remove the now-redundant
setting, but use it to only complain the first time we hit the
condition. This also has the side effect of short-circuiting errors for
other I/O still in the drive which is just about to fail (sometimes with
different error codes than what triggered this ENXIO).

The prior practice of just setting the PACK_INVALID flag, however, was
too ephemeral to be effective.. Since daopen would clear PACK_INVALID
after a successful open, we'd have to rediscover the error (which takes
tens of seconds) for every different geom tasting the drive. These two
factors lead to a watchdog before we could get through all the devices
if we had multiple failed drives with this syndrome. By invalidating the
periph, we fail fast enough to reboot enough to start petting the
watchdog. If we disable the watchdog, the tasting eventually completes,
but takes over an hour which is too long. As it is, it takes an extra
minute per failed drive, which is tolerable.

While cam_periph_error's asc/ascq tables have a SSQ_LOST flag for this
situation, that flag also fails the other periph drivers, like pass,
attached to the device. The docs for these codes are too sparse to help
decide what to do. Err on the side of caution and only invalidate the da
device. Simple commands to collect logs for the vendor still work w/o
hangling the system or other adverse effects. Therefore, I've not added
SSQ_LOST to the asc/ascq table entries for the newer codes.

We can also simplify the logic w/o bloating the change, so do that too.

Sponsored by: Netflix

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped
Build Status
Buildable 62032
Build 58916: arc lint + arc unit

Event Timeline

The purpose of invalidating the pack in the da(4) driver is to deal with removable media. If the removable disk is pulled while the device is open, the pack is invalidated. Getting through a successful open again clears the flag.

With this change, we wind up invalidating the device, which isn't what we want to do with removable media.

We probably need to distinguish between SIP_MEDIA_FIXED and SIP_MEDIA_REMOVABLE in terms of how we handle errors like this. i.e. you can still get the fast fail behavior you're looking for (which I think is helpful) for fixed disks.

In D48689#1110375, @ken wrote:

The purpose of invalidating the pack in the da(4) driver is to deal with removable media. If the removable disk is pulled while the device is open, the pack is invalidated. Getting through a successful open again clears the flag.

With this change, we wind up invalidating the device, which isn't what we want to do with removable media.

We probably need to distinguish between SIP_MEDIA_FIXED and SIP_MEDIA_REMOVABLE in terms of how we handle errors like this. i.e. you can still get the fast fail behavior you're looking for (which I think is helpful) for fixed disks.

We already do the right thing for the ASC/ASCQ that indicate that the media is gone / changed. For those (mostly 3a) ASC codes we just set the PACK_INVALID bit. The removable media won't change with this change. the error codes we have flagged for EXNIO are typically the drive reporting that some subsystem of the drive has failed, that it can't pass diagnostics, etc.

However, the ENXIO error from daerror indicates that the ASC/ASCQ code from the I/O is believed to always be fatal and no further I/O to the LBAs will be possible. This is true independent of whether the drive has removable media or not. For those that might affect removable media, they are typically marked retry though there may be one or two marked as EIO. EIO is returned from daerror/cam_periph_error() for ordinary I/O errors. Plus we have extra logic in daerror() to cope with things like 3a which explicitly report the media is gone. In those cases, we'll not invalidate the da periph, but just set the PACK_INVALID bit. So unless a removable device fails with these asc/ascq codes and that failure can be fixed by popping in a different disk, I think that this is OK to do for all disks. My experience in the past, admittedly with just JAZ and ZIP drives, is that when there's an error, it will be some kind of media error or some kind of timeout, neither of which will trigger this path.

Without this change, and because we clear PACK_INVALID in daopen unconditionally (or nearly so: we only fail when the periph has already been invalidated), failing drives will be retried over and over in the geom tasting process at boot, which takes tens of minutes to complete per failed drive. 2 or 3 failed drives like this is enough to hit watchdog preventing coming up at all.

OK. I've read though the code another half dozen times. After the first 5 I thought I was still right. But on the 6th time I realized that another change is needed.... this code is suhtle.

The problem is that the current code is serving two different master. There's two errors that are signaled with ENXIO. One is 'the drive can never do I/O again', and the other adds 'until new media is inserted'. to the end. Both cd and da already set a bit to say the media is invalid when we see this 0x3a asc. But then try to set it again when we get ENXIO. I'd like to make ENXIO only for errors that aren't just 'media missing' and have another error for media missing. So w/o changing that for asc 0x3a, you're right, this does break removable media, at least for da.

So there's a few alternatives.

(1) Just do this for non-removeable drives, like you suggest. This is fine, for me, but we'd still have the case where we're talking to a mostly broken drive that has removable media, but the problems aren't just missing media.
(2) I can add SSQ_LOST to a bunch more ASC/ASCQ entries (basically all the ENXIO errors, except 0x3a). This would indirectly cause invalidate the da device, but also the pass device, which I'm trying to avoid.
(3) I could keep the test for PACK_INVALID for the whole if. That way we don't do it for the 0x3a that we saw on media missing errors (since we set that in daerror), but do it for all the other asc/acsq codes. I'd still invalidate then, but since we don't do it with an already invalid pack, the removable media devices would still work. It's imperfect in the face of multiple types of errors, So if the drive really goes away with the media missing, we have two cases. if the SIM signals it with CAM_DEV_NOT_THERE, then we'll do the SSQ_LOST action so that's good. If it's less than a total failure to the SIM, but just a really bad ASC / ASCQ pair, then we'll not invalidate the device. That might be a rare enough event to be OK: we'd already not be doing I/O to the drive, so not having the periph go away is likely fine. We'll clear the PACK_INVALID on the next daopen, and we'd get the other ASC then.
(4) I could invent a new error code for this case. Something between EIO and ENXIO. Let's call it EAGAIN_IO, but I'd have to make sure it doesn't leak outside of CAM. That's a more precise signal, but a bigger hassle.

I'll see if I can make (3) work and test it with qemu + scsi cd at least. I'll see if there's some VM that can also implement a ZIP or JAZZ drive. I no longer have removable da devices, though, that aren't usb sticks that have the SIM and periph go away too. If I can't make (3) work, I'll fall back to (1).

Looks like qemu can do removable da disks. So I should be able to test both.
And I also noticed that linux has ENOMEDIUM for my idea of having something between EIO and ENXIO. Too bad it's so hard to add a new error type because the size of the errno string array is baked into static binaries...

So thanks for the feedback, and sorry I was a bit unbelieving of your feedback at first.