Page MenuHomeFreeBSD

cam: Include more statuses as errors for CAM_IO_STATS
ClosedPublic

Authored by imp on Jan 7 2022, 6:15 PM.
Tags
None
Referenced Files
Unknown Object (File)
Fri, Nov 8, 3:55 AM
Unknown Object (File)
Thu, Nov 7, 4:06 PM
Unknown Object (File)
Oct 18 2024, 11:06 AM
Unknown Object (File)
Oct 17 2024, 12:38 AM
Unknown Object (File)
Oct 15 2024, 7:49 AM
Unknown Object (File)
Oct 14 2024, 4:29 PM
Unknown Object (File)
Oct 11 2024, 5:09 PM
Unknown Object (File)
Oct 10 2024, 11:35 AM
Subscribers

Details

Summary

Tag more status return values as an error for the
I/O. CAM_SCSI_STATUS_ERROR is returned for medium errors, for example,
but the counts weren't increased. The added errors all indicate a
problem with the device request.

Sponsored by: Netflix
PR: 260257

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 43770
Build 40658: arc lint + arc unit

Event Timeline

imp requested review of this revision.Jan 7 2022, 6:15 PM

What's the difference between CAM_CMD_TIMEOUT and CAM_SEL_TIMEOUT ?

What's the difference between CAM_CMD_TIMEOUT and CAM_SEL_TIMEOUT ?

A selection timeout means we can't talk to the drive at all. The terminology comes from Parallel SCSI, but it is an error that generally results in the drive going away. A command timeout means that the drive hasn't completed a command within the timeout period. That typically would result in a retry.

In D33783#764108, @ken wrote:

What's the difference between CAM_CMD_TIMEOUT and CAM_SEL_TIMEOUT ?

A selection timeout means we can't talk to the drive at all. The terminology comes from Parallel SCSI, but it is an error that generally results in the drive going away. A command timeout means that the drive hasn't completed a command within the timeout period. That typically would result in a retry.

Yea, I thought that would be a good one to include in the error count, but it's meaning is a bit fuzzy in my mind...

This revision is now accepted and ready to land.Jan 7 2022, 6:30 PM

Does it make sense to include any of these additional errors for the CAM_IO_STATS in sys/cam/nvme/nvme_da.c and sys/cam/ata/ata_da.c?

In D33783#764113, @rew wrote:

Does it make sense to include any of these additional errors for the CAM_IO_STATS in sys/cam/nvme/nvme_da.c and sys/cam/ata/ata_da.c?

nvme_da doesn't have these errors. ata_da already lists CAM_ATA_STATUS_ERROR. I needed to add it to SCSI because we do a lot with ATA passthrough commands that might result in this. I'm on the fence about adding CAM_SMP_STATUS_ERROR to ata_da.c, like I was in adding it here. It's more of a management thing and I've not audited all the commands both send to see if it's possible... CAM_SEL_TIMEOUT is generated by the ata SIMs, so maybe it should be on the list there (nvme_sim doesn't generate this).

Whether it should count as a timeout or an error I was also on the fence about, but came down more on the side of an error since it's more of a device isn't there at all signal, rather than a command submitted to a device hasn't completed error, though I could be pushed one way or the other.

add sel_timeout as an error for ata too.

This revision now requires review to proceed.Jan 7 2022, 7:14 PM

Though thinking about it, I've not verified that the SEL_TIMEOUT error would make it here at all...

In D33783#764114, @imp wrote:
In D33783#764113, @rew wrote:

Does it make sense to include any of these additional errors for the CAM_IO_STATS in sys/cam/nvme/nvme_da.c and sys/cam/ata/ata_da.c?

nvme_da doesn't have these errors. ata_da already lists CAM_ATA_STATUS_ERROR. I needed to add it to SCSI because we do a lot with ATA passthrough commands that might result in this. I'm on the fence about adding CAM_SMP_STATUS_ERROR to ata_da.c, like I was in adding it here. It's more of a management thing and I've not audited all the commands both send to see if it's possible... CAM_SEL_TIMEOUT is generated by the ata SIMs, so maybe it should be on the list there (nvme_sim doesn't generate this).

CAM_SMP_STATUS_ERROR is only going to come back from SMP commands, and those aren't sent by the da(4) driver currently. They don't make any sense in ada(4), because that isn't going to be talking to a SAS controller. They are directed to an SMP target, which is contained inside an expander. Usually the expander has an enclosure processor target and an SMP target in it, and so we direct SMP commands via passthrough to the SES devices. e.g.:

camcontrol smpplist ses0 -v

If there isn't a visible enclosure processor for the expander that has the SMP target you want to talk to, you can direct the SMP command to a disk device, and the mps(4) or mpr(4) driver will actually send it to the parent device in the topology, which would be the expander. So in those cases, you would do:

camcontrol smpplist da5 -v

This stuff is necessary because SMP isn't a first class protocol in CAM with its own specific XPT routines, probe code, etc. I added it primarily to allow turning PHYs on and off which makes disk arrival and departure testing much easier.

Remove SEL_TIMEOUT. Those errors aren't loggged to devctl.
Remove SMP_ERROR too, per ken's reply

This revision was not accepted when it landed; it landed in state Needs Review.Jan 9 2022, 5:18 PM
This revision was automatically updated to reflect the committed changes.