Page MenuHomeFreeBSD

cam: Log errors from passthru commands
ClosedPublic

Authored by imp on Jul 24 2023, 11:44 PM.
Tags
None
Referenced Files
Unknown Object (File)
Sun, Jan 5, 8:49 AM
Unknown Object (File)
Thu, Dec 26, 4:04 AM
Unknown Object (File)
Nov 19 2024, 1:21 PM
Unknown Object (File)
Nov 19 2024, 11:36 AM
Unknown Object (File)
Oct 31 2024, 12:45 AM
Unknown Object (File)
Oct 3 2024, 5:24 AM
Unknown Object (File)
Oct 2 2024, 1:45 PM
Unknown Object (File)
Sep 13 2024, 6:17 AM
Subscribers
None

Details

Summary

Since a30ecd42b8e09 we've logged almost all unexpected errors from
commands. However, some passthru commands were not logged via devctl. To
fix this, pass all requests through passerror (which calls
cam_periph_error), but flag those requests that didn't want error
recovery as SF_NO_RECOVERY, like we do for device probing. By doing this
we get identical behavior to the current code, but log these errors.

We have had hangs on drives that seems to show no error. Vendor analysis
of the drive found an illegal command that happen to hang the drive. In
verifying their analysis, we discovered that the pass through commands
from things like smartctl that encountered errors or timeouts weren't
logged.

Sponsored by: Netflix

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

imp requested review of this revision.Jul 24 2023, 11:44 PM

My position traditionally was that user-space should handle and report its own errors by itself, but kernel should handle some system-wide. I agree that there may be some merits in this. Aside of reporting to devctl this should also properly re-broadcast Unit Attentions, which do not belong to specific command and so periph. As I can see, devctl should already receive device name to differentiate pass-through requests. I wonder it is could report some process identification too?

This revision is now accepted and ready to land.Jul 25 2023, 12:48 AM
In D41167#937442, @mav wrote:

My position traditionally was that user-space should handle and report its own errors by itself, but kernel should handle some system-wide. I agree that there may be some merits in this. Aside of reporting to devctl this should also properly re-broadcast Unit Attentions, which do not belong to specific command and so periph. As I can see, devctl should already receive device name to differentiate pass-through requests. I wonder it is could report some process identification too?

So, I have a plan to log async calls, which should get the unit attention if I understand correctly (which I may not: this is one area of CAM I'm not confident I understand). devctl message does contain passXX for these errors. We don't actually do any error recovery or take other actions (as far as I can tell), but rely on the flags we pass in to never retry, but to report appropriate errors. That set of issues will wait for another day.

As for reporting the process... I can do that for passXX requests, because passerror is called from the context that scheduled the I/O, so currproc can get proc information. xpt_schedule() for the general case isn't so lucky... It's a good idea, but might sometimes report garbage...

I think this will be ok. It will give at least some avenue to report errors that go up via the pass(4) driver without additional console spamming.

In D41167#937442, @mav wrote:

My position traditionally was that user-space should handle and report its own errors by itself, but kernel should handle some system-wide. I agree that there may be some merits in this. Aside of reporting to devctl this should also properly re-broadcast Unit Attentions, which do not belong to specific command and so periph. As I can see, devctl should already receive device name to differentiate pass-through requests. I wonder it is could report some process identification too?

I agree that userspace programs should generally report their own errors. At least in this case the report isn't going to the console.

As for looking for Unit Attentions and notifying other peripherals... If someone is accessing a device via the pass(4) interface, hopefully they are handling the unit attentions appropriately and we don't necessarily need to notify the "primary" peripheral driver about it. When considering doing that, it might help to understand a particular use-case for it.

In D41167#937737, @ken wrote:

As for looking for Unit Attentions and notifying other peripherals... If someone is accessing a device via the pass(4) interface, hopefully they are handling the unit attentions appropriately and we don't necessarily need to notify the "primary" peripheral driver about it. When considering doing that, it might help to understand a particular use-case for it.

The only way to notify the primary driver about the Unit Attention is to go through cam_periph_error(), as this patch does. I did not mean it to be a problem, on a contrary, it is a good feature.

This revision was automatically updated to reflect the committed changes.