ada/da: Ignore CCBs at wrong priority in *start
Abandoned · Public

Authored by imp on Jul 18 2024, 10:02 PM.

Details

Reviewers

mav
jhb

Group Reviewers

cam

Summary

In tracking down lifecycle issues with da and ada, I noticed we'd get
CCBs of the wrong priority for the state of the probe state machine. In
debugging 6c8ab086fed3, I created this patch, but didn't upstream it. I
had thought that it was the only cause of the bad CCBs, but I was
mistaken. We still see this message about twice a month in Netflix's
fleet, though the root cause of 6c8ab086fed3 is now gone (despite
the uncertainty expressed in the log: 1-2 a week before, now 0 in
two years).

One cause can be the dynamic I/O scheduler when we're rate limiting
I/O. We'll call the start routine when a timer expires, but that will
interfere with the state machine.

Another cause of this may be related to the I/O coming in too quickly
while we're recovering the device after a different device fails on
mpr/mps.

So to fail safe, since we have to carefully single-step the queue when
we're running the state machine, only accept CCBs at priority
CAM_PRIORITY_DEV while the state machine is running. Once we're back in
the normal state, only accept CCBs at priority CAM_PRIORITY_NORMAL. I/O
that would normally be scheduled is now deferred (it picks back up again
when we enter the normal mode).
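
As a rough illustration of the filter described above, here is a minimal
user-space sketch. It is not the patch itself: the sketch_* names, the
two-value state and priority enums, and the message text are all
hypothetical stand-ins (the real drivers have more states, and the real
constants are CAM_PRIORITY_DEV and CAM_PRIORITY_NORMAL). It only shows
the shape of the check: the start routine compares the CCB's priority
with the one expected for the current mode and ignores a mismatch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-ins for CAM_PRIORITY_DEV and CAM_PRIORITY_NORMAL. */
enum sketch_prio { SKETCH_PRIO_DEV, SKETCH_PRIO_NORMAL };

/* Hypothetical two-mode view of a periph: probing vs. normal I/O. */
enum sketch_state { SKETCH_STATE_PROBE, SKETCH_STATE_NORMAL };

/* Return true if a CCB at this priority is expected in this state. */
static bool
sketch_ccb_acceptable(enum sketch_state state, enum sketch_prio prio)
{
	assert(state == SKETCH_STATE_PROBE || state == SKETCH_STATE_NORMAL);
	if (state == SKETCH_STATE_PROBE)
		return (prio == SKETCH_PRIO_DEV);
	return (prio == SKETCH_PRIO_NORMAL);
}

/*
 * Model of a start routine. Returns true if the CCB would be used,
 * false if it is ignored (after complaining) because its priority does
 * not match the current mode.
 */
static bool
sketch_start(enum sketch_state state, enum sketch_prio prio)
{
	if (!sketch_ccb_acceptable(state, prio)) {
		fprintf(stderr,
		    "sketch: ignoring CCB at priority %d in state %d\n",
		    (int)prio, (int)state);
		return (false);
	}
	return (true);
}
```

In this model, a normal-priority CCB arriving during probing is dropped
rather than consumed, which corresponds to regular I/O being deferred
until the periph returns to normal mode.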

Also add a whiny message on the off chance others were seeing this
problem to gauge the priority of a fix for the underlying issue.

nda has no real discovery state machine that re-runs after I/O
processing starts, so no workaround is needed there.

Diff Detail

Repository

rG FreeBSD src repository

Lint

Lint Skipped

Unit

Tests Skipped

Build Status

Buildable 58726
Build 55614: arc lint + arc unit

Event Timeline

imp created this revision. Jul 18 2024, 10:02 PM

Herald added a reviewer: cam. Jul 18 2024, 10:02 PM

imp requested review of this revision. Jul 18 2024, 10:02 PM

Harbormaster completed remote builds in B58715: Diff 141096. Jul 18 2024, 10:02 PM

imp added a child revision: D46033: cam/iosched: Add a counter of I/Os that take too long. Jul 18 2024, 10:02 PM

imp removed a child revision: D46033: cam/iosched: Add a counter of I/Os that take too long. Jul 19 2024, 4:20 AM

imp added a parent revision: D46038: cam/iosched: Make each periph driver provide schedule fnp.

Misc updates with testing

Harbormaster completed remote builds in B58726: Diff 141110. Jul 19 2024, 4:23 AM

Note: I think I may hold off committing this one, given D46038 would fix all known instances of it. I'll keep this in my tree here at Netflix to prove it works. If so, I may revise this to be a panic and commit that... Further testing will tell, since I do not have a good reproducer for this. I only see it sometimes in the fleet when, at least in the last few cases I can look at in detail, we have some error.

I may also want to see why we seem to always call it on the expiration of the quantum... That may be a relatively harmless bug that I'd not considered a bug when I did the investigation into all this stuff a couple of years ago. These messages almost certainly indicate that this should be viewed as a bug. I'm unsure what I was thinking when I first discovered it and put the messages in rather than doing a fix like D46038. I don't seem to have notes from the time either... :(

I suspect users won't report random printfs, so if you want to commit this upstream it probably needs to be a panic/KASSERT instead so users will notice.

I'll do this as asserts in 6 months or so if the other fixes I just pushed to -current (and Netflix's tree) eliminate all the priority messages in our logs...
Until then, abandon this to de-clutter things at least a little.