Fix mpr(4) and mps(4) state transitions and a use-after-free panic. When the mpr(4) and mps(4) drivers probe a SATA device, they issue an ATA Identify command (via mp{s,r}sas_get_sata_identify()) before the target is fully setup in the driver. The drivers wait for completion of the identify command, and have a 5 second timeout. If the timeout fires, the command is marked with the SATA_ID_TIMEOUT flag so it can be freed later. That is where the use-after-free problem comes in. Once the ATA Identify times out, the driver sends a target reset, and then frees any identify commands that have timed out. But, once the target reset completes, commands that were queued to the drive are returned to the driver by the controller. At that point, the driver (in mp{s,r}_intr_locked()) looks up the command descriptor for that particular SMID, marks it CM_STATE_BUSY and sends it on for completion handling. The problem at this stage is that the command has already been freed, and put on the free queue, so its state is CM_STATE_FREE. If INVARIANTS are turned on, we get a panic as soon as this command is allocated, because its state is no longer CM_STATE_FREE, but rather CM_STATE_BUSY. So, the solution is to not free ATA Identify commands that get stuck until they actually return from the controller. Hopefully this works correctly on older firmware versions. If not, it could result in commands hanging around indefinitely. But, the alternative is a use-after-free panic or assertion (in the INVARIANTS case). This also tightens up the state transitions between CM_STATE_FREE, CM_STATE_BUSY and CM_STATE_INQUEUE, so that the state transitions happen once, and we have assertions to make sure that commands are in the correct state before transitioning to the next state. Also, for each state assertion, we print out the current state of the command if it is incorrect. mp{s,r}.c: Add a new sysctl variable, dump_reqs_alltypes, that controls the behavior of the dump_reqs sysctl. If dump_reqs_alltypes is non-zero, it will dump all commands, not just the commands that are in the CM_STATE_INQUEUE state. (You can see the commands that are in the queue by using mp{s,r}util debug dumpreqs.) Make sure that the INQUEUE -> BUSY state transition happens in one place, the mp{s,r}_complete_command routine. mp{s,r}_sas.c: Make sure we print the current command type in command state assertions. mp{s,r}_sas_lsi.c: Add a new completion handler, mp{s,r}sas_ata_id_complete. This completion handler will free data allocated for an ATA Identify command and free the command structure. In mp{s,r}_ata_id_timeout, do not set the command state to CM_STATE_BUSY. The command is still in queue in the controller. Since we were blocking waiting for this command to complete, there was no completion handler previously. Set the completion handler, so that whenever the command does come back, it will get freed properly. Do not free ATA Identify commands that have timed out in mp{s,r}sas_add_device(). Wait for them to actually come back from the controller. mp{s,r}var.h: Add a dump_reqs_alltypes variable for the new dump_reqs_alltypes sysctl. Make sure we print the current state for state transition asserts.
Details
To best test this, you need a machine with a SAS expander, and SATA drives attached to the expander. The more drives the better.
Build a kernel with INVARIANTS, and then:
You will be cycling the PHYs for the drives in question on the expander. You can script it, for example:
for i in 10 11 12 13 14 15 16 17 18 19; do while true; do camcontrol smppc ses0 -p $i -o disable; sleep 10; camcontrol smpc ses0 -p $i -o linkreset; sleep 10; done & done
If you run for 24 hours without an assert, you can probably call this fixed.
Once you're done with the test, kill all of the jobs, and then reset the PHYs so that the drives come back. For example:
for i in 10 11 12 13 14 15 16 17 18 19; do camcontrol smppc ses0 -p $i -o linkreset; done
Once all of the drives have come back successfully (make sure that they all come back!), make sure there are not any stuck commands.
To do that:
sysctl dev.mps.0.dump_reqs_alltypes=1
(Use "mpr" for 12Gb controllers, and do that for each controller instance number, 0, 1, 2, etc.)
Then, using either mpsutil or mprutil (6Gb or 12Gb), do this:
mpsutil debug dumpreqs |grep SMID |awk '{if ($4 != 0){print $0}}'
That will show commands that are not free. If you have I/O active, you may see commands that are busy (state 1) and in queue (state 2), but they should be transient. If they're stuck, then we may have a bug.
Diff Detail
- Repository
- rS FreeBSD src repository - subversion
- Lint
Lint Skipped - Unit
Tests Skipped - Build Status
Buildable 31992
Event Timeline
I only looked at mpr, but this looks good to me. It's good you have a way to recreate it. I have random machines failing with this (about ~0.001%/day), but can't find one to recreate it. I like this approach, and is similar to the one I took with other commands and target reset. thanks for fixing.
I'm deploying this to one or two of the machines that we see panics from this every few days.
I've had good luck running this for 2 or 3 firmware cycles now at Netflix... We're down 3/4 in panics, and at least part of the remaining 1/4 appear to be due to a small mismerge of two files so they were out of sync with the rest of the state machine....
what are the plans to commit this?
sys/dev/mpr/mpr.c | ||
---|---|---|
2345 | Actually, I spoke too soon. I hadn't completely audited the diffs from head and came to a false 'this was a due to a mismerge' conclusion. Turns out there's a case this assert catches that's legit (though maybe the place to fix it is in the caller). So, if we can't map in the chain frames properly in mp[rs]_data_cb, then we call mp[rs]_command_complete before we've had a chance to call mp[rs]_enqueue_request which sets the state to INQUEUE. So, since the state is still BUSY, we panic here every time we have a resource shortage that leads to no chains being available. It's rare (1 maybe 2 a week in the Netflix fleet of thousands), but it can happen. Since this is the only other caller than mp[rs]_inter_locked, maybe the best thing to do is to set the state to INQUEUE in the *data_cb routine in the out of chain frames error path instead... |