bhyve ahci: Improve robustness of TRIM handling
Closed · Public
Authored by jhb on Oct 21 2024, 4:09 PM.

Details

Reviewers

emaste
jrm
markj
khorben_defora.org
mav

Group Reviewers

bhyve

Commits

rG2be68ecff81b: bhyve ahci: Improve robustness of TRIM handling
rG3981cf108773: bhyve ahci: Improve robustness of TRIM handling
rG8c8ebbb04518: bhyve ahci: Improve robustness of TRIM handling

Summary

The previous fix for a stack buffer leak in the ahci device model
actually broke the handling of TRIM as one of the checks it added
caused TRIM commands to never be completed. This resulted in command
timeouts if a guest OS did a 'newfs -E' of an AHCI disk, for example.
Also, for the invalid case the previous check was handling, the device
model should be failing with an error rather than claiming success.

To resolve this, validate the length of a TRIM request and fail with
an error if it exceeds the maximum number of supported blocks
advertised via IDENTIFY. In addition, if the PRDT does not provide
enough data, fail the command with an error rather than performing a
partial completion.
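
The two failure cases above can be sketched as a single up-front check. This is a minimal illustration, not the code from pci_ahci.c: the enum, the function name, and its parameters are hypothetical, and the caller is assumed to already know the request length, the advertised maximum, and the total bytes described by the PRDT.

```c
#include <stddef.h>

/* Hypothetical result codes; not taken from pci_ahci.c. */
enum trim_status { TRIM_OK, TRIM_ERR_TOO_LONG, TRIM_ERR_SHORT_PRDT };

/*
 * Validate a TRIM request once, before any range is dispatched.
 * req_bytes:  length of the range buffer the guest asked us to read
 * max_bytes:  largest buffer the device advertised via IDENTIFY
 * prdt_bytes: total bytes actually described by the PRDT
 */
static enum trim_status
trim_validate(size_t req_bytes, size_t max_bytes, size_t prdt_bytes)
{
	if (req_bytes > max_bytes)
		return (TRIM_ERR_TOO_LONG);	/* fail, do not truncate */
	if (prdt_bytes < req_bytes)
		return (TRIM_ERR_SHORT_PRDT);	/* fail, no partial completion */
	return (TRIM_OK);
}
```

Either error would map to failing the command rather than reporting success.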

This is somewhat complicated by the implementation of TRIM in the ahci
device model. A single TRIM request can specify multiple LBA ranges.
The device model handles this by dispatching blockif_delete() requests
one at a time. When a blockif_delete() request completes, the device
model locates the TRIM buffer and searches for the next LBA range to
handle. Previously, the device model would re-read the trim buffer
from guest memory each time. However, this was subject to some
unpleasant races if the guest changed the PRDT entries or CFIS while a
command was in flight. Instead, read the buffer of trim ranges once
and cache it across multiple internal blockif requests.
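
A rough sketch of the read-once approach follows. The struct and function names are hypothetical and the real device model is organized differently. The entry layout (a 48-bit little-endian LBA followed by a 16-bit block count) follows the ATA TRIM range format.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TRIM_BUF_SIZE   512	/* one block of range entries (assumed) */
#define TRIM_ENTRY_SIZE 8	/* 6-byte LBA + 2-byte block count */

/* Hypothetical per-command state; field names are illustrative. */
struct trim_state {
	uint8_t buf[TRIM_BUF_SIZE];	/* private copy of the guest's ranges */
	size_t  len;			/* valid bytes in buf */
	size_t  done;			/* bytes already consumed */
};

/* Copy the guest's range list exactly once, at command start. */
static void
trim_state_init(struct trim_state *ts, const uint8_t *guest, size_t len)
{
	if (len > sizeof(ts->buf))
		len = sizeof(ts->buf);
	memset(ts->buf, 0, sizeof(ts->buf));
	memcpy(ts->buf, guest, len);
	ts->len = len;
	ts->done = 0;
}

/*
 * Fetch the next non-empty range from the cached copy.  Returns 1 and
 * fills *lba and *nblocks, or 0 when every entry has been consumed.
 */
static int
trim_next_range(struct trim_state *ts, uint64_t *lba, uint16_t *nblocks)
{
	while (ts->done + TRIM_ENTRY_SIZE <= ts->len) {
		const uint8_t *e = &ts->buf[ts->done];

		ts->done += TRIM_ENTRY_SIZE;
		*nblocks = (uint16_t)(e[6] | (e[7] << 8));
		if (*nblocks == 0)
			continue;	/* skip zero-length entries */
		*lba = (uint64_t)e[0] | (uint64_t)e[1] << 8 |
		    (uint64_t)e[2] << 16 | (uint64_t)e[3] << 24 |
		    (uint64_t)e[4] << 32 | (uint64_t)e[5] << 40;
		return (1);
	}
	return (0);
}
```

Each completion callback would call trim_next_range() on the cached state, so later guest writes to the buffer cannot change which ranges get deleted.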

Fixes: 71fa171c6480 bhyve: Initialize stack buffer in pci_ahci
Sponsored by: The FreeBSD Foundation

Diff Detail

Repository

rG FreeBSD src repository

Lint

Lint Not Applicable

Unit

Tests Not Applicable

Event Timeline

jhb created this revision. Oct 21 2024, 4:09 PM

Herald added a reviewer: bhyve. Oct 21 2024, 4:09 PM

Herald added subscribers: bcran, rgrimes, imp.

jhb requested review of this revision. Oct 21 2024, 4:09 PM

Harbormaster completed remote builds in B60125: Diff 145297. Oct 21 2024, 4:09 PM

Note that 71fa171c6480 has not been MFC'd due to these outstanding issues.

This fixes a regression from the previous fix. With current main, if you boot a VM with an AHCI-attached disk backed by a zvol (so it supports TRIM) and run `newfs -E /dev/ada0`, the guest FreeBSD kernel hangs in a loop of AHCI timeouts, as mav@ worried in the previous review. I hadn't expected that the change from the previous review was actually broken, but my guess is that the added `done >= sizeof(buf) - 8` check was wrong. It probably should have been `>` instead: with a full 512-byte block, the loop breaks before the last valid entry and never sends a reply, leading to the hang.
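
To make the suspected off-by-one concrete, here is a small model of the bounds check. It is not the loop from pci_ahci.c, only a sketch that counts how many 8-byte entries of a full 512-byte buffer are visited under each comparison. The function names are made up for illustration.

```c
#include <stddef.h>

#define BUF_SIZE   512	/* one full block of TRIM range entries */
#define ENTRY_SIZE 8	/* bytes per range entry */

/* Entries visited when the scan stops on done >= BUF_SIZE - ENTRY_SIZE. */
static size_t
entries_with_ge(void)
{
	size_t done, n = 0;

	for (done = 0; !(done >= BUF_SIZE - ENTRY_SIZE); done += ENTRY_SIZE)
		n++;
	return (n);
}

/* Entries visited when the scan stops on done > BUF_SIZE - ENTRY_SIZE. */
static size_t
entries_with_gt(void)
{
	size_t done, n = 0;

	for (done = 0; !(done > BUF_SIZE - ENTRY_SIZE); done += ENTRY_SIZE)
		n++;
	return (n);
}
```

With `>=` the scan stops at offset 504 and sees 63 of the 64 entries. With `>` it also visits the entry at offset 504 and sees all 64.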

That said, while I could have added validation on each re-read of the CFIS and PRDT after each blockif_delete(), I prefer to avoid the weird TOCTOU-style races and just read it once and validate it once. This does fix the timeouts I see with newfs on main. I have not tested the reported bug though (I was hoping Pierre might have a reproducer guest image he can test against this?)

This approach does also make it easier if we wanted to support multi-block TRIM buffers btw. We probably should have a #define for the number of blocks and the length check I've added should be against that value (maybe before multiplying by 512?) and that value should be what we return in IDENTIFY. However, you could then just change that one knob to the desired number of blocks to support. I'm not sure it matters though? My guess is 1 block is enough for typical workloads?
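
A sketch of that single knob might look like the following. All names here are hypothetical. The use of IDENTIFY word 105 for the block limit is an assumption based on the ATA specification and should be checked against the device model.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TRIM_MAX_BLOCKS   1	/* hypothetical knob: range blocks per TRIM */
#define TRIM_BLOCK_SIZE   512
#define TRIM_BUF_BYTES    ((size_t)TRIM_MAX_BLOCKS * TRIM_BLOCK_SIZE)
#define IDENTIFY_DSM_WORD 105	/* ASSUMPTION: word reporting the limit */

/* Advertise the limit to the guest from the same constant. */
static void
identify_set_trim_limit(uint16_t *words)
{
	words[IDENTIFY_DSM_WORD] = TRIM_MAX_BLOCKS;
}

/* Check the block count itself, before multiplying by 512. */
static bool
trim_count_ok(uint32_t nblocks)
{
	return (nblocks <= TRIM_MAX_BLOCKS);
}
```

The cached buffer would then be sized with TRIM_BUF_BYTES, so raising the limit means changing one define.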

There are also various races (I think) with the CFIS being changed by the guest while a request is in flight. We really should be caching the CFIS for the duration of a command. The issue there though is that the CFIS is variable-sized. :( We could at least cache the common header, though, which I think would probably handle all of the races I can see.

usr.sbin/bhyve/pci_ahci.c
864	This being conditional in the old code did not make sense to me. I suspect it was a bug in the old code (not related to the SA), but you would only hit it if you had a TRIM buffer that was completely empty (all lengths zero).

mav added inline comments. Oct 21 2024, 4:43 PM

usr.sbin/bhyve/pci_ahci.c
864	I don't remember what I was thinking back then, but looking at it now, it seems to break the recursion of ahci_handle_port() -> ahci_handle_slot() -> ahci_handle_cmd() -> ahci_handle_dsm_trim() -> ahci_handle_port().

jhb added inline comments. Oct 21 2024, 5:24 PM

usr.sbin/bhyve/pci_ahci.c
864	Hmmm, ok. So I should put it back then I guess.
935	Does that mean I should not call this here? This is always "first".

mav added inline comments. Oct 21 2024, 5:52 PM

usr.sbin/bhyve/pci_ahci.c
935	I think so. And not only ahci_handle_port(), but I suppose the previous two lines also, since the command was never marked pending.

Correct synchronous command completion handling

Harbormaster completed remote builds in B60131: Diff 145310. Oct 21 2024, 7:00 PM

jhb marked 2 inline comments as done. Oct 21 2024, 7:25 PM

@mav does this version look ok? It still works for me with the basic 'newfs -E' test in a VM.

emaste added inline comments. Oct 23 2024, 2:11 PM

usr.sbin/bhyve/pci_ahci.c
877	Maybe a KASSERT to document that it must be `ATA_SEND_FPDMA_QUEUED`?

Looks good to me. Thanks.

This revision is now accepted and ready to land. Oct 23 2024, 2:20 PM

jhb added inline comments. Oct 24 2024, 2:00 PM

usr.sbin/bhyve/pci_ahci.c
877	Such an assertion can fail if the guest modifies the CFIS while the command is in-progress. If we care about those races then we need a separate change to read and cache the CFIS at the start of command processing and free it after the command completes. Note that if the ncq flag is "wrong" we don't crash, we just write a different result into the FIS. This might confuse the guest, but it shouldn't impact the hypervisor.

Closed by commit rG8c8ebbb04518: bhyve ahci: Improve robustness of TRIM handling (authored by jhb). Oct 24 2024, 2:19 PM

This revision was automatically updated to reflect the committed changes.

jhb added a commit: rG8c8ebbb04518: bhyve ahci: Improve robustness of TRIM handling.

emaste added inline comments. Oct 24 2024, 3:48 PM

usr.sbin/bhyve/pci_ahci.c
877	Would `else if (cfis[2] == ATA_SEND_FPDMA_QUEUED)` make sense?

jhb added inline comments. Oct 25 2024, 1:18 PM

usr.sbin/bhyve/pci_ahci.c
877	But then what do you do in the third case? Especially given that this is in the continuation phase, where we have already emitted at least one trim. Also, there are many other places that read the CFIS multiple times in this device model. If we do care about such races, we will need to cache the CFIS instead of fixing all these places to fail with errors if the CFIS changed.

imp added inline comments. Oct 25 2024, 1:51 PM

usr.sbin/bhyve/pci_ahci.c
877	Since there are only 32 CFIS, and since they are small, it would be better to allocate them into a slot (like real hardware does) and pass that around instead of guest memory. It would be a better emulation of the DMA that's done, since the drive sees only one version of the CFIS, and it's undefined what happens if you change the CFIS after submitting the command. I'd also be tempted to say `ncq = (cfis[2] == ATA_SEND_FPDMA_QUEUED)` instead, so we only do ncq completion processing on the relatively rare ncq trim command (though we could avoid this whole mess by not advertising ncq trim support, but that would pessimize some applications that don't want to pay the queueing penalty on latency, and the avoided mess is small).

jhb added inline comments. Oct 25 2024, 4:25 PM

usr.sbin/bhyve/pci_ahci.c
877	I'm happy to fix the model to cache the CFIS; that's just an orthogonal change and isn't TRIM specific. The main thing is I didn't read the SATA spec (or is it ATA? I had to look at three different specs to try to understand AHCI) closely enough to determine what the upper bound on the CFIS size is. We can easily malloc a copy of it that we pass around, though we also still need the original address so that code can read the PRDT for commands that use it. Currently they just read from `cfis + 0x80`.
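
A minimal sketch of the per-slot CFIS snapshot discussed in this thread is below. Everything here is hypothetical: the struct, the function names, and in particular the 64-byte bound on the CFIS, which the thread leaves undetermined and which would need to be confirmed against the AHCI spec. The 32 slots and the 0x80 PRDT offset come from the comments above.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define AHCI_NSLOTS   32	/* command slots, per the thread */
#define CFIS_MAX_SIZE 64	/* ASSUMPTION: bound not confirmed in the thread */
#define PRDT_OFFSET   0x80	/* PRDT offset from the CFIS, per the thread */

/* Hypothetical per-slot command state. */
struct slot_cmd {
	uint8_t        cfis[CFIS_MAX_SIZE];	/* private snapshot */
	const uint8_t *guest_cfis;		/* original address, for the PRDT */
};

static struct slot_cmd slot_cmds[AHCI_NSLOTS];

/* Snapshot the CFIS once when command processing starts. */
static struct slot_cmd *
slot_cmd_start(int slot, const uint8_t *guest_cfis, size_t cfis_len)
{
	struct slot_cmd *sc = &slot_cmds[slot];

	if (cfis_len > sizeof(sc->cfis))
		cfis_len = sizeof(sc->cfis);
	memset(sc->cfis, 0, sizeof(sc->cfis));
	memcpy(sc->cfis, guest_cfis, cfis_len);
	sc->guest_cfis = guest_cfis;
	return (sc);
}

/* The PRDT is still located relative to the original guest address. */
static const uint8_t *
slot_cmd_prdt(const struct slot_cmd *sc)
{
	return (sc->guest_cfis + PRDT_OFFSET);
}
```

Handlers would then read command fields such as `cfis[2]` from the snapshot, so a guest rewriting the CFIS mid-command could no longer change how the command completes.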