Page MenuHomeFreeBSD

ata_da: add quirk to disable NCQ TRIM for Samsung 860/870 SSDs
ClosedPublic

Authored by avg on Feb 18 2024, 3:47 PM.
Tags
None
Referenced Files
F107006420: D43961.diff
Wed, Jan 8, 9:02 PM
Unknown Object (File)
Mon, Jan 6, 1:55 PM
Unknown Object (File)
Dec 9 2024, 12:25 AM
Unknown Object (File)
Nov 16 2024, 10:09 PM
Unknown Object (File)
Nov 16 2024, 12:26 AM
Unknown Object (File)
Nov 12 2024, 2:55 AM
Unknown Object (File)
Nov 8 2024, 12:59 PM
Unknown Object (File)
Nov 8 2024, 12:05 PM
Subscribers
None

Details

Summary

NCQ TRIM for Samsung 860/870 SSDs results in data corruption on systems
with AMD SATA controllers (embedded into a chipset or a central
processor).

This can be easily reproduced using ZFS which uses TRIM and is able to
detect block content changes.

Linux bug report for this issue:

https://bugzilla.kernel.org/show_bug.cgi?id=201693

Since at present we can not limit a quirk based on the contorller / SIM,
apply the quirk in all cases.

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped
Build Status
Buildable 56093
Build 52982: arc lint + arc unit

Event Timeline

avg requested review of this revision.Feb 18 2024, 3:47 PM

I first-hand experienced problems with a device identified as Samsung SSD 870 EVO 250GB SVT02B6Q.
I inferred that 860 is also affected based on the Linux bug report and on a similar problem with 850 as well.

I should clarify that I didn't see just silent data corruption, there were read and write errors as well.

Jul 12 11:32:19 trant kernel: ada4 at ahcich8 bus 0 scbus8 target 0 lun 0
Jul 12 11:32:19 trant kernel: ada4: <Samsung SSD 870 EVO 250GB SVT02B6Q> ACS-4 ATA SATA 3.x device
Jul 12 11:32:19 trant kernel: ada4: Serial Number S6PENU0W400952M
Jul 12 11:32:19 trant kernel: ada4: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 512bytes)
Jul 12 11:32:19 trant kernel: ada4: Command Queueing enabled
Jul 12 11:32:19 trant kernel: ada4: 238475MB (488397168 512 byte sectors)
Jul 12 11:41:20 trant kernel: (ada4:ahcich8:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 10 88 0d 9c 40 0c 00 00 00 00 00
Jul 12 11:41:20 trant kernel: (ada4:ahcich8:0:0:0): CAM status: Uncorrectable parity/CRC error
Jul 12 11:41:20 trant kernel: (ada4:ahcich8:0:0:0): Error 5, Retries exhausted
Jul 12 11:42:10 trant kernel: (ada4:ahcich8:0:0:0): READ_FPDMA_QUEUED. ACB: 60 10 10 07 40 40 01 00 00 00 00 00
Jul 12 11:42:10 trant kernel: (ada4:ahcich8:0:0:0): CAM status: Uncorrectable parity/CRC error
Jul 12 11:42:10 trant kernel: (ada4:ahcich8:0:0:0): Retrying command, 1 more tries remain

I've been looking at bringing 8n the list from Linux that exploded based on this.

In D43961#1002843, @avg wrote:

I should clarify that I didn't see just silent data corruption, there were read and write errors as well.

Jul 12 11:32:19 trant kernel: ada4 at ahcich8 bus 0 scbus8 target 0 lun 0
Jul 12 11:32:19 trant kernel: ada4: <Samsung SSD 870 EVO 250GB SVT02B6Q> ACS-4 ATA SATA 3.x device
Jul 12 11:32:19 trant kernel: ada4: Serial Number S6PENU0W400952M
Jul 12 11:32:19 trant kernel: ada4: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 512bytes)
Jul 12 11:32:19 trant kernel: ada4: Command Queueing enabled
Jul 12 11:32:19 trant kernel: ada4: 238475MB (488397168 512 byte sectors)
Jul 12 11:41:20 trant kernel: (ada4:ahcich8:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 10 88 0d 9c 40 0c 00 00 00 00 00
Jul 12 11:41:20 trant kernel: (ada4:ahcich8:0:0:0): CAM status: Uncorrectable parity/CRC error
Jul 12 11:41:20 trant kernel: (ada4:ahcich8:0:0:0): Error 5, Retries exhausted
Jul 12 11:42:10 trant kernel: (ada4:ahcich8:0:0:0): READ_FPDMA_QUEUED. ACB: 60 10 10 07 40 40 01 00 00 00 00 00
Jul 12 11:42:10 trant kernel: (ada4:ahcich8:0:0:0): CAM status: Uncorrectable parity/CRC error
Jul 12 11:42:10 trant kernel: (ada4:ahcich8:0:0:0): Retrying command, 1 more tries remain

These errors are not drive errors, but cable errors. Uncorrectable parity/CRC errors aren't the drive telling you that the data on it is bad (though it may well be), but it is the controller telling you that there were errors transferring the data. You may get a reprieve with these changes, or you may find it was a false positive and these problems persist.

I've been looking to bring over the entire list from Linux. At the time I originally did this feature, and for a year or two after, I kept it up to date. Then there were not updates to Linux's list for a while so I stopped, but it's clear there's about a dozen drives that should be black listed....

In general, I don't think problems like this are specific to a drive / controller combo, especially with the controller is the most common controller out there. If you read the entire bug report, this seems to be related to firmware version and many reports don't include the controller. So I'd treat the 'it is confined to one type of controller' with a fair amount of skepticism. It's more likely a bad firmware version... however, since we don't have good data on that. globally is fine.

Now, having said all that, these drives have a hit or miss hiss in general (there's a long history of a few people having issues that couldn't quite be nailed down and never quite attributed to a known cause).

This revision is now accepted and ready to land.Feb 18 2024, 4:33 PM

These aren't listed in Linux but I'm OK with them here

Just want to comment that based on the errors I initially thought that the problem was a bad cable or SATA port, so I tried a few cables and a few ports on the motherboard, all with the same result.
Then I searched the internet and found some reports for the SSD model.
Then I tried attaching the SSD to an LSI SAS 2308 controller and all problem immediately went away.
Then I added the quirk and the problem got fixed for connecting to on-board SATA as well.

I went through the Linux bug report again and I see that people reported the same or similar issue with some Marvell and ASMedia controllers.

Also, I'd be happier if we could also create a pim value known bad controllers could set. I strongly doubt these are the only drives affected if the controller matters. Can you report pciconf -lv for your controller?

The controller is

ahci2@pci0:0:17:0:      class=0x010601 rev=0x40 hdr=0x00 vendor=0x1002 device=0x4391 subvendor=0x1043 subdevice=0x84dd
    vendor     = 'Advanced Micro Devices, Inc. [AMD/ATI]'
    device     = 'SB7x0/SB8x0/SB9x0 SATA Controller [AHCI mode]'
    class      = mass storage
    subclass   = SATA
In D43961#1002853, @imp wrote:

These aren't listed in Linux but I'm OK with them here

I see them in drivers/ata/libata-core.c

{ "Samsung SSD 860*",           NULL,   ATA_HORKAGE_NO_NCQ_TRIM |
                                        ATA_HORKAGE_ZERO_AFTER_TRIM |
                                        ATA_HORKAGE_NO_NCQ_ON_ATI },
{ "Samsung SSD 870*",           NULL,   ATA_HORKAGE_NO_NCQ_TRIM |
                                        ATA_HORKAGE_ZERO_AFTER_TRIM |
                                        ATA_HORKAGE_NO_NCQ_ON_ATI }

Also, it might be better to fix the 850 to match the pattern 8[567]0* as well rather than introduce additional entries... Though Samsung shifted their marketing name to PM8xx (and some of those drives are known bad and in the list).

I have this in one of my trees from the last time I did a linux sync (I missed one), with the above suggestion

iff --git a/sys/cam/ata/ata_da.c b/sys/cam/ata/ata_da.c
index de8563a6c229..efdd27c615fa 100644
--- a/sys/cam/ata/ata_da.c
+++ b/sys/cam/ata/ata_da.c
@@ -498,10 +498,10 @@ static struct ada_quirk_entry ada_quirk_table[] =
        },
        {
                /*
-            * Crucial M550 SSDs
+          * Crucial M5[15]0 SSDs
                 * NCQ Trim doesn't work, but only on MU01 firmware
                 */
-           { T_DIRECT, SIP_MEDIA_FIXED, "*", "Crucial CT*M550*", "MU01" },
+         { T_DIRECT, SIP_MEDIA_FIXED, "*", "Crucial CT*M5[15]0*", "MU01" },
                /*quirks*/ADA_Q_NCQ_TRIM_BROKEN
        },
        {
@@ -722,10 +722,10 @@ static struct ada_quirk_entry ada_quirk_table[] =
        },
        {
                /*
-            * Samsung 850 SSDs
+          * Samsung 850 SSDs, 860 and 870s also implicated, so be conservative
                 * 4k optimised, NCQ TRIM broken (normal TRIM fine)
                 */
-           { T_DIRECT, SIP_MEDIA_FIXED, "*", "Samsung SSD 850*", "*" },
+         { T_DIRECT, SIP_MEDIA_FIXED, "*", "Samsung SSD 8[567]0*", "*" },
                /*quirks*/ADA_Q_4K | ADA_Q_NCQ_TRIM_BROKEN
        },
        {

Based on earlier entries for Samsung 8x0 drives in ata_da.c and scsi_da.c it seems that a separate entry per model line was a preference.
But combing multiple entries is also okay.
However, I'd like to do it as a separate commit a bit later, unless you want to do that yourself.

This revision now requires review to proceed.Feb 19 2024, 10:03 AM
This revision was not accepted when it landed; it landed in state Needs Review.Feb 19 2024, 10:09 AM
This revision was automatically updated to reflect the committed changes.

Yea... lsi does no ncq trim. I'd bet higher money it's disabled in the controller on linux... I'd like to do similar on FreeBSD. I'd be super surprised I if it were only the drive combo...

In D43961#1002858, @avg wrote:
In D43961#1002853, @imp wrote:

These aren't listed in Linux but I'm OK with them here

I see them in drivers/ata/libata-core.c

{ "Samsung SSD 860*",           NULL,   ATA_HORKAGE_NO_NCQ_TRIM |
                                        ATA_HORKAGE_ZERO_AFTER_TRIM |
                                        ATA_HORKAGE_NO_NCQ_ON_ATI },
{ "Samsung SSD 870*",           NULL,   ATA_HORKAGE_NO_NCQ_TRIM |
                                        ATA_HORKAGE_ZERO_AFTER_TRIM |
                                        ATA_HORKAGE_NO_NCQ_ON_ATI }

Hmm that must have been added after my last git update... I'll have to see what else is there..,