Page MenuHomeFreeBSD

lib/libc/aarch64/string: add strcmp SIMD implementation
ClosedPublic

Authored by getz on Jul 2 2024, 7:04 PM.
Tags
None
Referenced Files
Unknown Object (File)
Fri, Jan 24, 11:24 PM
Unknown Object (File)
Fri, Jan 24, 10:40 PM
Unknown Object (File)
Fri, Jan 24, 10:35 PM
Unknown Object (File)
Fri, Jan 24, 10:15 PM
Unknown Object (File)
Fri, Jan 24, 9:41 PM
Unknown Object (File)
Fri, Jan 24, 5:16 PM
Unknown Object (File)
Fri, Jan 24, 2:26 AM
Unknown Object (File)
Thu, Jan 23, 6:33 PM
Subscribers

Details

Summary

This changeset includes a port of the SIMD implementation of strcmp
for amd64 to Aarch64.

Below is a description of its method as described in D41971

The basic idea is to process the bulk of the string in aligned blocks
of 16 bytes such that one string runs ahead and the other runs behind.
The string that runs ahead is checked for NUL bytes, the one that runs
behind is compared with the corresponding chunk of the string that runs
ahead. This trades an extra load per iteration for the very complicated
block-reassembly needed in the other implementations (bionic, glibc).
On the flip side, we need two code paths depending on the relative
alignment of the two buffers.

The initial part of the string is compared directly if it is known not
to cross a page boundary. Otherwise, a complex slow path to avoid
crossing into unmapped memory commences.

Performance is better in most cases than the existing implementation
from the Arm Optimized Routines repository.

os: FreeBSD
arch: arm64
cpu: ARM Cortex-A76 r4p1
                     │  strcmpARM  │             strcmpSIMD              │          strcmpSIMDupdated          │
                     │   sec/op    │   sec/op     vs base                │   sec/op     vs base                │
StrcmpShortAligned     137.6µ ± 1%   113.8µ ± 0%  -17.35% (p=0.000 n=20)   110.0µ ± 0%  -20.11% (p=0.000 n=20)
StrcmpMidAligned       37.54µ ± 2%   38.93µ ± 0%   +3.69% (p=0.000 n=20)   35.85µ ± 1%   -4.50% (p=0.000 n=20)
StrcmpLongAligned      17.65µ ± 0%   14.67µ ± 0%  -16.89% (p=0.000 n=20)   15.09µ ± 2%  -14.51% (p=0.000 n=20)
StrcmpShortUnaligned   183.7µ ± 1%   125.2µ ± 0%  -31.83% (p=0.000 n=20)   122.5µ ± 0%  -33.32% (p=0.000 n=20)
StrcmpMidUnaligned     51.74µ ± 0%   38.69µ ± 2%  -25.23% (p=0.000 n=20)   41.98µ ± 1%  -18.86% (p=0.000 n=20)
StrcmpLongUnaligned    16.20µ ± 0%   16.12µ ± 0%   -0.50% (p=0.000 n=20)   16.51µ ± 0%   +1.89% (p=0.000 n=20)
StrcmpShortQsort       1.511m ± 0%   1.450m ± 0%   -4.05% (p=0.000 n=20)   1.412m ± 0%   -6.55% (p=0.000 n=20)
StrcmpMidQsort         354.1µ ± 0%   345.1µ ± 0%   -2.56% (p=0.000 n=20)   334.8µ ± 0%   -5.46% (p=0.000 n=20)
geomean                96.49µ        84.24µ       -12.69%                  83.60µ       -13.35%

                     │  strcmpARM   │              strcmpSIMD               │           strcmpSIMDupdated           │
                     │     B/s      │      B/s       vs base                │      B/s       vs base                │
StrcmpShortAligned     866.2Mi ± 1%   1048.0Mi ± 0%  +20.99% (p=0.000 n=20)   1084.2Mi ± 0%  +25.17% (p=0.000 n=20)
StrcmpMidAligned       3.101Gi ± 1%    2.991Gi ± 0%   -3.56% (p=0.000 n=20)    3.247Gi ± 1%   +4.71% (p=0.000 n=20)
StrcmpLongAligned      6.597Gi ± 0%    7.938Gi ± 0%  +20.33% (p=0.000 n=20)    7.717Gi ± 2%  +16.98% (p=0.000 n=20)
StrcmpShortUnaligned   649.0Mi ± 1%    952.1Mi ± 0%  +46.70% (p=0.000 n=20)    973.3Mi ± 0%  +49.96% (p=0.000 n=20)
StrcmpMidUnaligned     2.250Gi ± 0%    3.009Gi ± 2%  +33.74% (p=0.000 n=20)    2.773Gi ± 1%  +23.25% (p=0.000 n=20)
StrcmpLongUnaligned    7.186Gi ± 0%    7.222Gi ± 0%   +0.50% (p=0.000 n=20)    7.053Gi ± 0%   -1.85% (p=0.000 n=20)
StrcmpShortQsort       78.89Mi ± 0%    82.22Mi ± 0%   +4.22% (p=0.000 n=20)    84.42Mi ± 0%   +7.01% (p=0.000 n=20)
StrcmpMidQsort         336.6Mi ± 0%    345.5Mi ± 0%   +2.62% (p=0.000 n=20)    356.1Mi ± 0%   +5.77% (p=0.000 n=20)
geomean                1.207Gi         1.382Gi       +14.53%                   1.392Gi       +15.41%

os: FreeBSD
arch: arm64
cpu: ARM Neoverse-V1 r1p1
                     │  strcmpARM   │             strcmpSIMD              │          strcmpSIMDupdated          │
                     │    sec/op    │   sec/op     vs base                │   sec/op     vs base                │
StrcmpShortAligned      85.63µ ± 1%   72.88µ ± 1%  -14.88% (p=0.000 n=20)   67.86µ ± 1%  -20.75% (p=0.000 n=20)
StrcmpMidAligned        21.59µ ± 5%   17.51µ ± 2%  -18.88% (p=0.000 n=20)   16.60µ ± 2%  -23.10% (p=0.000 n=20)
StrcmpLongAligned      13.457µ ± 8%   8.315µ ± 3%  -38.21% (p=0.000 n=20)   8.809µ ± 4%  -34.54% (p=0.000 n=20)
StrcmpShortUnaligned   135.96µ ± 0%   85.68µ ± 0%  -36.98% (p=0.000 n=20)   78.51µ ± 0%  -42.26% (p=0.000 n=20)
StrcmpMidUnaligned      30.31µ ± 1%   18.64µ ± 1%  -38.49% (p=0.000 n=20)   17.37µ ± 1%  -42.70% (p=0.000 n=20)
StrcmpLongUnaligned    13.649µ ± 1%   8.442µ ± 2%  -38.15% (p=0.000 n=20)   8.480µ ± 5%  -37.87% (p=0.000 n=20)
StrcmpShortQsort       1173.0µ ± 0%   984.4µ ± 0%  -16.08% (p=0.000 n=20)   950.8µ ± 0%  -18.94% (p=0.000 n=20)
StrcmpMidQsort          263.1µ ± 0%   227.8µ ± 0%  -13.38% (p=0.000 n=20)   220.6µ ± 0%  -16.14% (p=0.000 n=20)
geomean                 67.51µ        48.79µ       -27.74%                  47.06µ       -30.29%

                     │  strcmpARM   │              strcmpSIMD               │           strcmpSIMDupdated           │
                     │     B/s      │      B/s       vs base                │      B/s       vs base                │
StrcmpShortAligned     1.360Gi ± 1%    1.597Gi ± 1%  +17.49% (p=0.000 n=20)    1.715Gi ± 1%  +26.18% (p=0.000 n=20)
StrcmpMidAligned       5.393Gi ± 5%    6.648Gi ± 2%  +23.28% (p=0.000 n=20)    7.013Gi ± 2%  +30.04% (p=0.000 n=20)
StrcmpLongAligned      8.651Gi ± 9%   14.001Gi ± 3%  +61.83% (p=0.000 n=20)   13.216Gi ± 4%  +52.77% (p=0.000 n=20)
StrcmpShortUnaligned   876.8Mi ± 0%   1391.3Mi ± 0%  +58.69% (p=0.000 n=20)   1518.5Mi ± 0%  +73.19% (p=0.000 n=20)
StrcmpMidUnaligned     3.841Gi ± 1%    6.245Gi ± 1%  +62.58% (p=0.000 n=20)    6.704Gi ± 1%  +74.52% (p=0.000 n=20)
StrcmpLongUnaligned    8.529Gi ± 1%   13.791Gi ± 2%  +61.69% (p=0.000 n=20)   13.729Gi ± 5%  +60.96% (p=0.000 n=20)
StrcmpShortQsort       101.6Mi ± 0%    121.1Mi ± 0%  +19.16% (p=0.000 n=20)    125.4Mi ± 0%  +23.37% (p=0.000 n=20)
StrcmpMidQsort         453.2Mi ± 0%    523.2Mi ± 0%  +15.45% (p=0.000 n=20)    540.4Mi ± 0%  +19.25% (p=0.000 n=20)
geomean                1.724Gi         2.386Gi       +38.39%                   2.474Gi       +43.46%
Test Plan

Passes all current unit tests

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

getz requested review of this revision.Jul 2 2024, 7:04 PM

Impressive performance!

I wonder how the performance would change if you unconditionally did the tbl maneuver to get the heads in place. As you could then interleave the dependency chains for the two strings and get rid of two conditional branches.

lib/libc/aarch64/string/strcmp.S
37–41

Permit more ILP on multi-issue in-order architectures.

54–55

This should be done once at an earlier point and the address you looked up should be used for both cases instead of reloading it.

57

Why do you reload q0 here? It should be more efficient to cmeq into a new register earlier so you keep the string head already loaded.

Some more microöptimisations. Some of them occur more than once in the code but I only marked some of them.

lib/libc/aarch64/string/strcmp.S
87–92
100–105
115–116

This may perform better as cmp + b.cc fuse on some microarchitectures.

187–188
190–191
206–207
356

The .size directive makes objdump output more readable.
Pad the table with 0xff so we never read past its end (this may otherwise crash if our rodata section ends up right at the end of the last mapped page; doing so also makes static checkers happy).

Remove branching for tbl maneuver.
Applies suggested microoptimizations for arithmetic.

getz marked 9 inline comments as done.Jul 9 2024, 4:26 PM
getz edited the summary of this revision. (Show Details)
getz planned changes to this revision.Jul 9 2024, 5:42 PM

Hit the submit button too early, removing the branches for the tbl maneuver didn't work as I had hoped.

  • Revert to branched handling of strings near the end of a page.
  • Use correct offset for the loop in the normal case

Assuming this passes the test suite now, I'd say LGTM.

  • Remove unnecessary mov, from strncmp review
In D45839#1047448, @fuz wrote:

Assuming this passes the test suite now, I'd say LGTM.

Passes all current tests, just need to finish my bounds test for str(n)cmp to be sure.

  • Revert last commit, it wasn't unnecessary :)

Use unsigned comparisons everywhere

follow __$FUNC convention

exp-run says it's fine.

This revision is now accepted and ready to land.Nov 6 2024, 2:26 PM