Details

Reviewers

fuz
emaste
andrew

Summary

This changeset includes a port of the SIMD implementation of memcmp
for amd64 to Aarch64.

It also solves an issue with the existing implementation for
Aarch64 where the return value is not in accordance with the
man page and only returns -1,0 or 1 instead of the byte difference.

Performance is better than the existing memcmp implementation
borrowed from the Arm Optimized Routines except for long strings.

os: FreeBSD
arch: arm64
cpu: ARM Neoverse-V1 r1p1
            │  memcmpARM  │             memcmpSIMD              │
            │   sec/op    │   sec/op     vs base                │
MemcmpShort   63.96µ ± 1%   32.41µ ± 0%  -49.33% (p=0.000 n=20)
MemcmpMid     12.09µ ± 1%   12.33µ ± 1%   +1.98% (p=0.000 n=20)
MemcmpLong    4.648µ ± 1%   4.942µ ± 1%   +6.32% (p=0.000 n=20)
geomean       15.32µ        12.55µ       -18.10%

            │  memcmpARM   │              memcmpSIMD              │
            │     B/s      │     B/s       vs base                │
MemcmpShort   1.820Gi ± 1%   3.592Gi ± 0%  +97.35% (p=0.000 n=20)
MemcmpMid     9.629Gi ± 1%   9.442Gi ± 1%   -1.94% (p=0.000 n=20)
MemcmpLong    25.05Gi ± 1%   23.55Gi ± 1%   -5.96% (p=0.000 n=20)
geomean       7.600Gi        9.279Gi       +22.09%

os: FreeBSD
arch: arm64
cpu: ARM Cortex-A78C r0p0
            │  memcmpARM   │             memcmpSIMD              │
            │    sec/op    │   sec/op     vs base                │
MemcmpShort   136.11µ ± 3%   69.69µ ± 0%  -48.80% (p=0.000 n=20)
MemcmpMid      34.16µ ± 1%   32.55µ ± 1%   -4.71% (p=0.000 n=20)
MemcmpLong     9.382µ ± 0%   8.972µ ± 1%   -4.37% (p=0.000 n=20)
geomean        35.20µ        27.30µ       -22.44%

            │  memcmpARM   │              memcmpSIMD               │
            │     B/s      │      B/s       vs base                │
MemcmpShort   875.8Mi ± 3%   1710.6Mi ± 0%  +95.31% (p=0.000 n=20)
MemcmpMid     3.408Gi ± 1%    3.577Gi ± 1%   +4.94% (p=0.000 n=20)
MemcmpLong    12.41Gi ± 0%    12.98Gi ± 1%   +4.57% (p=0.000 n=20)
geomean       3.307Gi         4.264Gi       +28.93%

Test Plan

Passes the tests in the test suite

Diff Detail

Repository

rG FreeBSD src repository

Lint

Lint Passed

Unit

No Test Coverage

Build Status

Buildable 58910
Build 55797: arc lint + arc unit

Event Timeline

getz created this revision.Jun 17 2024, 3:26 PM

Herald added a reviewer: andrew. · View Herald TranscriptJun 17 2024, 3:26 PM

Herald added a subscriber: imp. · View Herald Transcript

getz requested review of this revision.Jun 17 2024, 3:26 PM

Harbormaster completed remote builds in B58207: Diff 139930.Jun 17 2024, 3:26 PM

getz added inline comments.Jun 17 2024, 3:31 PM

lib/libc/aarch64/string/memcmp.S
32	My idea here is that if a buffer is placed just at a boundary, jump, then edit the pointer to load -32 bytes from the end and shift the vector register using TBL a dynamic amount to get everything in the right position and compare as usual.
91	It might be worth messing with the loads before to make this aligned, otherwise this could also cross into an unmapped page I believe.

getz edited the summary of this revision. (Show Details)Jun 17 2024, 3:44 PM

getz added inline comments.Jun 17 2024, 3:46 PM

lib/libc/aarch64/string/memcmp.S
96	I'm using UMAXP here instead of NOT'ing the output of CMEQ and doing a FCMP with #0.0 as this had better performance in general.

Can you test this with the two buffers at the end of a page and the next page is unmapped. I'm seeing segmentation faults when I test it.

I'm working on solving that right now, I mentioned it in the description.
The problem is that a buffer less than 32 bytes can be placed at the end of a page causing an overread.
I'll update it later today. I was mostly looking for some feedback at first.

Avoid overreads in main loop

We re-compare some of the bytes we already compared to avoid crossing
over into an unmapped page.

Still lacking handling for the "short input" case that can still cross
over into unmapped pages.

Harbormaster completed remote builds in B58219: Diff 139953.Jun 18 2024, 1:45 PM

getz marked an inline comment as not done.Jun 18 2024, 1:46 PM

Some initial comments.

Stylistic notes: a space goes after the commas separating instruction operands.

lib/libc/aarch64/string/memcmp.S
36	Seems like the comment is the wrong way round?
52–55	Rule of thumb: try to interleave independent instructions so multi-issue in order CPUs can execute them in parallel.
60–71	This should be done in a branch-free manner.
104–108	This code always sets `x2` to `32`. We can treat the limit as constant from here on. The following two subtractions set `x8` to `x8 + x2 - 32` and `x9` to `x9 + x2 - 32`. You then add `32` to that for the load. Why all this complication? Why not just add and then load?
123	This too should be all branch free.

Address comments

interleaves independent instructions for paralellism.
finds the offset in a branch free manner after a match.
simplifies the code for avoiding overreads in the main loop

Harbormaster completed remote builds in B58232: Diff 139979.Jun 19 2024, 1:08 PM

Avoid page crossing into unmapped pages

Passes all tests, need to extend memcmp tests to verify correct
behaviour for small buffers at page boundaries.

Harbormaster completed remote builds in B58296: Diff 140093.Jun 21 2024, 7:33 PM

getz edited the summary of this revision. (Show Details)Jun 21 2024, 7:53 PM

getz edited the test plan for this revision. (Show Details)

getz edited the summary of this revision. (Show Details)Jun 24 2024, 10:07 AM

getz edited the summary of this revision. (Show Details)Jun 24 2024, 3:44 PM

fuz added inline comments.Jul 8 2024, 10:50 AM

lib/libc/aarch64/string/memcmp.S
46	Here too try to load the shift table address only once.
187–188	This table needs to be padded and `.size` added in much the same way as I recommend for `strcmp`.

Pad .rodata with -1 and use unsigned comparisons

Harbormaster completed remote builds in B58910: Diff 141629.Aug 1 2024, 10:05 AM

Use correct style for SPDX identifier

Harbormaster completed remote builds in B58911: Diff 141630.Aug 1 2024, 10:07 AM

follow __$FUNC convention

Harbormaster completed remote builds in B59016: Diff 142075.Aug 14 2024, 12:11 PM

Follow __$FUNC convention
ifdef's to simulate bcmp

Harbormaster completed remote builds in B59017: Diff 142078.Aug 14 2024, 1:47 PM

Use unsigned comparisons for main loop

Also unrolls the main loop which resulted in a modest performance
increase:

os: FreeBSD
arch: arm64
cpu: ARM Neoverse-V1 r1p1
            │  memcmpARM   │             memcmpSIMD              │
            │    sec/op    │   sec/op     vs base                │
MemcmpShort    64.55µ ± 1%   32.55µ ± 1%  -49.57% (p=0.000 n=20)
MemcmpMid      12.47µ ± 1%   12.36µ ± 1%        ~ (p=0.121 n=20)
MemcmpLong     4.812µ ± 2%   4.930µ ± 0%   +2.46% (p=0.000 n=20)
BcmpShort     130.97µ ± 1%   32.24µ ± 1%  -75.38% (p=0.000 n=20)
BcmpMid        75.37µ ± 2%   12.42µ ± 1%  -83.52% (p=0.000 n=20)
BcmpLong      67.378µ ± 4%   4.926µ ± 0%  -92.69% (p=0.000 n=20)
geomean        37.02µ        12.55µ       -66.10%

            │  memcmpARM  │               memcmpSIMD               │
            │    MiB/s    │    MiB/s      vs base                  │
MemcmpShort   1.936k ± 1%    3.840k ± 1%    +98.29% (p=0.000 n=20)
MemcmpMid     10.02k ± 1%    10.12k ± 1%          ~ (p=0.121 n=20)
MemcmpLong    25.98k ± 2%    25.36k ± 0%     -2.40% (p=0.000 n=20)
BcmpShort      954.4 ± 1%    3877.4 ± 1%   +306.25% (p=0.000 n=20)
BcmpMid       1.659k ± 2%   10.066k ± 1%   +506.85% (p=0.000 n=20)
BcmpLong      1.855k ± 4%   25.377k ± 0%  +1267.86% (p=0.000 n=20)
geomean       3.376k         9.959k        +194.96%

os: FreeBSD
arch: arm64
cpu: ARM Cortex-A78C r0p0
            │   memcmpARM   │             memcmpSIMD              │
            │    sec/op     │   sec/op     vs base                │
MemcmpShort    133.92µ ± 0%   70.15µ ± 0%  -47.62% (p=0.000 n=20)
MemcmpMid       31.63µ ± 1%   31.82µ ± 1%   +0.59% (p=0.006 n=20)
MemcmpLong      9.384µ ± 0%   9.058µ ± 1%   -3.48% (p=0.000 n=20)
BcmpShort      247.46µ ± 2%   70.17µ ± 0%  -71.64% (p=0.000 n=20)
BcmpMid        148.94µ ± 1%   31.93µ ± 1%  -78.56% (p=0.000 n=20)
BcmpLong      126.093µ ± 0%   8.949µ ± 2%  -92.90% (p=0.000 n=20)
geomean         75.47µ        27.20µ       -63.95%

            │  memcmpARM   │                memcmpSIMD                │
            │     B/s      │      B/s        vs base                  │
MemcmpShort   890.1Mi ± 0%    1699.3Mi ± 0%    +90.91% (p=0.000 n=20)
MemcmpMid     3.680Gi ± 1%     3.659Gi ± 1%     -0.59% (p=0.006 n=20)
MemcmpLong    12.41Gi ± 0%     12.85Gi ± 1%     +3.60% (p=0.000 n=20)
BcmpShort     481.9Mi ± 2%    1698.9Mi ± 0%   +252.56% (p=0.000 n=20)
BcmpMid       800.4Mi ± 1%    3733.5Mi ± 1%   +366.47% (p=0.000 n=20)
BcmpLong      945.4Mi ± 0%   13320.9Mi ± 2%  +1309.01% (p=0.000 n=20)
geomean       1.543Gi          4.279Gi        +177.40%

Harbormaster completed remote builds in B59174: Diff 142419.Aug 25 2024, 12:41 PM

exp-run says it's fine.

This revision is now accepted and ready to land.Nov 6 2024, 2:26 PM

Have you tried any other benchmarks? I tried testing against the glibc memcmp using their benchmarking tool as should be the same code as on FreeBSD and found this version was slower in most cases.

Have you tried any other benchmarks? I tried testing against the glibc memcmp using their benchmarking tool as should be the same code as on FreeBSD and found this version was slower in most cases.

What CPU model did you test on? I am not sure exactly how the glibc benchmarks work and don't run Linux, so it's hard to say for me what the cause of this is.
If you run Linux on some ARM box, check if you can run our strperf benchmark there and see what results you get: https://github.com/clausecker/strperf.

I was running on a Neoverse-N2 CPU. I ran with your benchmark and received mixed results, some faster some slower.

Will hold this one pending the rework @getz has promised.

lib/libc/aarch64/string: add memcmp SIMD implementation
AcceptedPublic
Actions

Details

Diff Detail

Event Timeline

Revision Contents
Changeset List

Diff 141629

lib/libc/aarch64/string/Makefile.inc

lib/libc/aarch64/string/memcmp.S

lib/libc/aarch64/string: add memcmp SIMD implementationAcceptedPublicActions

Details

Diff Detail

Event Timeline

Revision ContentsChangeset List

Diff 141629

lib/libc/aarch64/string/Makefile.inc

lib/libc/aarch64/string/memcmp.S

lib/libc/aarch64/string: add memcmp SIMD implementation
AcceptedPublic
Actions

Revision Contents
Changeset List