string: add memccpy SIMD implementation
Accepted · Public

Authored by getz on Jul 27 2024, 7:57 PM.

Details

Reviewers

fuz
emaste
andrew

Summary

This changeset ports the amd64 SIMD implementation of memccpy
to AArch64.
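
For readers unfamiliar with the function, the contract that both the scalar and SIMD versions must satisfy can be sketched in portable C. This is only a reference sketch, not the code in this revision; the name `memccpy_ref` is made up for illustration.

```c
#include <stddef.h>

/*
 * Scalar reference sketch of the memccpy contract (hypothetical helper,
 * not the FreeBSD implementation): copy bytes from src to dst until the
 * byte c has been copied or n bytes have been copied.
 */
void *
memccpy_ref(void *restrict dst, const void *restrict src, int c, size_t n)
{
	unsigned char *d = dst;
	const unsigned char *s = src;
	unsigned char ch = (unsigned char)c;

	for (size_t i = 0; i < n; i++) {
		d[i] = s[i];
		if (s[i] == ch)
			return (d + i + 1);	/* one past the copied c */
	}
	return (NULL);				/* c not found in the first n bytes */
}
```

A SIMD version has to return the same pointer and write exactly the same bytes, which is why the short-string path needs separate care.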

Performance is significantly better than the scalar implementation
for medium and long strings (roughly 2x and 14x to 17x faster in the
results below), but 4% to 27% slower for short strings.

As usual, the benchmark results were generated with the strperf
utility written by fuz.

os: FreeBSD
arch: arm64
cpu: ARM Cortex-A76 r4p1
        │ memccpyScalar │             memccpySIMD             │
        │    sec/op     │   sec/op     vs base                │
Short       136.7µ ± 1%   142.4µ ± 0%   +4.11% (p=0.000 n=20)
Mid         69.85µ ± 1%   30.63µ ± 1%  -56.15% (p=0.000 n=20)
Long      112.854µ ± 0%   7.898µ ± 1%  -93.00% (p=0.000 n=20)
geomean     102.5µ        32.53µ       -68.27%

        │ memccpyScalar │               memccpySIMD               │
        │      B/s      │      B/s       vs base                  │
Short      871.9Mi ± 1%    837.4Mi ± 0%     -3.95% (p=0.000 n=20)
Mid        1.667Gi ± 1%    3.801Gi ± 1%   +128.04% (p=0.000 n=20)
Long       1.032Gi ± 0%   14.740Gi ± 1%  +1328.86% (p=0.000 n=20)
geomean    1.135Gi         3.578Gi        +215.14%

os: FreeBSD
arch: arm64
cpu: ARM Neoverse-V1 r1p1
        │ memccpyScalar │             memccpySIMD              │
        │    sec/op     │    sec/op     vs base                │
Short       96.73µ ± 1%   122.82µ ± 1%  +26.98% (p=0.000 n=20)
Mid         48.50µ ± 0%    24.62µ ± 0%  -49.23% (p=0.000 n=20)
Long       84.122µ ± 1%    4.961µ ± 0%  -94.10% (p=0.000 n=20)
geomean     73.35µ         24.66µ       -66.37%

        │ memccpyScalar │               memccpySIMD               │
        │      B/s      │      B/s       vs base                  │
Short     1232.5Mi ± 1%    970.6Mi ± 1%    -21.25% (p=0.000 n=20)
Mid        2.400Gi ± 0%    4.728Gi ± 0%    +96.95% (p=0.000 n=20)
Long       1.384Gi ± 1%   23.466Gi ± 0%  +1595.65% (p=0.000 n=20)
geomean    1.587Gi         4.720Gi        +197.38%
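
The sec/op and B/s tables are two views of the same measurements, so a throughput change can be cross-checked from the timings alone. A minimal sketch of that arithmetic in C (the function names are made up for illustration and are not part of strperf):

```c
/* Relative change in time per operation, in percent (negative = faster). */
double
time_change_pct(double base_sec, double new_sec)
{
	return ((new_sec / base_sec - 1.0) * 100.0);
}

/*
 * Relative change in throughput, in percent, for the same amount of data.
 * Throughput is inversely proportional to time, so the ratio flips.
 */
double
throughput_change_pct(double base_sec, double new_sec)
{
	return ((base_sec / new_sec - 1.0) * 100.0);
}
```

With the Cortex-A76 Long row (112.854µ versus 7.898µ) this gives roughly -93% in time and roughly +1329% in throughput, in line with the reported figures; the small differences come from rounding in the tables.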

Test Plan

Passes all unit tests, including the extended memccpy test (D46051).

Diff Detail

Repository

rG FreeBSD src repository

Lint

Lint Passed

Unit

No Test Coverage

Build Status

Buildable 59005
Build 55892: arc lint + arc unit

Event Timeline

getz created this revision. Jul 27 2024, 7:57 PM

Herald added a reviewer: andrew. Jul 27 2024, 7:57 PM

Herald added a subscriber: imp.

getz requested review of this revision. Jul 27 2024, 7:57 PM

Harbormaster completed remote builds in B58870: Diff 141475. Jul 27 2024, 7:57 PM

Will rebase on D46052 to improve performance for the short case.

New method for handling short strings based on D46052

os: FreeBSD
arch: arm64
cpu: ARM Cortex-A76 r4p1
        │ memccpyScalar │             memccpySIMD             │
        │    sec/op     │   sec/op     vs base                │
Short       136.7µ ± 1%   141.8µ ± 0%   +3.74% (p=0.000 n=20)
Mid         69.85µ ± 1%   31.93µ ± 1%  -54.28% (p=0.000 n=20)
Long      112.854µ ± 0%   7.985µ ± 3%  -92.92% (p=0.000 n=20)
geomean     102.5µ        33.07µ       -67.74%

        │ memccpyScalar │               memccpySIMD               │
        │      B/s      │      B/s       vs base                  │
Short      871.9Mi ± 1%    840.5Mi ± 0%     -3.60% (p=0.000 n=20)
Mid        1.667Gi ± 1%    3.646Gi ± 1%   +118.72% (p=0.000 n=20)
Long       1.032Gi ± 0%   14.579Gi ± 3%  +1313.28% (p=0.000 n=20)
geomean    1.135Gi         3.520Gi        +210.02%

os: FreeBSD
arch: arm64
cpu: ARM Neoverse-V1 r1p1
        │ memccpyScalar │             memccpySIMD              │
        │    sec/op     │    sec/op     vs base                │
Short       96.73µ ± 1%   119.82µ ± 1%  +23.87% (p=0.000 n=20)
Mid         48.50µ ± 0%    24.17µ ± 0%  -50.15% (p=0.000 n=20)
Long       84.122µ ± 1%    4.960µ ± 0%  -94.10% (p=0.000 n=20)
geomean     73.35µ         24.31µ       -66.86%

        │ memccpyScalar │               memccpySIMD               │
        │      B/s      │      B/s       vs base                  │
Short     1232.5Mi ± 1%    994.9Mi ± 1%    -19.27% (p=0.000 n=20)
Mid        2.400Gi ± 0%    4.816Gi ± 0%   +100.61% (p=0.000 n=20)
Long       1.384Gi ± 1%   23.470Gi ± 0%  +1595.96% (p=0.000 n=20)
geomean    1.587Gi         4.789Gi        +201.71%

os: FreeBSD
arch: arm64
cpu: ARM Cortex-A78C r0p0
        │ memccpyScalar │             memccpySIMD             │
        │    sec/op     │   sec/op     vs base                │
Short       234.4µ ± 0%   197.0µ ± 0%  -15.95% (p=0.000 n=20)
Mid        115.03µ ± 1%   52.82µ ± 1%  -54.08% (p=0.000 n=20)
Long       178.23µ ± 0%   11.47µ ± 0%  -93.57% (p=0.000 n=20)
geomean     168.7µ        49.23µ       -70.83%

        │ memccpyScalar │               memccpySIMD                │
        │      B/s      │      B/s        vs base                  │
Short      508.6Mi ± 0%     605.1Mi ± 0%    +18.97% (p=0.000 n=20)
Mid        1.012Gi ± 1%     2.204Gi ± 1%   +117.76% (p=0.000 n=20)
Long       668.8Mi ± 0%   10397.0Mi ± 0%  +1454.50% (p=0.000 n=20)
geomean    706.4Mi          2.365Gi        +242.77%

Harbormaster completed remote builds in B58905: Diff 141613. Jul 31 2024, 7:56 PM

getz added inline comments. Jul 31 2024, 7:57 PM

lib/libc/aarch64/string/memccpy.S
27	I tried `ubfiz x12, x1, #2, #4` here but it degraded performance a tiny bit.
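
For reference, `ubfiz Xd, Xn, #lsb, #width` takes the low `width` bits of `Xn`, places them at bit position `lsb`, and zeroes every other bit of `Xd`. A small C model of that behaviour, assuming 64-bit registers (the helper name `ubfiz64` is made up for illustration):

```c
#include <stdint.h>

/*
 * C model of AArch64 "ubfiz Xd, Xn, #lsb, #width" for 64-bit registers.
 * Assumes 1 <= width and lsb + width <= 64, as the instruction requires.
 */
static inline uint64_t
ubfiz64(uint64_t xn, unsigned lsb, unsigned width)
{
	uint64_t mask = (width == 64) ? ~UINT64_C(0) :
	    ((UINT64_C(1) << width) - 1);

	return ((xn & mask) << lsb);
}
```

So `ubfiz x12, x1, #2, #4` corresponds to `x12 = (x1 & 0xf) << 2`, a mask and a shift in a single instruction.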

A general comment: shift instructions take the shift amount modulo the data size. So if the shift amount is off by a multiple of the data size, you don't need to go and fix that up. This could save a few additions and subtractions around the code.
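
A sketch of that point in C, modelling the AArch64 register-controlled shifts on 64-bit registers. The explicit `& 63` stands in for what the hardware does implicitly; plain C shifts by 64 or more are undefined, so the mask is required here. The helper names are made up for illustration.

```c
#include <stdint.h>

/* Model of "lsl Xd, Xn, Xm": only the low 6 bits of the amount are used. */
static inline uint64_t
lslv64(uint64_t x, uint64_t amount)
{
	return (x << (amount & 63));
}

/* Model of "lsr Xd, Xn, Xm": same modulo-64 treatment of the amount. */
static inline uint64_t
lsrv64(uint64_t x, uint64_t amount)
{
	return (x >> (amount & 63));
}
```

Under this model, shifting by `n`, `n + 64`, or `n - 64` gives the same result, so an addition or subtraction whose only purpose is to bring the amount back into the range 0 to 63 can be dropped.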

Code looks OK. Looking forward to the acceptance test.

lib/libc/aarch64/string/memccpy.S
38	Comment outdated? You no longer induce a match prior to this.
80–82	It's nicer stylistically to have the label on the same line as the instruction it labels if the label fits there. There are some other places with the same issue.
163–166

Accepted pending final acceptance test.

This revision is now accepted and ready to land. Aug 7 2024, 10:51 AM

getz mentioned this in D46243: lib/libc/aarch64/string: add strlcpy SIMD implementation. Aug 8 2024, 1:54 PM

getz marked 2 inline comments as done. Aug 13 2024, 5:38 PM

getz added inline comments.

lib/libc/aarch64/string/memccpy.S
80–82	I prefer it the way I have it written and I have been doing it this way for all my previous functions. But I'm not opposed to changing it if that's the accepted style.
163–166	I tried this and performance regressed a bit.

Update based on review.

Slightly improved performance

os: FreeBSD
arch: arm64
cpu: ARM Cortex-A76 r4p1
        │ memccpyScalar │             memccpySIMD             │
        │    sec/op     │   sec/op     vs base                │
Short       136.7µ ± 1%   139.7µ ± 0%   +2.18% (p=0.000 n=20)
Mid         69.85µ ± 1%   32.87µ ± 0%  -52.94% (p=0.000 n=20)
Long      112.854µ ± 0%   7.884µ ± 2%  -93.01% (p=0.000 n=20)
geomean     102.5µ        33.08µ       -67.73%

        │ memccpyScalar │               memccpySIMD               │
        │      B/s      │      B/s       vs base                  │
Short      871.9Mi ± 1%    853.3Mi ± 0%     -2.13% (p=0.000 n=20)
Mid        1.667Gi ± 1%    3.542Gi ± 0%   +112.50% (p=0.000 n=20)
Long       1.032Gi ± 0%   14.765Gi ± 2%  +1331.36% (p=0.000 n=20)
geomean    1.135Gi         3.519Gi        +209.92%

os: FreeBSD
arch: arm64
cpu: ARM Neoverse-V1 r1p1
        │ memccpyScalar │              memccpySIMD6               │
        │    sec/op     │    sec/op     vs base                   │
Short       96.79µ ± 1%   119.83µ ± 1%  +23.80% (p=0.000 n=40+20)
Mid         48.44µ ± 0%    24.36µ ± 0%  -49.71% (p=0.000 n=40+20)
Long       83.154µ ± 1%    4.964µ ± 0%  -94.03% (p=0.000 n=40+20)
geomean     73.06µ         24.38µ       -66.63%

        │ memccpyScalar │               memccpySIMD6                │
        │     MiB/s     │    MiB/s      vs base                     │
Short       1.291k ± 1%    1.043k ± 1%    -19.23% (p=0.000 n=40+20)
Mid         2.580k ± 0%    5.131k ± 0%    +98.85% (p=0.000 n=40+20)
Long        1.503k ± 1%   25.181k ± 0%  +1575.14% (p=0.000 n=40+20)
geomean     1.711k         5.127k        +199.65%

This revision now requires review to proceed. Aug 13 2024, 5:39 PM

Harbormaster completed remote builds in B59005: Diff 142054. Aug 13 2024, 5:39 PM

getz mentioned this in D46292: lib/libc/aarch64/string: add strncat SIMD implementation. Aug 14 2024, 2:12 PM

getz added a child revision: D46292: lib/libc/aarch64/string: add strncat SIMD implementation. Aug 14 2024, 2:12 PM

getz removed a child revision: D46292: lib/libc/aarch64/string: add strncat SIMD implementation. Aug 25 2024, 6:42 PM

exp-run says it's fine.

This revision is now accepted and ready to land. Wed, Nov 6, 2:25 PM