lib/libc/aarch64/string: add memcpy SIMD implementation
Closed, Public

Authored by getz on Aug 9 2024, 1:18 PM.

Details

Summary

I noticed that arm-optimized-routines under contrib/ already ships a
SIMD-optimized memcpy.

This patch makes libc on aarch64 use that SIMD variant instead of the
scalar-optimized variant.

The benchmarks below were generated with fuz's strperf utility.
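
For a quick local sanity check without strperf, a rough timing loop along the following lines may be enough to see whether the two variants differ on a given machine. This is only a sketch: it is not strperf, its numbers are not comparable to the tables in this review, and `time_memcpy`, `sink`, and the parameters are hypothetical names chosen for illustration. It measures CPU time with the standard `clock()` function.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Volatile sink so the compiler cannot discard the copies. */
static volatile unsigned char sink;

/*
 * Hypothetical helper: copy `len` bytes `iters` times and return the
 * average CPU time per copy in seconds, or a negative value on bad
 * arguments or allocation failure.
 */
static double
time_memcpy(size_t len, unsigned long iters)
{
	unsigned char *src, *dst;
	clock_t start, end;

	if (len == 0 || iters == 0)
		return (-1.0);

	src = malloc(len);
	dst = malloc(len);
	if (src == NULL || dst == NULL) {
		free(src);
		free(dst);
		return (-1.0);
	}

	/* Touch both buffers up front so page faults are not timed. */
	memset(src, 0x5A, len);
	memset(dst, 0, len);

	start = clock();
	for (unsigned long i = 0; i < iters; i++) {
		memcpy(dst, src, len);
		sink = dst[len - 1];	/* force the copy to be observed */
	}
	end = clock();

	free(src);
	free(dst);
	return ((double)(end - start) / CLOCKS_PER_SEC / (double)iters);
}
```

Calling, for example, `time_memcpy(4096, 1000000)` against a libc built with and without this patch gives a coarse per-copy time. A single fixed buffer stays hot in cache, so this only approximates the small-size rows.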

os: FreeBSD
arch: arm64
cpu: ARM Neoverse-V1 r1p1
        │ memcpyScalar │             memcpySIMD              │
        │    sec/op    │   sec/op     vs base                │
64         30.71µ ± 0%   22.47µ ± 1%  -26.83% (p=0.000 n=20)
4k         7.875µ ± 0%   4.069µ ± 0%  -48.33% (p=0.000 n=20)
256k       6.608µ ± 0%   5.126µ ± 0%  -22.43% (p=0.000 n=20)
16m        512.0µ ± 0%   503.0µ ± 0%   -1.75% (p=0.000 n=20)
1g         41.42m ± 0%   39.73m ± 0%   -4.08% (p=0.000 n=20)
geomean    127.7µ        98.70µ       -22.68%

        │ memcpyScalar │              memcpySIMD               │
        │     B/s      │      B/s       vs base                │
64        7.582Gi ± 0%   10.362Gi ± 1%  +36.68% (p=0.000 n=20)
4k        29.57Gi ± 0%    57.22Gi ± 0%  +93.55% (p=0.000 n=20)
256k      35.23Gi ± 0%    45.42Gi ± 0%  +28.91% (p=0.000 n=20)
16m       29.11Gi ± 0%    29.62Gi ± 0%   +1.78% (p=0.000 n=20)
1g        23.02Gi ± 0%    24.00Gi ± 0%   +4.26% (p=0.000 n=20)
geomean   22.12Gi         28.60Gi       +29.33%

os: FreeBSD
arch: arm64
cpu: ARM Cortex-A76 r4p1
        │ memcpyScalar │             memcpySIMD              │
        │    sec/op    │   sec/op     vs base                │
64         51.55µ ± 0%   46.25µ ± 0%  -10.29% (p=0.000 n=20)
4k         9.866µ ± 0%   7.253µ ± 0%  -26.48% (p=0.000 n=20)
256k       7.044µ ± 0%   7.793µ ± 0%  +10.64% (p=0.000 n=20)
16m        3.523m ± 6%   3.707m ± 5%        ~ (p=0.602 n=20)
1g         209.3m ± 1%   211.3m ± 1%   +0.93% (p=0.035 n=20)
geomean    305.1µ        289.9µ        -4.97%

        │ memcpyScalar │              memcpySIMD              │
        │     B/s      │     B/s       vs base                │
64        4.516Gi ± 0%   5.035Gi ± 0%  +11.48% (p=0.000 n=20)
4k        23.60Gi ± 0%   32.10Gi ± 0%  +36.02% (p=0.000 n=20)
256k      33.05Gi ± 0%   29.88Gi ± 0%   -9.62% (p=0.000 n=20)
16m       4.230Gi ± 5%   4.020Gi ± 5%        ~ (p=0.602 n=20)
1g        4.556Gi ± 1%   4.514Gi ± 1%   -0.92% (p=0.035 n=20)
geomean   9.255Gi        9.739Gi        +5.23%

os: FreeBSD
arch: arm64
cpu: ARM Cortex-A78C r0p0
        │ memcpyScalar │             memcpySIMD             │
        │    sec/op    │   sec/op     vs base               │
64         67.58µ ± 0%   64.87µ ± 0%  -4.00% (p=0.000 n=20)
4k         14.42µ ± 0%   14.43µ ± 0%       ~ (p=0.478 n=20)
256k       14.68µ ± 1%   14.76µ ± 1%       ~ (p=0.192 n=20)
16m        1.513m ± 1%   1.500m ± 1%       ~ (p=0.301 n=20)
1g         86.77m ± 2%   87.08m ± 1%       ~ (p=0.640 n=20)
geomean    284.9µ        282.7µ       -0.78%

        │ memcpyScalar │             memcpySIMD              │
        │     B/s      │     B/s       vs base               │
64        3.445Gi ± 0%   3.589Gi ± 0%  +4.17% (p=0.000 n=20)
4k        16.15Gi ± 0%   16.14Gi ± 0%       ~ (p=0.478 n=20)
256k      15.86Gi ± 1%   15.77Gi ± 1%       ~ (p=0.192 n=20)
16m       9.850Gi ± 1%   9.931Gi ± 1%       ~ (p=0.301 n=20)
1g        10.99Gi ± 2%   10.95Gi ± 1%       ~ (p=0.640 n=20)
geomean   9.909Gi        9.987Gi       +0.78%
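
The sec/op and B/s tables appear to describe the same runs from two angles, so a time reduction of d percent corresponds to a throughput gain of 1/(1 - d/100) - 1, which is why -48.33% in sec/op shows up as roughly +93.55% in B/s. The sketch below works through that arithmetic using values copied from the Neoverse-V1 tables. The helper names are hypothetical, and small differences from the printed percentages are expected because the table entries are rounded.

```c
/*
 * Percent change of `new_val` relative to `base`, as shown in the
 * "vs base" column.
 */
static double
pct_change(double base, double new_val)
{
	return ((new_val - base) / base * 100.0);
}

/*
 * Throughput change (percent) implied by a time change (percent),
 * assuming the number of bytes processed per operation is fixed.
 */
static double
throughput_change_from_time(double time_pct)
{
	return ((1.0 / (1.0 + time_pct / 100.0) - 1.0) * 100.0);
}
```

For the 64-byte row, `pct_change(30.71, 22.47)` gives about -26.83, and `throughput_change_from_time(-26.83)` gives about +36.67, in line with the +36.68% reported in the B/s table.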

Test Plan

No regressions were noticed in the test suite; all tests pass.
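
As a complement to the test suite, a brute-force check over small lengths and alignments can catch off-by-one and overrun bugs in a new memcpy. The following is a minimal sketch under stated assumptions, not part of the FreeBSD test suite: `check_memcpy` is a hypothetical helper, and it assumes the compiler emits a call to the libc memcpy for these variable-length copies.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: for every length up to `max_len` and every source
 * and destination offset below `max_off`, copy with memcpy and verify that
 * exactly the target range changed. Returns the number of failing cases,
 * or (size_t)-1 if allocation fails.
 */
static size_t
check_memcpy(size_t max_len, size_t max_off)
{
	size_t cap = max_len + max_off + 1;	/* leaves at least one guard byte */
	unsigned char *src = malloc(cap);
	unsigned char *dst = malloc(cap);
	size_t failures = 0;

	if (src == NULL || dst == NULL) {
		free(src);
		free(dst);
		return ((size_t)-1);
	}

	/* Fill the source with a simple position-dependent pattern. */
	for (size_t i = 0; i < cap; i++)
		src[i] = (unsigned char)(i * 31 + 7);

	for (size_t len = 0; len <= max_len; len++) {
		for (size_t so = 0; so < max_off; so++) {
			for (size_t doff = 0; doff < max_off; doff++) {
				int ok = 1;

				memset(dst, 0xA5, cap);
				memcpy(dst + doff, src + so, len);

				/* Inside the range: source bytes. Outside: untouched fill. */
				for (size_t i = 0; i < cap; i++) {
					unsigned char want = 0xA5;

					if (i >= doff && i < doff + len)
						want = src[so + (i - doff)];
					if (dst[i] != want)
						ok = 0;
				}
				if (!ok)
					failures++;
			}
		}
	}

	free(src);
	free(dst);
	return (failures);
}
```

A call such as `check_memcpy(128, 16)` should return 0 with a correct memcpy. The bounds are arbitrary and could be raised to cover whatever size thresholds the implementation branches on.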

Diff Detail

Repository
rG FreeBSD src repository