I noticed that we have a SIMD-optimized memcpy in
arm-optimized-routines under /contrib.
This patch ensures we use the SIMD variant instead of the
scalar-optimized variant.
Benchmarks generated with fuz' strperf utility are below.

os: FreeBSD
arch: arm64
cpu: ARM Neoverse-V1 r1p1
        │ memcpyScalar │             memcpySIMD              │
        │    sec/op    │   sec/op     vs base                │
64         30.71µ ± 0%   22.47µ ± 1%  -26.83% (p=0.000 n=20)
4k         7.875µ ± 0%   4.069µ ± 0%  -48.33% (p=0.000 n=20)
256k       6.608µ ± 0%   5.126µ ± 0%  -22.43% (p=0.000 n=20)
16m        512.0µ ± 0%   503.0µ ± 0%   -1.75% (p=0.000 n=20)
1g         41.42m ± 0%   39.73m ± 0%   -4.08% (p=0.000 n=20)
geomean    127.7µ        98.70µ       -22.68%

        │ memcpyScalar  │              memcpySIMD               │
        │      B/s      │     B/s       vs base                 │
64         7.582Gi ± 0%   10.362Gi ± 1%  +36.68% (p=0.000 n=20)
4k         29.57Gi ± 0%    57.22Gi ± 0%  +93.55% (p=0.000 n=20)
256k       35.23Gi ± 0%    45.42Gi ± 0%  +28.91% (p=0.000 n=20)
16m        29.11Gi ± 0%    29.62Gi ± 0%   +1.78% (p=0.000 n=20)
1g         23.02Gi ± 0%    24.00Gi ± 0%   +4.26% (p=0.000 n=20)
geomean    22.12Gi         28.60Gi       +29.33%

os: FreeBSD
arch: arm64
cpu: ARM Cortex-A76 r4p1
        │ memcpyScalar │             memcpySIMD              │
        │    sec/op    │   sec/op     vs base                │
64         51.55µ ± 0%   46.25µ ± 0%  -10.29% (p=0.000 n=20)
4k         9.866µ ± 0%   7.253µ ± 0%  -26.48% (p=0.000 n=20)
256k       7.044µ ± 0%   7.793µ ± 0%  +10.64% (p=0.000 n=20)
16m        3.523m ± 6%   3.707m ± 5%        ~ (p=0.602 n=20)
1g         209.3m ± 1%   211.3m ± 1%   +0.93% (p=0.035 n=20)
geomean    305.1µ        289.9µ        -4.97%

        │ memcpyScalar  │              memcpySIMD               │
        │      B/s      │     B/s       vs base                 │
64         4.516Gi ± 0%    5.035Gi ± 0%  +11.48% (p=0.000 n=20)
4k         23.60Gi ± 0%    32.10Gi ± 0%  +36.02% (p=0.000 n=20)
256k       33.05Gi ± 0%    29.88Gi ± 0%   -9.62% (p=0.000 n=20)
16m        4.230Gi ± 5%    4.020Gi ± 5%        ~ (p=0.602 n=20)
1g         4.556Gi ± 1%    4.514Gi ± 1%   -0.92% (p=0.035 n=20)
geomean    9.255Gi         9.739Gi        +5.23%

os: FreeBSD
arch: arm64
cpu: ARM Cortex-A78C r0p0
        │ memcpyScalar │             memcpySIMD              │
        │    sec/op    │   sec/op     vs base                │
64         67.58µ ± 0%   64.87µ ± 0%   -4.00% (p=0.000 n=20)
4k         14.42µ ± 0%   14.43µ ± 0%        ~ (p=0.478 n=20)
256k       14.68µ ± 1%   14.76µ ± 1%        ~ (p=0.192 n=20)
16m        1.513m ± 1%   1.500m ± 1%        ~ (p=0.301 n=20)
1g         86.77m ± 2%   87.08m ± 1%        ~ (p=0.640 n=20)
geomean    284.9µ        282.7µ        -0.78%

        │ memcpyScalar  │              memcpySIMD               │
        │      B/s      │     B/s       vs base                 │
64         3.445Gi ± 0%    3.589Gi ± 0%   +4.17% (p=0.000 n=20)
4k         16.15Gi ± 0%    16.14Gi ± 0%        ~ (p=0.478 n=20)
256k       15.86Gi ± 1%    15.77Gi ± 1%        ~ (p=0.192 n=20)
16m        9.850Gi ± 1%    9.931Gi ± 1%        ~ (p=0.301 n=20)
1g         10.99Gi ± 2%    10.95Gi ± 1%        ~ (p=0.640 n=20)
geomean    9.909Gi         9.987Gi        +0.78%