This changeset includes a port of the amd64 SIMD implementation of
strlcpy to AArch64.
It is based on memccpy (D46170) with some minor differences.
Performance is significantly better than the scalar implementation,
with the largest gains on long strings.
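
For readers unfamiliar with how the two functions relate, here is a minimal scalar sketch in C of strlcpy expressed in terms of memccpy. It is illustrative only and is not the assembly in this changeset; the name strlcpy_via_memccpy is hypothetical. It shows the semantic gaps such a port has to bridge: the return value is a length rather than a pointer, the destination is always NUL-terminated on truncation, and the rest of the source must still be measured.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700	/* memccpy() is an XSI extension in POSIX */
#endif
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: strlcpy semantics built on top of memccpy.
 * Returns strlen(src); dst is NUL-terminated whenever dstsize > 0.
 */
static size_t
strlcpy_via_memccpy(char *restrict dst, const char *restrict src,
    size_t dstsize)
{
	char *end;

	/* With no room at all, only the source length is reported. */
	if (dstsize == 0)
		return (strlen(src));

	/* Copy at most dstsize bytes, stopping after the NUL is copied. */
	end = memccpy(dst, src, '\0', dstsize);
	if (end != NULL) {
		/* end points one past the copied NUL; exclude the NUL. */
		return ((size_t)(end - dst) - 1);
	}

	/* No NUL within dstsize bytes: truncate, then count the remainder. */
	dst[dstsize - 1] = '\0';
	return (dstsize + strlen(src + dstsize));
}
```

A caller detects truncation the usual way, by checking whether the returned length is greater than or equal to dstsize.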
Benchmark results are, as usual, generated by the strperf utility
written by fuz.

os: FreeBSD
arch: arm64
cpu: ARM Cortex-A76 r4p1
        │ strlcpyScalar │             strlcpySIMD             │
        │    sec/op     │   sec/op     vs base                │
Short      202.7µ ± 1%    167.3µ ± 0%  -17.48% (p=0.000 n=20)
Mid       121.67µ ± 1%    39.75µ ± 1%  -67.33% (p=0.000 n=20)
Long     109.359µ ± 0%    7.928µ ± 3%  -92.75% (p=0.000 n=20)
geomean    139.2µ         37.50µ       -73.06%

        │ strlcpyScalar │              strlcpySIMD               │
        │      B/s      │     B/s       vs base                  │
Short     588.1Mi ± 1%     712.6Mi ± 0%    +21.18% (p=0.000 n=20)
Mid       979.7Mi ± 1%    2998.8Mi ± 1%   +206.08% (p=0.000 n=20)
Long      1.065Gi ± 0%    14.684Gi ± 3%  +1279.42% (p=0.000 n=20)
geomean   856.4Mi          3.105Gi        +271.24%

os: FreeBSD
arch: arm64
cpu: ARM Neoverse-V1 r1p1
        │ strlcpyScalar │             strlcpySIMD             │
        │    sec/op     │   sec/op     vs base                │
Short      143.4µ ± 1%    138.9µ ± 1%   -3.17% (p=0.000 n=20)
Mid        66.48µ ± 0%    24.06µ ± 1%  -63.81% (p=0.000 n=20)
Long      70.863µ ± 0%    4.961µ ± 0%  -93.00% (p=0.000 n=20)
geomean    87.75µ         25.50µ       -70.94%

        │ strlcpyScalar │              strlcpySIMD               │
        │      B/s      │     B/s       vs base                  │
Short     831.2Mi ± 1%     858.5Mi ± 1%     +3.28% (p=0.000 n=20)
Mid       1.751Gi ± 0%     4.839Gi ± 1%   +176.32% (p=0.000 n=20)
Long      1.643Gi ± 0%    23.466Gi ± 0%  +1328.41% (p=0.000 n=20)
geomean   1.327Gi          4.566Gi        +244.17%