This changeset ports the amd64 SIMD implementation of memcmp to aarch64.
It also fixes an issue in the existing aarch64 implementation where the
return value did not conform to the man page: it returned only -1, 0, or 1
instead of the difference between the first mismatching bytes.
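For reference, the semantics this fix restores can be illustrated with a
scalar model of memcmp (a hypothetical sketch for clarity, not the assembly
in this change; the name memcmp_ref is made up): the return value is the
difference between the first pair of differing bytes, compared as unsigned
char, rather than a value normalized to -1, 0, or 1.

#include <stddef.h>

/*
 * Scalar reference model of the memcmp(3) contract: return the
 * difference between the first two differing bytes, treated as
 * unsigned char values, not a normalized -1/0/1.
 */
static int
memcmp_ref(const void *s1, const void *s2, size_t n)
{
	const unsigned char *p1 = s1, *p2 = s2;

	for (size_t i = 0; i < n; i++) {
		if (p1[i] != p2[i])
			return (p1[i] - p2[i]);
	}
	return (0);
}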
Performance is better than the existing memcmp implementation borrowed from
the Arm Optimized Routines, except for medium and long strings on the
Neoverse-V1; on the Cortex-A78C the new implementation is faster across all
sizes.
os: FreeBSD
arch: arm64
cpu: ARM Neoverse-V1 r1p1
             │  memcmpARM  │             memcmpSIMD              │
             │   sec/op    │   sec/op     vs base                 │
MemcmpShort    63.96µ ± 1%   32.41µ ± 0%  -49.33% (p=0.000 n=20)
MemcmpMid      12.09µ ± 1%   12.33µ ± 1%   +1.98% (p=0.000 n=20)
MemcmpLong     4.648µ ± 1%   4.942µ ± 1%   +6.32% (p=0.000 n=20)
geomean        15.32µ        12.55µ       -18.10%

             │  memcmpARM  │             memcmpSIMD              │
             │     B/s     │     B/s      vs base                 │
MemcmpShort    1.820Gi ± 1%   3.592Gi ± 0%  +97.35% (p=0.000 n=20)
MemcmpMid      9.629Gi ± 1%   9.442Gi ± 1%   -1.94% (p=0.000 n=20)
MemcmpLong     25.05Gi ± 1%   23.55Gi ± 1%   -5.96% (p=0.000 n=20)
geomean        7.600Gi        9.279Gi       +22.09%

os: FreeBSD
arch: arm64
cpu: ARM Cortex-A78C r0p0
             │  memcmpARM   │             memcmpSIMD              │
             │    sec/op    │   sec/op     vs base                 │
MemcmpShort    136.11µ ± 3%   69.69µ ± 0%  -48.80% (p=0.000 n=20)
MemcmpMid       34.16µ ± 1%   32.55µ ± 1%   -4.71% (p=0.000 n=20)
MemcmpLong      9.382µ ± 0%   8.972µ ± 1%   -4.37% (p=0.000 n=20)
geomean         35.20µ        27.30µ       -22.44%

             │  memcmpARM   │              memcmpSIMD              │
             │     B/s      │     B/s       vs base                 │
MemcmpShort    875.8Mi ± 3%   1710.6Mi ± 0%  +95.31% (p=0.000 n=20)
MemcmpMid      3.408Gi ± 1%    3.577Gi ± 1%   +4.94% (p=0.000 n=20)
MemcmpLong     12.41Gi ± 0%    12.98Gi ± 1%   +4.57% (p=0.000 n=20)
geomean        3.307Gi         4.264Gi       +28.93%