This changeset includes a port of the SIMD implementation of strncmp
for amd64 to AArch64.
It is based on D45839 with added handling for the limit.
An extended unit test for strncmp is currently being written to make
sure the bounds checks for page crossings work as expected.
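
For readers unfamiliar with the issue, the kind of condition such a test has to exercise can be sketched in C. This is only an illustration, not the code in this changeset: the 4 KiB page size, the 16-byte chunk width, and the helper names load_crosses_page and bytes_in_play are all assumptions made for the example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE_ASSUMED 4096u	/* assumption: 4 KiB pages */
#define CHUNK 16u		/* assumption: one 16-byte vector load */

/*
 * Hypothetical helper: true if a CHUNK-byte load starting at addr
 * would touch the following page, which may be unmapped.
 */
static bool
load_crosses_page(uintptr_t addr)
{
	return ((addr % PAGE_SIZE_ASSUMED) > PAGE_SIZE_ASSUMED - CHUNK);
}

/*
 * Hypothetical helper: number of bytes of the next chunk that may
 * still take part in the comparison, given the remaining limit.
 */
static size_t
bytes_in_play(size_t remaining)
{
	return (remaining < CHUNK ? remaining : CHUNK);
}
```

A load starting 16 bytes before the end of a page stays within that page, while one starting 15 bytes before the end does not, so a test would want to place strings at both offsets with limits on either side of the boundary.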
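Performance is significantly better than the existing implementation
from the Arm Optimized Routines repository: the geomean time per
operation drops by 29% to 39% on the three cores benchmarked below.
The only regression is StrncmpShortQsort on the Cortex-A76 (+1.72%).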
Benchmark results are, as usual, generated by the strperf utility
written by fuz.

os: FreeBSD
arch: arm64
cpu: ARM Neoverse-V1 r1p1
                        │  strncmpARM  │             strncmpSIMD              │
                        │    sec/op    │    sec/op     vs base                │
StrncmpShortAligned       97.06µ ± 2%    63.24µ ± 2%   -34.85% (p=0.000 n=20)
StrncmpMidAligned         28.59µ ± 1%    20.61µ ± 1%   -27.92% (p=0.000 n=20)
StrncmpLongAligned       18.363µ ± 2%    9.330µ ± 2%   -49.19% (p=0.000 n=20)
StrncmpShortUnaligned    148.56µ ± 1%    72.96µ ± 0%   -50.89% (p=0.000 n=20)
StrncmpMidUnaligned       43.26µ ± 1%    22.37µ ± 1%   -48.30% (p=0.000 n=20)
StrncmpLongUnaligned     25.327µ ± 0%    9.508µ ± 2%   -62.46% (p=0.000 n=20)
StrncmpShortQsort         1.374m ± 0%    1.339m ± 0%    -2.55% (p=0.000 n=20)
StrncmpMidQsort           306.9µ ± 0%    275.7µ ± 0%   -10.16% (p=0.000 n=20)
geomean                   87.70µ         53.75µ        -38.71%

                        │  strncmpARM  │             strncmpSIMD               │
                        │     B/s      │     B/s       vs base                 │
StrncmpShortAligned      1.199Gi ± 2%    1.841Gi ± 2%   +53.48% (p=0.000 n=20)
StrncmpMidAligned        4.072Gi ± 1%    5.649Gi ± 1%   +38.74% (p=0.000 n=20)
StrncmpLongAligned       6.339Gi ± 2%   12.480Gi ± 2%   +96.86% (p=0.000 n=20)
StrncmpShortUnaligned    802.4Mi ± 1%   1634.0Mi ± 0%  +103.63% (p=0.000 n=20)
StrncmpMidUnaligned      2.691Gi ± 1%    5.205Gi ± 1%   +93.43% (p=0.000 n=20)
StrncmpLongUnaligned     4.597Gi ± 0%   12.245Gi ± 2%  +166.38% (p=0.000 n=20)
StrncmpShortQsort        86.76Mi ± 0%    89.03Mi ± 0%    +2.62% (p=0.000 n=20)
StrncmpMidQsort          388.4Mi ± 0%    432.3Mi ± 0%   +11.31% (p=0.000 n=20)
geomean                  1.327Gi         2.166Gi        +63.17%

os: FreeBSD
arch: arm64
cpu: ARM Cortex-A76 r4p1
                        │  strncmpARM  │             strncmpSIMD              │
                        │    sec/op    │    sec/op     vs base                │
StrncmpShortAligned       144.8µ ± 0%    100.2µ ± 0%   -30.79% (p=0.000 n=20)
StrncmpMidAligned         49.89µ ± 0%    39.26µ ± 1%   -21.30% (p=0.000 n=20)
StrncmpLongAligned        23.29µ ± 0%    15.38µ ± 0%   -33.95% (p=0.000 n=20)
StrncmpShortUnaligned     195.3µ ± 0%    112.0µ ± 0%   -42.66% (p=0.000 n=20)
StrncmpMidUnaligned       68.18µ ± 1%    42.81µ ± 1%   -37.22% (p=0.000 n=20)
StrncmpLongUnaligned      33.44µ ± 0%    16.92µ ± 0%   -49.41% (p=0.000 n=20)
StrncmpShortQsort         1.770m ± 0%    1.801m ± 0%    +1.72% (p=0.000 n=20)
StrncmpMidQsort           413.5µ ± 0%    390.7µ ± 0%    -5.51% (p=0.000 n=20)
geomean                   123.7µ         87.56µ        -29.22%

                        │  strncmpARM  │             strncmpSIMD               │
                        │     B/s      │     B/s       vs base                 │
StrncmpShortAligned      823.5Mi ± 0%   1189.8Mi ± 0%   +44.48% (p=0.000 n=20)
StrncmpMidAligned        2.333Gi ± 0%    2.965Gi ± 1%   +27.06% (p=0.000 n=20)
StrncmpLongAligned       4.998Gi ± 0%    7.567Gi ± 0%   +51.40% (p=0.000 n=20)
StrncmpShortUnaligned    610.4Mi ± 0%   1064.4Mi ± 0%   +74.39% (p=0.000 n=20)
StrncmpMidUnaligned      1.707Gi ± 1%    2.720Gi ± 1%   +59.28% (p=0.000 n=20)
StrncmpLongUnaligned     3.481Gi ± 0%    6.881Gi ± 0%   +97.65% (p=0.000 n=20)
StrncmpShortQsort        67.34Mi ± 0%    66.20Mi ± 0%    -1.69% (p=0.000 n=20)
StrncmpMidQsort          288.3Mi ± 0%    305.1Mi ± 0%    +5.83% (p=0.000 n=20)
geomean                  963.7Mi         1.330Gi        +41.28%

os: FreeBSD
arch: arm64
cpu: ARM Cortex-A78C r0p0
                        │  strncmpARM  │             strncmpSIMD              │
                        │    sec/op    │    sec/op     vs base                │
StrncmpShortAligned       193.7µ ± 0%    135.5µ ± 0%   -30.08% (p=0.000 n=20)
StrncmpMidAligned         62.40µ ± 1%    51.60µ ± 1%   -17.31% (p=0.000 n=20)
StrncmpLongAligned        34.20µ ± 0%    22.82µ ± 0%   -33.28% (p=0.000 n=20)
StrncmpShortUnaligned     277.5µ ± 0%    153.4µ ± 0%   -44.71% (p=0.000 n=20)
StrncmpMidUnaligned       98.78µ ± 1%    58.79µ ± 0%   -40.48% (p=0.000 n=20)
StrncmpLongUnaligned      45.96µ ± 0%    22.84µ ± 0%   -50.31% (p=0.000 n=20)
StrncmpShortQsort         2.524m ± 0%    2.402m ± 0%    -4.83% (p=0.000 n=20)
StrncmpMidQsort           577.6µ ± 0%    504.4µ ± 0%   -12.67% (p=0.000 n=20)
geomean                   171.8µ         118.9µ        -30.83%

                        │  strncmpARM  │             strncmpSIMD               │
                        │     B/s      │     B/s       vs base                 │
StrncmpShortAligned      615.3Mi ± 0%    880.0Mi ± 0%   +43.01% (p=0.000 n=20)
StrncmpMidAligned        1.865Gi ± 1%    2.256Gi ± 1%   +20.93% (p=0.000 n=20)
StrncmpLongAligned       3.404Gi ± 0%    5.103Gi ± 0%   +49.89% (p=0.000 n=20)
StrncmpShortUnaligned    429.6Mi ± 0%    777.0Mi ± 0%   +80.88% (p=0.000 n=20)
StrncmpMidUnaligned      1.179Gi ± 1%    1.980Gi ± 0%   +68.02% (p=0.000 n=20)
StrncmpLongUnaligned     2.533Gi ± 0%    5.097Gi ± 0%  +101.25% (p=0.000 n=20)
StrncmpShortQsort        47.23Mi ± 0%    49.62Mi ± 0%    +5.07% (p=0.000 n=20)
StrncmpMidQsort          206.4Mi ± 0%    236.3Mi ± 0%   +14.50% (p=0.000 n=20)
geomean                  693.8Mi        1003.0Mi        +44.56%