The scalar implementation is fairly straightforward and merely unrolled
four times. The baseline implementation closely follows D41971 with
appropriate extensions and extra code paths to pay attention to string
length.
The strcmp unit tests from D41970 have been copied and adapted to
strncmp. As always, the man page simd(7) is updated to account for the
new code.
Performance is quite good. We beat both glibc (except for very long
strings, but they likely use AVX which we don't) and Bionic (except for
medium-sized aligned strings, where we are still in the same ballpark).
Curiously enough, the fastest implementation for the qsort benchmark
(sorting random arrays of strings) is the old generic C implementation,
though the scalar and baseline implementations are only slightly slower
than it.
Please see benchmark results at the link or below:
os: FreeBSD arch: amd64 cpu: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz │ strncmp.pre.out │ strncmp.scalar.out │ strncmp.baseline.out │ strncmp.bionic.out │ │ sec/op │ sec/op vs base │ sec/op vs base │ sec/op vs base │ StrncmpShortAligned 137.51µ ± 0% 137.32µ ± 1% ~ (p=0.398 n=20) 40.22µ ± 0% -70.75% (p=0.000 n=20) 62.65µ ± 1% -54.44% (p=0.000 n=20) StrncmpMidAligned 88.42µ ± 0% 56.69µ ± 1% -35.89% (p=0.000 n=20) 13.13µ ± 0% -85.15% (p=0.000 n=20) 11.24µ ± 0% -87.29% (p=0.000 n=20) StrncmpLongAligned 71.648µ ± 0% 36.185µ ± 0% -49.50% (p=0.000 n=20) 4.753µ ± 0% -93.37% (p=0.000 n=20) 4.855µ ± 0% -93.22% (p=0.000 n=20) StrncmpShortUnaligned 135.76µ ± 0% 146.14µ ± 3% +7.65% (p=0.000 n=20) 42.91µ ± 0% -68.39% (p=0.000 n=20) 111.93µ ± 0% -17.55% (p=0.000 n=20) StrncmpMidUnaligned 88.90µ ± 0% 57.62µ ± 5% -35.19% (p=0.000 n=20) 13.22µ ± 0% -85.13% (p=0.000 n=20) 23.69µ ± 0% -73.35% (p=0.000 n=20) StrncmpLongUnaligned 71.640µ ± 0% 36.176µ ± 0% -49.50% (p=0.000 n=20) 4.758µ ± 0% -93.36% (p=0.000 n=20) 6.225µ ± 0% -91.31% (p=0.000 n=20) StrncmpShortQsort 901.1µ ± 0% 976.8µ ± 0% +8.40% (p=0.000 n=20) 1118.2µ ± 0% +24.10% (p=0.000 n=20) 1339.6µ ± 0% +48.66% (p=0.000 n=20) StrncmpMidQsort 203.5µ ± 0% 206.9µ ± 0% +1.63% (p=0.000 n=20) 220.1µ ± 0% +8.13% (p=0.000 n=20) 316.5µ ± 0% +55.48% (p=0.000 n=20) geomean 138.8µ 107.1µ -22.85% 33.71µ -75.72% 47.03µ -66.13% │ strncmp.pre.out │ strncmp.scalar.out │ strncmp.baseline.out │ strncmp.bionic.out │ │ B/s │ B/s vs base │ B/s vs base │ B/s vs base │ StrncmpShortAligned 866.9Mi ± 0% 868.1Mi ± 1% ~ (p=0.398 n=20) 2963.9Mi ± 0% +241.90% (p=0.000 n=20) 1902.8Mi ± 1% +119.49% (p=0.000 n=20) StrncmpMidAligned 1.317Gi ± 0% 2.054Gi ± 1% +55.98% (p=0.000 n=20) 8.867Gi ± 0% +573.46% (p=0.000 n=20) 10.360Gi ± 0% +686.86% (p=0.000 n=20) StrncmpLongAligned 1.625Gi ± 0% 3.217Gi ± 0% +98.00% (p=0.000 n=20) 24.491Gi ± 0% +1407.32% (p=0.000 n=20) 23.977Gi ± 0% +1375.68% (p=0.000 n=20) StrncmpShortUnaligned 878.1Mi ± 0% 815.7Mi ± 3% -7.10% (p=0.000 n=20) 2778.3Mi ± 0% +216.40% (p=0.000 n=20) 1065.0Mi ± 0% +21.29% (p=0.000 n=20) StrncmpMidUnaligned 1.309Gi ± 0% 2.020Gi ± 4% +54.29% (p=0.000 n=20) 8.804Gi ± 0% +572.35% (p=0.000 n=20) 4.914Gi ± 0% +275.26% (p=0.000 n=20) StrncmpLongUnaligned 1.625Gi ± 0% 3.218Gi ± 0% +98.03% (p=0.000 n=20) 24.466Gi ± 0% +1405.57% (p=0.000 n=20) 18.702Gi ± 0% +1050.89% (p=0.000 n=20) StrncmpShortQsort 132.29Mi ± 0% 122.05Mi ± 0% -7.74% (p=0.000 n=20) 106.61Mi ± 0% -19.42% (p=0.000 n=20) 88.99Mi ± 0% -32.73% (p=0.000 n=20) StrncmpMidQsort 585.7Mi ± 0% 576.3Mi ± 0% -1.60% (p=0.000 n=20) 541.6Mi ± 0% -7.52% (p=0.000 n=20) 376.7Mi ± 0% -35.68% (p=0.000 n=20) geomean 858.5Mi 1.087Gi +29.62% 3.453Gi +311.89% 2.476Gi +195.26% os: Linux arch: x86_64 cpu: │ strncmp.glibc.out │ │ sec/op │ StrncmpShortAligned 52.57µ ± 0% StrncmpMidAligned 17.31µ ± 0% StrncmpLongAligned 2.318µ ± 0% StrncmpShortUnaligned 55.26µ ± 0% StrncmpMidUnaligned 26.62µ ± 0% StrncmpLongUnaligned 2.323µ ± 0% StrncmpShortQsort 1.260m ± 0% StrncmpMidQsort 279.3µ ± 0% geomean 35.53µ │ strncmp.glibc.out │ │ B/s │ StrncmpShortAligned 2.214Gi ± 0% StrncmpMidAligned 6.725Gi ± 0% StrncmpLongAligned 50.21Gi ± 0% StrncmpShortUnaligned 2.107Gi ± 0% StrncmpMidUnaligned 4.373Gi ± 0% StrncmpLongUnaligned 50.11Gi ± 0% StrncmpShortQsort 94.64Mi ± 0% StrncmpMidQsort 426.8Mi ± 0% geomean 3.277Gi
Sponsored by: The FreeBSD Foundation
lib/libc/amd64/string/strncmp.S: fix whitespace issue