This changeset adds both a scalar and an x86-64-v2 implementation
of the strcspn(3) function to libc. A baseline implementation does not
appear to be feasible given the requirements of the function.
The scalar implementation is similar to the generic libc implementation,
but expands the bit set into a byte set to reduce latency, improving
performance. This approach could probably be backported to the generic
C version to benefit other platforms.
The x86-64-v2 implementation is built around the infamous pcmpistri
instruction. An alternative implementation
based on the Muła/Langdale algorithm
was prototyped, but performed worse than the pcmpistri approach except
for sets of more than 16 characters with long input strings.
All implementations provide special cases for the empty set (reduces to
strlen as well as single-character sets (reduces to strchr). The
x86-64-v2 kernel falls back to the scalar implementation for sets of more
than 32 characters. This limit could be raised by additional multiples
of 16 through the use of additional pcmpistri code paths, but I consider
this case to be too rare to be of importance.
We currently use the NetBSD test suite to cover this function. It only
contains a very rudimentary test of this function. I have therefore
written an all new set of unit tests for the FreeBSD test suite covering
many edge cases relating to alignment issues.
The new implementations are documented in simd(7) as usual.
Performance is pretty good, beating glibc except for long strings on
size 0/1 sets (the special cases), where glibc likely uses an AVX or
AVX-512 optimised strlen/strchr. The test cases are short/mid/long
as usual, with the trailing number indicating the size of the set.
os: FreeBSD arch: amd64 cpu: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz │ strcspn.x86-64-v2.out │ strcspn.pre.out │ strcspn.scalar.out │ strcspn.mula.out │ │ sec/op │ sec/op vs base │ sec/op vs base │ sec/op vs base │ Short0 51.98µ ± 1% 141.85µ ± 0% +172.89% (p=0.000 n=20) 75.92µ ± 0% +46.05% (p=0.000 n=20) 52.12µ ± 1% ~ (p=0.925 n=20) Mid0 11.10µ ± 0% 78.30µ ± 0% +605.49% (p=0.000 n=20) 20.81µ ± 1% +87.50% (p=0.000 n=20) 11.11µ ± 0% ~ (p=0.718 n=20) Long0 2.227µ ± 0% 48.457µ ± 0% +2075.56% (p=0.000 n=20) 7.944µ ± 0% +256.66% (p=0.000 n=20) 2.228µ ± 0% ~ (p=0.355 n=20) Short1 63.47µ ± 0% 143.43µ ± 0% +125.97% (p=0.000 n=20) 73.99µ ± 0% +16.57% (p=0.000 n=20) 62.53µ ± 0% -1.49% (p=0.000 n=20) Mid1 13.17µ ± 0% 80.17µ ± 0% +508.70% (p=0.000 n=20) 20.63µ ± 0% +56.63% (p=0.000 n=20) 13.21µ ± 0% +0.32% (p=0.000 n=20) Long1 3.295µ ± 0% 48.440µ ± 0% +1369.95% (p=0.000 n=20) 11.034µ ± 0% +234.84% (p=0.000 n=20) 3.293µ ± 0% ~ (p=0.231 n=20) Short5 65.70µ ± 0% 165.04µ ± 0% +151.20% (p=0.000 n=20) 121.95µ ± 0% +85.61% (p=0.000 n=20) 139.56µ ± 1% +112.42% (p=0.000 n=20) Mid5 14.52µ ± 0% 85.18µ ± 0% +486.50% (p=0.000 n=20) 54.45µ ± 0% +274.95% (p=0.000 n=20) 29.62µ ± 1% +103.98% (p=0.000 n=20) Long5 6.029µ ± 0% 50.073µ ± 2% +730.53% (p=0.000 n=20) 35.285µ ± 0% +485.24% (p=0.000 n=20) 7.032µ ± 0% +16.64% (p=0.000 n=20) Short20 66.58µ ± 0% 319.29µ ± 0% +379.52% (p=0.000 n=20) 155.40µ ± 0% +133.39% (p=0.000 n=20) 240.59µ ± 0% +261.33% (p=0.000 n=20) Mid20 19.94µ ± 0% 131.60µ ± 0% +559.84% (p=0.000 n=20) 65.12µ ± 0% +226.51% (p=0.000 n=20) 67.42µ ± 0% +238.06% (p=0.000 n=20) Long20 12.039µ ± 0% 63.676µ ± 0% +428.91% (p=0.000 n=20) 35.195µ ± 0% +192.34% (p=0.000 n=20) 6.801µ ± 0% -43.51% (p=0.000 n=20) Short40 178.3µ ± 2% 565.0µ ± 0% +216.80% (p=0.000 n=20) 196.2µ ± 1% +10.02% (p=0.000 n=20) 409.9µ ± 0% +129.83% (p=0.000 n=20) Mid40 72.20µ ± 1% 198.46µ ± 0% +174.87% (p=0.000 n=20) 74.61µ ± 0% +3.33% (p=0.000 n=20) 115.74µ ± 0% +60.30% (p=0.000 n=20) Long40 35.186µ ± 0% 53.497µ ± 0% +52.04% (p=0.000 n=20) 35.302µ ± 0% +0.33% (p=0.000 n=20) 7.066µ ± 0% -79.92% (p=0.000 n=20) geomean 22.11µ 108.5µ +390.60% 46.13µ +108.64% 27.44µ +24.12% │ strcspn.x86-64-v2.out │ strcspn.pre.out │ strcspn.scalar.out │ strcspn.mula.out │ │ B/s │ B/s vs base │ B/s vs base │ B/s vs base │ Short0 2293.3Mi ± 1% 840.4Mi ± 0% -63.36% (p=0.000 n=20) 1570.2Mi ± 0% -31.53% (p=0.000 n=20) 2287.2Mi ± 1% ~ (p=0.925 n=20) Mid0 10.489Gi ± 0% 1.487Gi ± 0% -85.83% (p=0.000 n=20) 5.594Gi ± 1% -46.67% (p=0.000 n=20) 10.481Gi ± 0% ~ (p=0.718 n=20) Long0 52.266Gi ± 0% 2.402Gi ± 0% -95.40% (p=0.000 n=20) 14.654Gi ± 0% -71.96% (p=0.000 n=20) 52.260Gi ± 0% ~ (p=0.355 n=20) Short1 1878.1Mi ± 0% 831.1Mi ± 0% -55.75% (p=0.000 n=20) 1611.2Mi ± 0% -14.21% (p=0.000 n=20) 1906.5Mi ± 0% +1.51% (p=0.000 n=20) Mid1 8.839Gi ± 0% 1.452Gi ± 0% -83.57% (p=0.000 n=20) 5.643Gi ± 0% -36.16% (p=0.000 n=20) 8.811Gi ± 0% -0.31% (p=0.000 n=20) Long1 35.327Gi ± 0% 2.403Gi ± 0% -93.20% (p=0.000 n=20) 10.551Gi ± 0% -70.13% (p=0.000 n=20) 35.351Gi ± 0% ~ (p=0.231 n=20) Short5 1814.4Mi ± 0% 722.3Mi ± 0% -60.19% (p=0.000 n=20) 977.5Mi ± 0% -46.12% (p=0.000 n=20) 854.2Mi ± 1% -52.92% (p=0.000 n=20) Mid5 8.016Gi ± 0% 1.367Gi ± 0% -82.95% (p=0.000 n=20) 2.138Gi ± 0% -73.33% (p=0.000 n=20) 3.930Gi ± 1% -50.98% (p=0.000 n=20) Long5 19.309Gi ± 0% 2.325Gi ± 2% -87.96% (p=0.000 n=20) 3.299Gi ± 0% -82.91% (p=0.000 n=20) 16.555Gi ± 0% -14.26% (p=0.000 n=20) Short20 1790.3Mi ± 0% 373.4Mi ± 0% -79.15% (p=0.000 n=20) 767.1Mi ± 0% -57.15% (p=0.000 n=20) 495.5Mi ± 0% -72.32% (p=0.000 n=20) Mid20 5977.2Mi ± 0% 905.9Mi ± 0% -84.84% (p=0.000 n=20) 1830.6Mi ± 0% -69.37% (p=0.000 n=20) 1768.1Mi ± 0% -70.42% (p=0.000 n=20) Long20 9.670Gi ± 0% 1.828Gi ± 0% -81.09% (p=0.000 n=20) 3.308Gi ± 0% -65.79% (p=0.000 n=20) 17.117Gi ± 0% +77.01% (p=0.000 n=20) Short40 668.5Mi ± 2% 211.0Mi ± 0% -68.43% (p=0.000 n=20) 607.6Mi ± 1% -9.11% (p=0.000 n=20) 290.9Mi ± 0% -56.49% (p=0.000 n=20) Mid40 1651.1Mi ± 1% 600.7Mi ± 0% -63.62% (p=0.000 n=20) 1597.8Mi ± 0% -3.23% (p=0.000 n=20) 1030.0Mi ± 0% -37.62% (p=0.000 n=20) Long40 3.309Gi ± 0% 2.176Gi ± 0% -34.23% (p=0.000 n=20) 3.298Gi ± 0% -0.33% (p=0.000 n=20) 16.474Gi ± 0% +397.92% (p=0.000 n=20) geomean 5.265Gi 1.073Gi -79.62% 2.524Gi -52.07% 4.242Gi -19.44% os: Linux arch: x86_64 cpu: │ strcspn.glibc.out │ │ sec/op │ Short0 49.97µ ± 0% Mid0 13.48µ ± 0% Long0 1.033µ ± 0% Short1 70.64µ ± 0% Mid1 21.18µ ± 0% Long1 7.920µ ± 0% Short5 70.53µ ± 0% Mid5 21.18µ ± 0% Long5 7.920µ ± 0% Short20 201.2µ ± 0% Mid20 78.00µ ± 0% Long20 35.65µ ± 0% Short40 253.6µ ± 0% Mid40 92.19µ ± 0% Long40 35.68µ ± 0% geomean 32.40µ │ strcspn.glibc.out │ │ B/s │ Short0 2.330Gi ± 0% Mid0 8.639Gi ± 0% Long0 112.7Gi ± 0% Short1 1.648Gi ± 0% Mid1 5.496Gi ± 0% Long1 14.70Gi ± 0% Short5 1.651Gi ± 0% Mid5 5.497Gi ± 0% Long5 14.70Gi ± 0% Short20 592.6Mi ± 0% Mid20 1.493Gi ± 0% Long20 3.266Gi ± 0% Short40 470.0Mi ± 0% Mid40 1.263Gi ± 0% Long40 3.262Gi ± 0% geomean 3.593Gi
Sponsored by: The FreeBSD Foundation