Page MenuHomeFreeBSD

lib/libc/amd64/string: add strcspn(3) scalar, x86-64-v2 implementation, unit tests
ClosedPublic

Authored by fuz on Aug 22 2023, 11:27 PM.
Tags
None
Referenced Files
Unknown Object (File)
Tue, Nov 5, 12:11 PM
Unknown Object (File)
Wed, Oct 23, 4:48 AM
Unknown Object (File)
Oct 2 2024, 11:56 PM
Unknown Object (File)
Sep 30 2024, 1:12 PM
Unknown Object (File)
Sep 14 2024, 12:20 PM
Unknown Object (File)
Sep 10 2024, 10:11 AM
Unknown Object (File)
Aug 31 2024, 12:26 AM
Unknown Object (File)
Aug 30 2024, 4:49 AM
Subscribers

Details

Summary

This changeset adds both a scalar and an x86-64-v2 implementation
of the strcspn(3) function to libc. A baseline implementation does not
appear to be feasible given the requirements of the function.

The scalar implementation is similar to the generic libc implementation,
but expands the bit set into a byte set to reduce latency, improving
performance. This approach could probably be backported to the generic
C version to benefit other platforms.

The x86-64-v2 implementation is built around the infamous pcmpistri
instruction. An alternative implementation
based on the Muła/Langdale algorithm
was prototyped, but performed worse than the pcmpistri approach except
for sets of more than 16 characters with long input strings.

All implementations provide special cases for the empty set (reduces to
strlen as well as single-character sets (reduces to strchr). The
x86-64-v2 kernel falls back to the scalar implementation for sets of more
than 32 characters. This limit could be raised by additional multiples
of 16 through the use of additional pcmpistri code paths, but I consider
this case to be too rare to be of importance.

We currently use the NetBSD test suite to cover this function. It only
contains a very rudimentary test of this function. I have therefore
written an all new set of unit tests for the FreeBSD test suite covering
many edge cases relating to alignment issues.

The new implementations are documented in simd(7) as usual.

Performance is pretty good, beating glibc except for long strings on
size 0/1 sets (the special cases), where glibc likely uses an AVX or
AVX-512 optimised strlen/strchr. The test cases are short/mid/long
as usual, with the trailing number indicating the size of the set.

os: FreeBSD
arch: amd64
cpu: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
        │ strcspn.x86-64-v2.out │            strcspn.pre.out             │          strcspn.scalar.out           │           strcspn.mula.out            │
        │        sec/op         │    sec/op     vs base                  │    sec/op     vs base                 │    sec/op     vs base                 │
Short0              51.98µ ± 1%   141.85µ ± 0%   +172.89% (p=0.000 n=20)    75.92µ ± 0%   +46.05% (p=0.000 n=20)    52.12µ ± 1%         ~ (p=0.925 n=20)
Mid0                11.10µ ± 0%    78.30µ ± 0%   +605.49% (p=0.000 n=20)    20.81µ ± 1%   +87.50% (p=0.000 n=20)    11.11µ ± 0%         ~ (p=0.718 n=20)
Long0               2.227µ ± 0%   48.457µ ± 0%  +2075.56% (p=0.000 n=20)    7.944µ ± 0%  +256.66% (p=0.000 n=20)    2.228µ ± 0%         ~ (p=0.355 n=20)
Short1              63.47µ ± 0%   143.43µ ± 0%   +125.97% (p=0.000 n=20)    73.99µ ± 0%   +16.57% (p=0.000 n=20)    62.53µ ± 0%    -1.49% (p=0.000 n=20)
Mid1                13.17µ ± 0%    80.17µ ± 0%   +508.70% (p=0.000 n=20)    20.63µ ± 0%   +56.63% (p=0.000 n=20)    13.21µ ± 0%    +0.32% (p=0.000 n=20)
Long1               3.295µ ± 0%   48.440µ ± 0%  +1369.95% (p=0.000 n=20)   11.034µ ± 0%  +234.84% (p=0.000 n=20)    3.293µ ± 0%         ~ (p=0.231 n=20)
Short5              65.70µ ± 0%   165.04µ ± 0%   +151.20% (p=0.000 n=20)   121.95µ ± 0%   +85.61% (p=0.000 n=20)   139.56µ ± 1%  +112.42% (p=0.000 n=20)
Mid5                14.52µ ± 0%    85.18µ ± 0%   +486.50% (p=0.000 n=20)    54.45µ ± 0%  +274.95% (p=0.000 n=20)    29.62µ ± 1%  +103.98% (p=0.000 n=20)
Long5               6.029µ ± 0%   50.073µ ± 2%   +730.53% (p=0.000 n=20)   35.285µ ± 0%  +485.24% (p=0.000 n=20)    7.032µ ± 0%   +16.64% (p=0.000 n=20)
Short20             66.58µ ± 0%   319.29µ ± 0%   +379.52% (p=0.000 n=20)   155.40µ ± 0%  +133.39% (p=0.000 n=20)   240.59µ ± 0%  +261.33% (p=0.000 n=20)
Mid20               19.94µ ± 0%   131.60µ ± 0%   +559.84% (p=0.000 n=20)    65.12µ ± 0%  +226.51% (p=0.000 n=20)    67.42µ ± 0%  +238.06% (p=0.000 n=20)
Long20             12.039µ ± 0%   63.676µ ± 0%   +428.91% (p=0.000 n=20)   35.195µ ± 0%  +192.34% (p=0.000 n=20)    6.801µ ± 0%   -43.51% (p=0.000 n=20)
Short40             178.3µ ± 2%    565.0µ ± 0%   +216.80% (p=0.000 n=20)    196.2µ ± 1%   +10.02% (p=0.000 n=20)    409.9µ ± 0%  +129.83% (p=0.000 n=20)
Mid40               72.20µ ± 1%   198.46µ ± 0%   +174.87% (p=0.000 n=20)    74.61µ ± 0%    +3.33% (p=0.000 n=20)   115.74µ ± 0%   +60.30% (p=0.000 n=20)
Long40             35.186µ ± 0%   53.497µ ± 0%    +52.04% (p=0.000 n=20)   35.302µ ± 0%    +0.33% (p=0.000 n=20)    7.066µ ± 0%   -79.92% (p=0.000 n=20)
geomean             22.11µ         108.5µ        +390.60%                   46.13µ       +108.64%                   27.44µ        +24.12%

        │ strcspn.x86-64-v2.out │           strcspn.pre.out            │          strcspn.scalar.out           │            strcspn.mula.out            │
        │          B/s          │     B/s       vs base                │      B/s       vs base                │      B/s       vs base                 │
Short0            2293.3Mi ± 1%   840.4Mi ± 0%  -63.36% (p=0.000 n=20)   1570.2Mi ± 0%  -31.53% (p=0.000 n=20)   2287.2Mi ± 1%         ~ (p=0.925 n=20)
Mid0              10.489Gi ± 0%   1.487Gi ± 0%  -85.83% (p=0.000 n=20)    5.594Gi ± 1%  -46.67% (p=0.000 n=20)   10.481Gi ± 0%         ~ (p=0.718 n=20)
Long0             52.266Gi ± 0%   2.402Gi ± 0%  -95.40% (p=0.000 n=20)   14.654Gi ± 0%  -71.96% (p=0.000 n=20)   52.260Gi ± 0%         ~ (p=0.355 n=20)
Short1            1878.1Mi ± 0%   831.1Mi ± 0%  -55.75% (p=0.000 n=20)   1611.2Mi ± 0%  -14.21% (p=0.000 n=20)   1906.5Mi ± 0%    +1.51% (p=0.000 n=20)
Mid1               8.839Gi ± 0%   1.452Gi ± 0%  -83.57% (p=0.000 n=20)    5.643Gi ± 0%  -36.16% (p=0.000 n=20)    8.811Gi ± 0%    -0.31% (p=0.000 n=20)
Long1             35.327Gi ± 0%   2.403Gi ± 0%  -93.20% (p=0.000 n=20)   10.551Gi ± 0%  -70.13% (p=0.000 n=20)   35.351Gi ± 0%         ~ (p=0.231 n=20)
Short5            1814.4Mi ± 0%   722.3Mi ± 0%  -60.19% (p=0.000 n=20)    977.5Mi ± 0%  -46.12% (p=0.000 n=20)    854.2Mi ± 1%   -52.92% (p=0.000 n=20)
Mid5               8.016Gi ± 0%   1.367Gi ± 0%  -82.95% (p=0.000 n=20)    2.138Gi ± 0%  -73.33% (p=0.000 n=20)    3.930Gi ± 1%   -50.98% (p=0.000 n=20)
Long5             19.309Gi ± 0%   2.325Gi ± 2%  -87.96% (p=0.000 n=20)    3.299Gi ± 0%  -82.91% (p=0.000 n=20)   16.555Gi ± 0%   -14.26% (p=0.000 n=20)
Short20           1790.3Mi ± 0%   373.4Mi ± 0%  -79.15% (p=0.000 n=20)    767.1Mi ± 0%  -57.15% (p=0.000 n=20)    495.5Mi ± 0%   -72.32% (p=0.000 n=20)
Mid20             5977.2Mi ± 0%   905.9Mi ± 0%  -84.84% (p=0.000 n=20)   1830.6Mi ± 0%  -69.37% (p=0.000 n=20)   1768.1Mi ± 0%   -70.42% (p=0.000 n=20)
Long20             9.670Gi ± 0%   1.828Gi ± 0%  -81.09% (p=0.000 n=20)    3.308Gi ± 0%  -65.79% (p=0.000 n=20)   17.117Gi ± 0%   +77.01% (p=0.000 n=20)
Short40            668.5Mi ± 2%   211.0Mi ± 0%  -68.43% (p=0.000 n=20)    607.6Mi ± 1%   -9.11% (p=0.000 n=20)    290.9Mi ± 0%   -56.49% (p=0.000 n=20)
Mid40             1651.1Mi ± 1%   600.7Mi ± 0%  -63.62% (p=0.000 n=20)   1597.8Mi ± 0%   -3.23% (p=0.000 n=20)   1030.0Mi ± 0%   -37.62% (p=0.000 n=20)
Long40             3.309Gi ± 0%   2.176Gi ± 0%  -34.23% (p=0.000 n=20)    3.298Gi ± 0%   -0.33% (p=0.000 n=20)   16.474Gi ± 0%  +397.92% (p=0.000 n=20)
geomean            5.265Gi        1.073Gi       -79.62%                   2.524Gi       -52.07%                   4.242Gi        -19.44%

os: Linux
arch: x86_64
cpu:
        │ strcspn.glibc.out │
        │      sec/op       │
Short0          49.97µ ± 0%
Mid0            13.48µ ± 0%
Long0           1.033µ ± 0%
Short1          70.64µ ± 0%
Mid1            21.18µ ± 0%
Long1           7.920µ ± 0%
Short5          70.53µ ± 0%
Mid5            21.18µ ± 0%
Long5           7.920µ ± 0%
Short20         201.2µ ± 0%
Mid20           78.00µ ± 0%
Long20          35.65µ ± 0%
Short40         253.6µ ± 0%
Mid40           92.19µ ± 0%
Long40          35.68µ ± 0%
geomean         32.40µ

        │ strcspn.glibc.out │
        │        B/s        │
Short0         2.330Gi ± 0%
Mid0           8.639Gi ± 0%
Long0          112.7Gi ± 0%
Short1         1.648Gi ± 0%
Mid1           5.496Gi ± 0%
Long1          14.70Gi ± 0%
Short5         1.651Gi ± 0%
Mid5           5.497Gi ± 0%
Long5          14.70Gi ± 0%
Short20        592.6Mi ± 0%
Mid20          1.493Gi ± 0%
Long20         3.266Gi ± 0%
Short40        470.0Mi ± 0%
Mid40          1.263Gi ± 0%
Long40         3.262Gi ± 0%
geomean        3.593Gi

Sponsored by: The FreeBSD Foundation

Test Plan

passes new unit tests added for this purpose.

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 53248
Build 50139: arc lint + arc unit

Event Timeline

fuz requested review of this revision.Aug 22 2023, 11:27 PM
This revision is now accepted and ready to land.Sep 8 2023, 8:10 PM