
SIMD-enhanced strchrnul(3)
Closed, Public

Authored by fuz on Aug 6 2023, 12:05 AM.

Details

Summary

This DR adds a scalar and a baseline strchrnul(3) implementation for
amd64. This improves the performance of strchrnul(3), strchr(3), and
index(3). Benchmarks similar to those shown in D40693 show good results
(here "pre" refers to the generic C version used previously):

os: FreeBSD
arch: amd64
cpu: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
        │ strchrnul_pre.out │         strchrnul_scalar.out         │       strchrnul_baseline.out        │
        │      sec/op       │    sec/op     vs base                │   sec/op     vs base                │
Short          129.68µ ± 3%    59.91µ ± 1%  -53.80% (p=0.000 n=20)   44.37µ ± 1%  -65.79% (p=0.000 n=20)
Mid             21.15µ ± 0%    19.30µ ± 0%   -8.76% (p=0.000 n=20)   12.30µ ± 0%  -41.85% (p=0.000 n=20)
Long           13.772µ ± 0%   11.028µ ± 0%  -19.92% (p=0.000 n=20)   3.285µ ± 0%  -76.15% (p=0.000 n=20)
geomean         33.55µ         23.36µ       -30.37%                  12.15µ       -63.80%

        │ strchrnul_pre.out │          strchrnul_scalar.out          │         strchrnul_baseline.out         │
        │        B/s        │      B/s       vs base                 │      B/s       vs base                 │
Short          919.3Mi ± 3%   1989.7Mi ± 1%  +116.45% (p=0.000 n=20)   2686.8Mi ± 1%  +192.28% (p=0.000 n=20)
Mid            5.505Gi ± 0%    6.033Gi ± 0%    +9.60% (p=0.000 n=20)    9.466Gi ± 0%   +71.97% (p=0.000 n=20)
Long           8.453Gi ± 0%   10.557Gi ± 0%   +24.88% (p=0.000 n=20)   35.441Gi ± 0%  +319.26% (p=0.000 n=20)
geomean        3.470Gi         4.983Gi        +43.62%                   9.584Gi       +176.22%

The benchmarks always check strings that do not have the character we
are looking for. As the code does not distinguish a NUL match from
a match against the searched character, this should not make a difference
in the performance measured.
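To illustrate why a NUL match and a character match can cost the same, here is a minimal C sketch. It is not the amd64 assembly from this revision: the names zero_bytes, first_hit, sketch_strchrnul and sketch_strchr are hypothetical, the word-at-a-time helper assumes a little-endian machine and the GCC/Clang builtin __builtin_ctzll, and the wrapper only shows one plausible way strchr(3) could be layered on strchrnul(3).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ONES  UINT64_C(0x0101010101010101)
#define HIGHS UINT64_C(0x8080808080808080)

/*
 * Nonzero iff some byte of x is zero.  The lowest set bit of the result
 * sits in the first (least significant) zero byte; higher bits may be
 * spurious, so only the lowest one is used below.
 */
static uint64_t zero_bytes(uint64_t x)
{
    return (x - ONES) & ~x & HIGHS;
}

/*
 * Index (0..7) of the first byte in an 8-byte chunk that is either NUL
 * or equal to c, or 8 if there is none.  Both conditions are folded into
 * one mask, so the caller cannot tell (and need not care) which one hit.
 * Assumption: little-endian byte order.
 */
static unsigned first_hit(const unsigned char chunk[8], unsigned char c)
{
    uint64_t w, mask;

    memcpy(&w, chunk, sizeof w);
    mask = zero_bytes(w) | zero_bytes(w ^ (ONES * c));
    if (mask == 0)
        return 8;
    return (unsigned)__builtin_ctzll(mask) / 8;
}

/* Byte-at-a-time reference with the same single exit for both cases. */
static char *sketch_strchrnul(const char *s, int c)
{
    unsigned char ch = (unsigned char)c;

    assert(s != NULL);
    while (*s != '\0' && (unsigned char)*s != ch)
        s++;
    return (char *)s;
}

/* strchr expressed through strchrnul: one extra compare at the end. */
static char *sketch_strchr(const char *s, int c)
{
    char *p = sketch_strchrnul(s, c);

    return ((unsigned char)*p == (unsigned char)c) ? p : NULL;
}
```

In this sketch the search ends on whichever of the two conditions comes first, which is why benchmarking only strings without the searched character should be representative.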

Piggybacking on this DR we also remove mentions of x86-64-v3 and v4
versions of strlen(3) which ended up not being committed in D40693.

Sponsored by: The FreeBSD Foundation

Test Plan

passes the test suite, leads to a stable system

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 52992
Build 49883: arc lint + arc unit

Event Timeline

fuz requested review of this revision. Aug 6 2023, 12:05 AM
fuz retitled this revision from Sponsored by: FreeBSD Foundation to SIMD-enhanced strchrnul(3). Aug 6 2023, 12:06 AM
fuz edited the summary of this revision.

so how does this bench against glibc?

In D41333#941221, @mjg wrote:

so how does this bench against glibc?

Here's the performance of strchrnul in an Ubuntu chroot on the same system:

        │ strchrnul_glibc.out │
        │       sec/op        │
Short             49.73µ ± 0%
Mid               14.60µ ± 0%
Long              1.237µ ± 0%
geomean           9.646µ

        │ strchrnul_glibc.out │
        │         B/s         │
Short            2.341Gi ± 0%
Mid              7.976Gi ± 0%
Long             94.14Gi ± 0%
geomean          12.07Gi

Observe that the baseline implementation beats glibc on the Short and Mid benchmarks, while glibc is faster on Long.

glibc result should be included, but also more specific -- this is clearly not a win across the board

regardless, priority is memset, memcpy, strcpy, strncpy and similar. strwhocares are optional targets.

This revision is now accepted and ready to land. Aug 6 2023, 1:25 PM

The faster glibc result is likely due to glibc switching to AVX or AVX-512 for longer strings. I will revisit this routine for AVX and AVX-512 once I am done with SSE for all routines.