lib/libc/aarch64/string: add strcspn optimized implementation
Accepted, Public

Authored by getz on Aug 21 2024, 3:10 PM.

Details

Reviewers
fuz
emaste
andrew
Summary

This is a port of the scalar optimized variant of strcspn from amd64 to aarch64.
It uses a lookup table (LUT) to speed up the function; a SIMD variant is still
under development.
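
The lookup-table idea can be sketched in C as follows. This is only an illustration of the general technique, not the assembly from this revision: the name lut_strcspn and the one-byte-per-entry table layout are assumptions made for the example.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

/*
 * Illustrative lookup-table strcspn. lut_strcspn is a hypothetical name,
 * not a symbol from the patch.
 */
static size_t
lut_strcspn(const char *s, const char *reject)
{
	unsigned char table[UCHAR_MAX + 1];
	const unsigned char *p;

	assert(s != NULL && reject != NULL);

	/* Start with an empty set, then mark NUL so the scan always stops. */
	memset(table, 0, sizeof(table));
	table[0] = 1;

	/* Mark every byte that occurs in the reject set. */
	for (p = (const unsigned char *)reject; *p != '\0'; p++)
		table[*p] = 1;

	/* One table load per input byte, independent of the set size. */
	for (p = (const unsigned char *)s; table[*p] == 0; p++)
		;

	return ((size_t)(p - (const unsigned char *)s));
}
```

For example, lut_strcspn("hello, world", ", ") returns 5. Building the table costs a fixed amount per call, which may explain why the gains in the benchmarks are smaller for short inputs with larger sets.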

Performance benchmarks were generated with strperf, as usual.

os: FreeBSD
arch: arm64
cpu: ARM Cortex-A76 r4p1
        │ strcspnScalar │             strcspnSIMD             │
        │    sec/op     │   sec/op     vs base                │
Short0      241.7µ ± 0%   131.7µ ± 0%  -45.49% (p=0.000 n=20)
Mid0       145.39µ ± 0%   39.67µ ± 0%  -72.71% (p=0.000 n=20)
Long0     113.487µ ± 0%   4.438µ ± 0%  -96.09% (p=0.000 n=20)
Short1      246.2µ ± 0%   144.9µ ± 0%  -41.14% (p=0.000 n=20)
Mid1       146.43µ ± 0%   47.50µ ± 0%  -67.56% (p=0.000 n=20)
Long1     113.478µ ± 0%   6.594µ ± 0%  -94.19% (p=0.000 n=20)
Short5      297.5µ ± 0%   276.6µ ± 1%   -7.04% (p=0.000 n=20)
Mid5        161.6µ ± 0%   116.4µ ± 1%  -27.98% (p=0.000 n=20)
Long5      113.67µ ± 0%   68.12µ ± 1%  -40.07% (p=0.000 n=20)
Short20     522.0µ ± 0%   534.1µ ± 0%   +2.32% (p=0.000 n=20)
Mid20       225.9µ ± 0%   185.8µ ± 0%  -17.77% (p=0.000 n=20)
Long20     113.70µ ± 0%   68.21µ ± 0%  -40.00% (p=0.000 n=20)
Short40     828.1µ ± 0%   899.3µ ± 0%   +8.60% (p=0.000 n=20)
Mid40       312.4µ ± 0%   284.1µ ± 0%   -9.06% (p=0.000 n=20)
Long40     113.74µ ± 0%   68.28µ ± 1%  -39.96% (p=0.000 n=20)
geomean     200.9µ        91.70µ       -54.37%

        │ strcspnScalar │               strcspnSIMD               │
        │      B/s      │      B/s       vs base                  │
Short0     493.3Mi ± 0%    904.9Mi ± 0%    +83.45% (p=0.000 n=20)
Mid0       819.9Mi ± 0%   3004.7Mi ± 0%   +266.47% (p=0.000 n=20)
Long0      1.026Gi ± 0%   26.231Gi ± 0%  +2457.08% (p=0.000 n=20)
Short1     484.2Mi ± 0%    822.6Mi ± 0%    +69.90% (p=0.000 n=20)
Mid1       814.1Mi ± 0%   2509.8Mi ± 0%   +208.30% (p=0.000 n=20)
Long1      1.026Gi ± 0%   17.654Gi ± 0%  +1620.84% (p=0.000 n=20)
Short5     400.7Mi ± 0%    431.0Mi ± 1%     +7.58% (p=0.000 n=20)
Mid5       737.7Mi ± 0%   1024.4Mi ± 1%    +38.86% (p=0.000 n=20)
Long5      1.024Gi ± 0%    1.709Gi ± 1%    +66.87% (p=0.000 n=20)
Short20    228.4Mi ± 0%    223.2Mi ± 0%     -2.26% (p=0.000 n=20)
Mid20      527.6Mi ± 0%    641.6Mi ± 0%    +21.60% (p=0.000 n=20)
Long20     1.024Gi ± 0%    1.707Gi ± 0%    +66.68% (p=0.000 n=20)
Short40    144.0Mi ± 0%    132.6Mi ± 0%     -7.92% (p=0.000 n=20)
Mid40      381.6Mi ± 0%    419.7Mi ± 0%     +9.97% (p=0.000 n=20)
Long40     1.024Gi ± 0%    1.705Gi ± 1%    +66.57% (p=0.000 n=20)
geomean    593.2Mi         1.270Gi        +119.14%

os: FreeBSD
arch: arm64
cpu: ARM Neoverse-V1 r1p1
        │ strcspnScalar │             strcspnSIMD             │
        │    sec/op     │   sec/op     vs base                │
Short0     172.46µ ± 1%   96.38µ ± 0%  -44.12% (p=0.000 n=20)
Mid0        97.96µ ± 0%   25.10µ ± 2%  -74.37% (p=0.000 n=20)
Long0      90.099µ ± 0%   3.031µ ± 1%  -96.64% (p=0.000 n=20)
Short1      178.9µ ± 1%   130.3µ ± 0%  -27.15% (p=0.000 n=20)
Mid1       100.51µ ± 1%   31.66µ ± 0%  -68.50% (p=0.000 n=20)
Long1      90.110µ ± 0%   4.536µ ± 0%  -94.97% (p=0.000 n=20)
Short5      229.0µ ± 1%   199.5µ ± 0%  -12.86% (p=0.000 n=20)
Mid5       113.85µ ± 0%   65.27µ ± 1%  -42.67% (p=0.000 n=20)
Long5       90.14µ ± 0%   36.74µ ± 0%  -59.24% (p=0.000 n=20)
Short20     397.3µ ± 0%   459.1µ ± 1%  +15.55% (p=0.000 n=20)
Mid20       163.4µ ± 0%   132.7µ ± 1%  -18.78% (p=0.000 n=20)
Long20      90.16µ ± 0%   36.96µ ± 1%  -59.01% (p=0.000 n=20)
Short40     638.1µ ± 0%   790.6µ ± 0%  +23.91% (p=0.000 n=20)
Mid40       238.6µ ± 0%   222.4µ ± 1%   -6.80% (p=0.000 n=20)
Long40      90.19µ ± 0%   36.96µ ± 0%  -59.02% (p=0.000 n=20)
geomean     150.6µ        62.93µ       -58.22%

        │ strcspnScalar │              strcspnSIMD               │
        │     MiB/s     │    MiB/s      vs base                  │
Short0       724.8 ± 1%    1297.0 ± 0%    +78.94% (p=0.000 n=20)
Mid0        1.276k ± 0%    4.980k ± 2%   +290.25% (p=0.000 n=20)
Long0       1.387k ± 0%   41.238k ± 1%  +2872.41% (p=0.000 n=20)
Short1       698.8 ± 1%     959.2 ± 0%    +37.27% (p=0.000 n=20)
Mid1        1.244k ± 1%    3.948k ± 0%   +217.45% (p=0.000 n=20)
Long1       1.387k ± 0%   27.557k ± 0%  +1886.50% (p=0.000 n=20)
Short5       545.9 ± 1%     626.5 ± 0%    +14.76% (p=0.000 n=20)
Mid5        1.098k ± 0%    1.915k ± 1%    +74.43% (p=0.000 n=20)
Long5       1.387k ± 0%    3.402k ± 0%   +145.35% (p=0.000 n=20)
Short20      314.6 ± 0%     272.3 ± 1%    -13.46% (p=0.000 n=20)
Mid20        765.2 ± 0%     942.1 ± 1%    +23.12% (p=0.000 n=20)
Long20      1.386k ± 0%    3.382k ± 1%   +143.94% (p=0.000 n=20)
Short40      195.9 ± 0%     158.1 ± 0%    -19.29% (p=0.000 n=20)
Mid40        523.9 ± 0%     562.1 ± 1%     +7.29% (p=0.000 n=20)
Long40      1.386k ± 0%    3.382k ± 0%   +144.03% (p=0.000 n=20)
geomean      829.9         1.986k        +139.35%

os: FreeBSD
arch: arm64
cpu: ARM Cortex-A78C r0p0
        │ strcspnScalar │             strcspnSIMD             │
        │    sec/op     │   sec/op     vs base                │
Short0      335.0µ ± 0%   174.1µ ± 0%  -48.03% (p=0.000 n=20)
Mid0       199.72µ ± 1%   53.89µ ± 0%  -73.02% (p=0.000 n=20)
Long0     169.648µ ± 0%   5.949µ ± 0%  -96.49% (p=0.000 n=20)
Short1      339.7µ ± 0%   231.1µ ± 0%  -31.97% (p=0.000 n=20)
Mid1       200.14µ ± 0%   68.14µ ± 0%  -65.95% (p=0.000 n=20)
Long1      169.65µ ± 0%   10.07µ ± 0%  -94.06% (p=0.000 n=20)
Short5      389.4µ ± 0%   304.6µ ± 1%  -21.78% (p=0.000 n=20)
Mid5        215.5µ ± 0%   123.6µ ± 1%  -42.67% (p=0.000 n=20)
Long5      169.64µ ± 0%   75.37µ ± 0%  -55.57% (p=0.000 n=20)
Short20     686.7µ ± 0%   356.2µ ± 1%  -48.12% (p=0.000 n=20)
Mid20       314.8µ ± 0%   136.2µ ± 0%  -56.73% (p=0.000 n=20)
Long20     169.66µ ± 0%   75.30µ ± 0%  -55.62% (p=0.000 n=20)
Short40    1187.6µ ± 0%   440.5µ ± 1%  -62.91% (p=0.000 n=20)
Mid40       458.0µ ± 0%   161.0µ ± 0%  -64.85% (p=0.000 n=20)
Long40     169.79µ ± 0%   75.27µ ± 0%  -55.67% (p=0.000 n=20)
geomean     284.0µ        95.35µ       -66.43%

        │ strcspnScalar │               strcspnSIMD                │
        │      B/s      │      B/s        vs base                  │
Short0     355.8Mi ± 0%     684.7Mi ± 0%    +92.42% (p=0.000 n=20)
Mid0       596.9Mi ± 1%    2211.9Mi ± 0%   +270.58% (p=0.000 n=20)
Long0      702.7Mi ± 0%   20038.6Mi ± 0%  +2751.73% (p=0.000 n=20)
Short1     350.9Mi ± 0%     515.8Mi ± 0%    +47.00% (p=0.000 n=20)
Mid1       595.6Mi ± 0%    1749.4Mi ± 0%   +193.70% (p=0.000 n=20)
Long1      702.7Mi ± 0%   11833.8Mi ± 0%  +1584.08% (p=0.000 n=20)
Short5     306.1Mi ± 0%     391.3Mi ± 1%    +27.84% (p=0.000 n=20)
Mid5       553.1Mi ± 0%     964.8Mi ± 1%    +74.43% (p=0.000 n=20)
Long5      702.7Mi ± 0%    1581.6Mi ± 0%   +125.08% (p=0.000 n=20)
Short20    173.6Mi ± 0%     334.7Mi ± 1%    +92.77% (p=0.000 n=20)
Mid20      378.7Mi ± 0%     875.1Mi ± 0%   +131.08% (p=0.000 n=20)
Long20     702.6Mi ± 0%    1583.1Mi ± 0%   +125.31% (p=0.000 n=20)
Short40    100.4Mi ± 0%     270.6Mi ± 1%   +169.61% (p=0.000 n=20)
Mid40      260.3Mi ± 0%     740.5Mi ± 0%   +184.52% (p=0.000 n=20)
Long40     702.1Mi ± 0%    1583.8Mi ± 0%   +125.58% (p=0.000 n=20)
geomean    419.7Mi          1.221Gi        +197.86%
Test Plan

Passes all tests in the test suite.

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 59154
Build 56041: arc lint + arc unit

Event Timeline

getz requested review of this revision. Aug 21 2024, 3:10 PM

Same hints as for strspn from D46396 apply. Code looks reasonable.

lib/libc/aarch64/string/strcspn.S, line 96

We already have 256 bytes allocated on the stack. No need to decrement sp further, just write into that area.

  • Update based on review

Uses two SIMD stores for zeroing the table; the idea is that most aarch64 uarches have at least two SIMD execution pipelines.
Benchmarks showed that this was the optimal arrangement.
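
One possible reading of the two-store arrangement, sketched in C with NEON intrinsics rather than the assembly from this revision. The helper name zero_table, the loop shape and the portable fallback are assumptions for illustration; the actual patch may lay out its stores differently.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#if defined(__aarch64__)
#include <arm_neon.h>
#endif

#define TABLE_SIZE 256

/* zero_table is a hypothetical helper, not code from the patch. */
static void
zero_table(unsigned char *table)
{
	assert(table != NULL);
#if defined(__aarch64__)
	const uint8x16_t zero = vdupq_n_u8(0);
	size_t i;

	/*
	 * Two independent 16-byte stores per iteration, so a core with
	 * two SIMD pipelines has the chance to issue them in parallel.
	 */
	for (i = 0; i < TABLE_SIZE; i += 32) {
		vst1q_u8(table + i, zero);
		vst1q_u8(table + i + 16, zero);
	}
#else
	/* Portable fallback so the sketch also builds on other targets. */
	memset(table, 0, TABLE_SIZE);
#endif
}
```

Whether this pairing is actually faster depends on the core, so it should be measured on the target hardware, as was done for this revision.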

os: FreeBSD
arch: arm64
cpu: ARM Cortex-A76 r4p1
        │ strcspnScalar │             strcspnSIMD             │
        │    sec/op     │   sec/op     vs base                │
Short0      241.7µ ± 0%   131.2µ ± 0%  -45.72% (p=0.000 n=20)
Mid0       145.39µ ± 0%   39.42µ ± 0%  -72.89% (p=0.000 n=20)
Long0     113.487µ ± 0%   4.435µ ± 0%  -96.09% (p=0.000 n=20)
Short1      246.2µ ± 0%   150.8µ ± 0%  -38.74% (p=0.000 n=20)
Mid1       146.43µ ± 0%   48.01µ ± 0%  -67.21% (p=0.000 n=20)
Long1     113.478µ ± 0%   6.596µ ± 0%  -94.19% (p=0.000 n=20)
Short5      297.5µ ± 0%   266.9µ ± 0%  -10.31% (p=0.000 n=20)
Mid5        161.6µ ± 0%   114.0µ ± 0%  -29.45% (p=0.000 n=20)
Long5      113.67µ ± 0%   66.11µ ± 0%  -41.84% (p=0.000 n=20)
Short20     522.0µ ± 0%   519.7µ ± 1%        ~ (p=0.201 n=20)
Mid20       225.9µ ± 0%   182.5µ ± 0%  -19.23% (p=0.000 n=20)
Long20     113.70µ ± 0%   66.15µ ± 0%  -41.82% (p=0.000 n=20)
Short40     828.1µ ± 0%   868.5µ ± 1%   +4.88% (p=0.000 n=20)
Mid40       312.4µ ± 0%   276.3µ ± 1%  -11.55% (p=0.000 n=20)
Long40     113.74µ ± 0%   66.17µ ± 0%  -41.82% (p=0.000 n=20)
geomean     200.9µ        90.38µ       -55.02%

        │ strcspnScalar │               strcspnSIMD               │
        │      B/s      │      B/s       vs base                  │
Short0     493.3Mi ± 0%    908.7Mi ± 0%    +84.22% (p=0.000 n=20)
Mid0       819.9Mi ± 0%   3024.2Mi ± 0%   +268.85% (p=0.000 n=20)
Long0      1.026Gi ± 0%   26.246Gi ± 0%  +2458.62% (p=0.000 n=20)
Short1     484.2Mi ± 0%    790.3Mi ± 0%    +63.23% (p=0.000 n=20)
Mid1       814.1Mi ± 0%   2483.1Mi ± 0%   +205.01% (p=0.000 n=20)
Long1      1.026Gi ± 0%   17.650Gi ± 0%  +1620.51% (p=0.000 n=20)
Short5     400.7Mi ± 0%    446.7Mi ± 0%    +11.49% (p=0.000 n=20)
Mid5       737.7Mi ± 0%   1045.7Mi ± 0%    +41.75% (p=0.000 n=20)
Long5      1.024Gi ± 0%    1.761Gi ± 0%    +71.95% (p=0.000 n=20)
Short20    228.4Mi ± 0%    229.4Mi ± 1%          ~ (p=0.201 n=20)
Mid20      527.6Mi ± 0%    653.2Mi ± 0%    +23.80% (p=0.000 n=20)
Long20     1.024Gi ± 0%    1.760Gi ± 0%    +71.89% (p=0.000 n=20)
Short40    144.0Mi ± 0%    137.3Mi ± 1%     -4.64% (p=0.000 n=20)
Mid40      381.6Mi ± 0%    431.5Mi ± 1%    +13.07% (p=0.000 n=20)
Long40     1.024Gi ± 0%    1.759Gi ± 0%    +71.88% (p=0.000 n=20)
geomean    593.2Mi         1.288Gi        +122.33%

os: FreeBSD
arch: arm64
cpu: ARM Neoverse-V1 r1p1
        │ strcspnScalar │             strcspnSIMD             │
        │    sec/op     │   sec/op     vs base                │
Short0     172.46µ ± 1%   96.05µ ± 0%  -44.30% (p=0.000 n=20)
Mid0        97.96µ ± 0%   25.04µ ± 2%  -74.44% (p=0.000 n=20)
Long0      90.099µ ± 0%   2.993µ ± 1%  -96.68% (p=0.000 n=20)
Short1      178.9µ ± 1%   133.5µ ± 0%  -25.37% (p=0.000 n=20)
Mid1       100.51µ ± 1%   31.88µ ± 0%  -68.28% (p=0.000 n=20)
Long1      90.110µ ± 0%   4.543µ ± 0%  -94.96% (p=0.000 n=20)
Short5      229.0µ ± 1%   197.0µ ± 0%  -13.95% (p=0.000 n=20)
Mid5       113.85µ ± 0%   63.94µ ± 1%  -43.84% (p=0.000 n=20)
Long5       90.14µ ± 0%   36.95µ ± 1%  -59.01% (p=0.000 n=20)
Short20     397.3µ ± 0%   447.1µ ± 1%  +12.53% (p=0.000 n=20)
Mid20       163.4µ ± 0%   130.5µ ± 1%  -20.10% (p=0.000 n=20)
Long20      90.16µ ± 0%   36.89µ ± 1%  -59.09% (p=0.000 n=20)
Short40     638.1µ ± 0%   780.3µ ± 0%  +22.29% (p=0.000 n=20)
Mid40       238.6µ ± 0%   218.7µ ± 0%   -8.34% (p=0.000 n=20)
Long40      90.19µ ± 0%   37.03µ ± 0%  -58.94% (p=0.000 n=20)
geomean     150.6µ        62.57µ       -58.46%

        │ strcspnScalar │              strcspnSIMD               │
        │     MiB/s     │    MiB/s      vs base                  │
Short0       724.8 ± 1%    1301.3 ± 0%    +79.54% (p=0.000 n=20)
Mid0        1.276k ± 0%    4.993k ± 2%   +291.28% (p=0.000 n=20)
Long0       1.387k ± 0%   41.765k ± 1%  +2910.41% (p=0.000 n=20)
Short1       698.8 ± 1%     936.2 ± 0%    +33.99% (p=0.000 n=20)
Mid1        1.244k ± 1%    3.920k ± 0%   +215.24% (p=0.000 n=20)
Long1       1.387k ± 0%   27.515k ± 0%  +1883.49% (p=0.000 n=20)
Short5       545.9 ± 1%     634.4 ± 0%    +16.21% (p=0.000 n=20)
Mid5        1.098k ± 0%    1.955k ± 1%    +78.06% (p=0.000 n=20)
Long5       1.387k ± 0%    3.383k ± 1%   +143.98% (p=0.000 n=20)
Short20      314.6 ± 0%     279.6 ± 1%    -11.13% (p=0.000 n=20)
Mid20        765.2 ± 0%     957.7 ± 1%    +25.16% (p=0.000 n=20)
Long20      1.386k ± 0%    3.389k ± 1%   +144.43% (p=0.000 n=20)
Short40      195.9 ± 0%     160.2 ± 0%    -18.23% (p=0.000 n=20)
Mid40        523.9 ± 0%     571.6 ± 0%     +9.10% (p=0.000 n=20)
Long40      1.386k ± 0%    3.376k ± 0%   +143.57% (p=0.000 n=20)
geomean      829.9         1.998k        +140.73%

exp-run says it's fine.

This revision is now accepted and ready to land. Wed, Nov 6, 2:23 PM