This commit adds a baseline implementation of stpcpy(3) for amd64.
It performs quite well in comparison to the previous scalar implementation
as well as against bionic and glibc (though glibc is faster for very long
strings).
Adjust the Makefile so that strcpy(3) also calls into the optimised
stpcpy(3) code. Extend the strcpy(3) test case from NetBSD's test suite
to cover longer strings, which was needed to catch some bugs. Also make
it so the test case can be executed on a custom stpcpy() instead of
hardcoding the one shipped in libc.
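
For illustration only, here is a minimal C sketch of the two ideas above:
strcpy() expressed through stpcpy(), and a test that takes the function
under test as a parameter. The names strcpy_via_stpcpy, stpcpy_fn and
check_stpcpy are hypothetical and do not appear in this change; the
actual wiring is done in the Makefile and the assembly source, and the
actual test is the extended NetBSD test case.

```c
#define _POSIX_C_SOURCE 200809L	/* for the stpcpy(3) declaration */
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical type: lets a test swap in a custom stpcpy(). */
typedef char *(*stpcpy_fn)(char *restrict, const char *restrict);

/* Hypothetical wrapper: strcpy() built on top of stpcpy(). */
char *
strcpy_via_stpcpy(char *restrict dst, const char *restrict src)
{
	/* stpcpy() returns the end of the copy; strcpy() returns dst. */
	(void)stpcpy(dst, src);
	return (dst);
}

/*
 * Run fn on strings of every length in [0, maxlen] and return the
 * number of lengths for which the result was wrong.
 */
int
check_stpcpy(stpcpy_fn fn, size_t maxlen)
{
	char src[512], dst[sizeof(src) + 1];
	char *end;
	size_t len;
	int failures = 0;

	assert(maxlen < sizeof(src));
	for (len = 0; len <= maxlen; len++) {
		memset(src, 'a', len);
		src[len] = '\0';
		memset(dst, 'X', sizeof(dst));

		end = fn(dst, src);

		/*
		 * The return value must point at the copied NUL, the
		 * copy must match, and the byte after it must be intact.
		 */
		if (end != dst + len ||
		    memcmp(dst, src, len + 1) != 0 ||
		    dst[len + 1] != 'X')
			failures++;
	}
	return (failures);
}
```

Calling check_stpcpy(stpcpy, 300) exercises the libc routine, and any
other function with the same signature can be passed instead. Lengths
well beyond one vector width matter here because that is where bugs in
a SIMD loop tend to show up.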
Document the new kernel in simd(7).
As per previous discussion in D40693, I've left out the Foundation
copyright in the NetBSD test suite bits. @jrm @imp, please let me know
if this is correct. For stpcpy.S, I added the copyright as @mjg
previously indicated that he is fine with this.
Benchmarks available online:
os: FreeBSD
arch: amd64
cpu: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
        │ stpcpy_baseline.out │          stpcpy_bionic.out           │          stpcpy_scalar.out           │
        │       sec/op        │    sec/op     vs base                │    sec/op     vs base                │
Short            77.93µ ± 2%    100.12µ ±  0%   +28.46% (p=0.000 n=20)    80.30µ ± 0%    +3.04% (p=0.000 n=20)
Mid              13.01µ ± 1%     25.29µ ±  1%   +94.46% (p=0.000 n=20)    36.41µ ± 0%  +179.96% (p=0.000 n=20)
Long             3.032µ ± 0%     3.084µ ± 12%         ~ (p=0.289 n=20)   33.099µ ± 0%  +991.70% (p=0.000 n=20)
geomean          14.54µ          19.84µ         +36.45%                   45.91µ       +215.79%

        │ stpcpy_baseline.out │          stpcpy_bionic.out           │          stpcpy_scalar.out           │
        │         B/s         │     B/s       vs base                │     B/s       vs base                │
Short           1.494Gi ± 2%    1.163Gi ±  0%   -22.16% (p=0.000 n=20)   1.450Gi ± 1%    -2.95% (p=0.000 n=20)
Mid             8.951Gi ± 1%    4.603Gi ±  1%   -48.57% (p=0.000 n=20)   3.197Gi ± 0%   -64.28% (p=0.000 n=20)
Long           38.397Gi ± 0%   37.754Gi ± 13%         ~ (p=0.289 n=20)   3.517Gi ± 0%   -90.84% (p=0.000 n=20)
geomean         8.007Gi         5.868Gi         -26.71%                  2.536Gi        -68.33%

os: Linux
arch: x86_64
cpu:
        │ stpcpy_glibc.out │
        │      sec/op      │
Short          93.23µ ± 0%
Mid            16.61µ ± 1%
Long           2.623µ ± 0%
geomean        15.96µ

        │ stpcpy_glibc.out │
        │       B/s        │
Short         1.249Gi ± 0%
Mid           7.008Gi ± 1%
Long          44.38Gi ± 0%
geomean       7.296Gi

Sponsored by: The FreeBSD Foundation