Page MenuHomeFreeBSD

lib/libc/amd64/string/strcmp.S: add baseline implementation
ClosedPublic

Authored by fuz on Sep 25 2023, 6:46 AM.
Tags
None
Referenced Files
Unknown Object (File)
Thu, Nov 14, 4:30 AM
Unknown Object (File)
Tue, Nov 12, 3:25 AM
Unknown Object (File)
Sun, Nov 10, 10:38 PM
Unknown Object (File)
Oct 4 2024, 7:20 AM
Unknown Object (File)
Oct 2 2024, 7:13 PM
Unknown Object (File)
Oct 2 2024, 7:13 PM
Unknown Object (File)
Oct 2 2024, 7:13 PM
Unknown Object (File)
Oct 2 2024, 7:00 PM

Details

Summary

This is the most complicated one so far. The basic idea is to process
the bulk of the string in aligned blocks of 16 bytes such that one
string runs ahead and the other runs behind. The string that runs ahead
is checked for NUL bytes, the one that runs behind is compared with the
corresponding chunk of the string that runs ahead. This trades an extra
load per iteration for the very complicated block-reassembly needed in
the other implementations (bionic, glibc). On the flip side, we need
two code paths depending on the relative alignment of the two buffers.

The initial part of the string is compared directly if it is known not
to cross a page boundary. Otherwise, a complex slow path to avoid
crossing into unmapped memory commences.

Performance-wise we beat bionic for misaligned strings (i.e. the strings
do not share an alignment offset) and reach comparable performance for
aligned strings. glibc is a bit better as it has a special kernel for
AVX-512, where this stuff is a bit easier to do.

See the strperf repo for benchmarks:

os: FreeBSD
arch: amd64
cpu: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
                     │ strcmp.scalar.out │         strcmp.baseline.out         │          strcmp.bionic.out          │
                     │      sec/op       │   sec/op     vs base                │   sec/op     vs base                │
StrcmpShortAligned          122.09µ ± 2%   49.75µ ± 0%  -59.25% (p=0.000 n=20)   45.21µ ± 1%  -62.97% (p=0.000 n=20)
StrcmpMidAligned             28.27µ ± 1%   11.97µ ± 0%  -57.64% (p=0.000 n=20)   10.06µ ± 0%  -64.42% (p=0.000 n=20)
StrcmpLongAligned            9.227µ ± 0%   4.255µ ± 0%  -53.89% (p=0.000 n=20)   4.554µ ± 0%  -50.64% (p=0.000 n=20)
StrcmpShortUnaligned        145.67µ ± 1%   53.95µ ± 1%  -62.96% (p=0.000 n=20)   89.16µ ± 1%  -38.80% (p=0.000 n=20)
StrcmpMidUnaligned           83.76µ ± 0%   11.94µ ± 0%  -85.75% (p=0.000 n=20)   22.63µ ± 0%  -72.98% (p=0.000 n=20)
StrcmpLongUnaligned         68.025µ ± 0%   4.294µ ± 0%  -93.69% (p=0.000 n=20)   5.308µ ± 0%  -92.20% (p=0.000 n=20)
geomean                      54.58µ        13.83µ       -74.65%                  16.76µ       -69.29%

                     │ strcmp.scalar.out │           strcmp.baseline.out           │            strcmp.bionic.out            │
                     │        B/s        │      B/s       vs base                  │      B/s       vs base                  │
StrcmpShortAligned          976.4Mi ± 2%   2396.2Mi ± 0%   +145.40% (p=0.000 n=20)   2637.0Mi ± 1%   +170.07% (p=0.000 n=20)
StrcmpMidAligned            4.119Gi ± 1%    9.724Gi ± 0%   +136.10% (p=0.000 n=20)   11.577Gi ± 0%   +181.09% (p=0.000 n=20)
StrcmpLongAligned           12.62Gi ± 0%    27.36Gi ± 0%   +116.86% (p=0.000 n=20)    25.56Gi ± 0%   +102.60% (p=0.000 n=20)
StrcmpShortUnaligned        818.3Mi ± 1%   2209.6Mi ± 1%   +170.01% (p=0.000 n=20)   1337.1Mi ± 1%    +63.39% (p=0.000 n=20)
StrcmpMidUnaligned          1.390Gi ± 0%    9.753Gi ± 0%   +601.71% (p=0.000 n=20)    5.144Gi ± 0%   +270.12% (p=0.000 n=20)
StrcmpLongUnaligned         1.711Gi ± 0%   27.113Gi ± 0%  +1484.30% (p=0.000 n=20)   21.931Gi ± 0%  +1181.47% (p=0.000 n=20)
geomean                     2.133Gi         8.416Gi        +294.54%                   6.946Gi        +225.62%

os: Linux
arch: x86_64
cpu:
                     │ strcmp.glibc.out │
                     │      sec/op      │
StrcmpShortAligned          36.57µ ± 4%
StrcmpMidAligned            13.96µ ± 1%
StrcmpLongAligned           2.324µ ± 0%
StrcmpShortUnaligned        41.27µ ± 3%
StrcmpMidUnaligned          20.77µ ± 1%
StrcmpLongUnaligned         2.324µ ± 1%
geomean                     11.54µ

                     │ strcmp.glibc.out │
                     │       B/s        │
StrcmpShortAligned         3.183Gi ± 4%
StrcmpMidAligned           8.338Gi ± 1%
StrcmpLongAligned          50.10Gi ± 0%
StrcmpShortUnaligned       2.821Gi ± 3%
StrcmpMidUnaligned         5.604Gi ± 1%
StrcmpLongUnaligned        50.08Gi ± 1%
geomean                    10.09Gi

Sponsored by: The FreeBSD Foundation

Test Plan

Passes extended unit tests from D41970.
No new ATF test suite failures could be observed elsewhere.

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

fuz requested review of this revision.Sep 25 2023, 6:46 AM

Function passes glibc test suite. Or so it seems; the glibc test suite is with as everything GNU touches, complicated and hard to understand. It is not clear to me if I actually managed to hack in my own strcmp, but it seems like it has worked.

In D41971#957467, @fuz wrote:

Function passes glibc test suite. Or so it seems; the glibc test suite is with as everything GNU touches, complicated and hard to understand. It is not clear to me if I actually managed to hack in my own strcmp, but it seems like it has worked.

Did you changed glibc strcmp() implementation with your own? If yes, add INT3 somewhere, to see that it was hit.

The glibc has four different strcmp implementations. I patched in my own strcmp as strcmp by replacing the object file before linking the shared object libc.so.6 and it seems to be hit during execution (at least the debugger says so). However, it is not clear to me from reading the code if that function is actually tested or if it's just the individual implementations. It's macro hell all the way down and I'm not sure how to patch in my own function correctly.

I have not yet tested this on a really baseline AMD64 processor but feel today, Christmas, is a great day to do silly stuff.

I hope to report back some test results from a really really dirt low baseline AMD64 type machine :

Copyright (c) 1992-2023 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
        The Regents of the University of California. All rights reserved.
FreeBSD is a registered trademark of The FreeBSD Foundation.
FreeBSD 15.0-CURRENT #0 main-n266973-ca39f23347e1: Sat Dec 16 04:03:34 UTC 2023
    root@releng3.nyi.freebsd.org:/usr/obj/usr/src/amd64.amd64/sys/GENERIC amd64
FreeBSD clang version 17.0.6 (https://github.com/llvm/llvm-project.git llvmorg-17.0.6-0-g6009708b4367)
WARNING: WITNESS option enabled, expect reduced performance.
VT(vga): resolution 640x480
CPU: AMD E-450 APU with Radeon(tm) HD Graphics (1645.96-MHz K8-class CPU)
  Origin="AuthenticAMD"  Id=0x500f20  Family=0x14  Model=0x2  Stepping=0
  Features=0x178bfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,MMX,FXSR,SSE,SSE2,HTT>
  Features2=0x802209<SSE3,MON,SSSE3,CX16,POPCNT>
  AMD Features=0x2e500800<SYSCALL,NX,MMX+,FFXSR,Page1GB,RDTSCP,LM>
  AMD Features2=0x35ff<LAHF,CMP,SVM,ExtAPIC,CR8,ABM,SSE4A,MAS,Prefetch,IBS,SKINIT,WDT>
  SVM: NP,NRIP,NAsids=8
  TSC: P-state invariant, performance statistics
real memory  = 6442450944 (6144 MB)
avail memory = 5772414976 (5505 MB)
Event timer "LAPIC" quality 100
ACPI APIC Table: <INSYDE IS CRB  >
FreeBSD/SMP: Multiprocessor System Detected: 2 CPUs
FreeBSD/SMP: 1 package(s) x 2 core(s)
.
.
.
etc etc etc

Looking at the above we see a processor that is at least twelve years old and has very few fancy opcode features.

This should be interesting.

Dennis