lib/libc/aarch64/string: add strcmp SIMD implementation
This changeset ports the amd64 SIMD implementation of strcmp to
AArch64.
The method, as described in D41971, is as follows.
The basic idea is to process the bulk of the string in aligned
blocks of 16 bytes such that one string runs ahead and the other
runs behind. The string that runs ahead is checked for NUL bytes,
the one that runs behind is compared with the corresponding chunk
of the string that runs ahead. This costs an extra load per
iteration but avoids the very complicated block reassembly needed
in other implementations (bionic, glibc). In exchange, we need
two code paths depending on the relative alignment of the two
buffers.
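
The per-block step can be illustrated with a scalar C model. This is
only a sketch and not the code in this changeset: the function name
strcmp_blockwise_model is hypothetical, the blocks are not aligned,
and the ahead/behind offset and the two alignment-dependent code
paths are left out. It shows only the order of operations: check one
string's block for NUL, then compare the other string against it.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK 16	/* bytes per block, as in the description above */

/* Hypothetical scalar model; the real routine uses SIMD loads. */
static int
strcmp_blockwise_model(const char *a, const char *b)
{
	const unsigned char *ua = (const unsigned char *)a;
	const unsigned char *ub = (const unsigned char *)b;

	assert(a != NULL && b != NULL);

	for (;;) {
		size_t i, limit = BLOCK;
		int found_nul = 0;

		/* Step 1: look for a NUL in this block of string a. */
		for (i = 0; i < BLOCK; i++) {
			if (ua[i] == '\0') {
				limit = i + 1;	/* compare the NUL too */
				found_nul = 1;
				break;
			}
		}

		/*
		 * Step 2: compare string b against that block.  Stopping
		 * at the first mismatch means b is never read past its
		 * own terminator.
		 */
		for (i = 0; i < limit; i++) {
			if (ua[i] != ub[i])
				return ((int)ua[i] - (int)ub[i]);
		}

		if (found_nul)
			return (0);

		ua += BLOCK;
		ub += BLOCK;
	}
}
```

The model reads bytes one at a time so that it stays within both
strings; the SIMD version instead relies on aligned 16-byte loads,
which cannot cross a page boundary.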
The initial part of the string is compared directly if it is known
not to cross a page boundary. Otherwise, a more complex slow path
is taken to avoid reading into unmapped memory.
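
A page-crossing test of this kind can be sketched in C as below. The
4096-byte page size and the names PAGE_SIZE_ASSUMED, BLOCK_SIZE and
block_load_crosses_page are assumptions made for illustration; the
assembly in this changeset may perform the check differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE_ASSUMED	4096u	/* assumption: 4 KiB pages */
#define BLOCK_SIZE		16u	/* one 16-byte SIMD load */

/*
 * Return true if a BLOCK_SIZE-byte load starting at addr would touch
 * the following page.  Hypothetical helper, not part of libc.
 */
static bool
block_load_crosses_page(uintptr_t addr)
{
	uintptr_t offset;

	/* The mask below only works for power-of-two page sizes. */
	assert((PAGE_SIZE_ASSUMED & (PAGE_SIZE_ASSUMED - 1)) == 0);

	offset = addr & (PAGE_SIZE_ASSUMED - 1);
	return (offset > PAGE_SIZE_ASSUMED - BLOCK_SIZE);
}
```

When the test is false for both string pointers, an unaligned 16-byte
load from each is safe even if the string ends earlier, because the
load stays within a page that is known to be mapped.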
Performance is better in most cases than the existing
implementation from the Arm Optimized Routines repository.
See the Differential Revision for benchmark results.
Tested by: fuz (exprun)
Reviewed by: fuz, emaste
Sponsored by: Google LLC (GSoC 2024)
PR: 281175
Differential Revision: https://reviews.freebsd.org/D45839