amd64: implement strlen in assembly
Tested with the glibc test suite and a custom test, which can be found in
the review.
The C variant in libkern performs excessive branching to find the
zero byte instead of using the bsfq instruction. The same code
patched to use it is still slower than the routine implemented here,
because the compiler keeps neglecting to perform certain optimizations
(like using leaq).
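
For illustration only, the general word-at-a-time idea can be sketched in C
roughly as below. This is not the assembly routine from this commit nor the
libkern code: the name strlen_wordwise, the constants and the loop structure
are hypothetical, and the sketch assumes a little-endian 64-bit target and
GCC/Clang's __builtin_ctzll (which compilers typically lower to bsfq or
tzcntq on amd64). It also reads whole aligned words, which may touch bytes
past the terminator; that stays within one page but is outside what strict
C guarantees.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define	ONES	0x0101010101010101ULL
#define	HIGHS	0x8080808080808080ULL

/* Hypothetical helper, not the routine added by this commit. */
static size_t
strlen_wordwise(const char *str)
{
	const char *p = str;
	uint64_t word, mask;

	/* Go byte by byte until p is 8-byte aligned. */
	while (((uintptr_t)p & 7) != 0) {
		if (*p == '\0')
			return ((size_t)(p - str));
		p++;
	}

	for (;;) {
		/* Aligned 8-byte load; cannot cross a page boundary. */
		memcpy(&word, p, sizeof(word));
		/*
		 * Bit 7 of a byte in mask is set if that byte of word is
		 * zero. Bytes above the first zero byte may be flagged
		 * spuriously, but the lowest set bit is always exact.
		 */
		mask = (word - ONES) & ~word & HIGHS;
		if (mask != 0) {
			/* Index of the lowest set bit, divided by 8. */
			return ((size_t)(p - str) +
			    ((size_t)__builtin_ctzll(mask) >> 3));
		}
		p += sizeof(word);
	}
}
```

The single bit scan replaces the per-byte compare-and-branch chain that
would otherwise be needed to locate the zero byte inside the word.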
On top of that, the routine is a starting point for a copyinstr
which operates on words instead of bytes.
The previous attempt had an instance of swapped operands to
andq when dealing with the fully aligned case, which had the side effect
of breaking the code for certain corner cases. Noted by jrtc27.
Sample results:
$(perl -e "print 'A' x 3"):
stock:   211198039
patched: 338626619
asm:     465609618
$(perl -e "print 'A' x 100"):
stock:   83151997
patched: 98285919
asm:     120719888