amd64: depessimize bcmp for small buffers
Adapt assembly generated by clang for memcmp and use it for <= 64 sized
compares (which are the vast majority).
Sample result of doing stats on Broadwell (% of samples):
before: 4.0 kernel bcmp cache_lookup
after : 0.7 kernel bcmp cache_lookup
The routine is most definitely still not optimal. Anyone interested in
spending time improving it is welcome to take over.
Reviewed by: kib