amd64: patch ffsl to use the compiler builtin
This shortens fdalloc by over 60 bytes. Correctness verified by running both
variants at the same time and comparing the result of each call.
Note someone(tm) should make a pass at converting everything else feasible.