lib/libc/amd64/string/stpcpy.S: add baseline implementation
This commit adds a baseline implementation of stpcpy(3) for amd64.
It performs quite well in comparison to the previous scalar implementation
as well as agains bionic and glibc (though glibc is faster for very long
strings). Fiddle with the Makefile to also have strcpy(3) call into the
optimised stpcpy(3) code, fixing an oversight from D9841.
Sponsored by: The FreeBSD Foundation
Reviewed by: imp ngie emaste
Approved by: mjg kib
Fixes: D9841
Differential Revision: https://reviews.freebsd.org/D41349