A port of the amd64 implementation (see D41696), with slight changes to
account for differences in the instructions provided by aarch64.
ASIMD is not used, for the same reason the amd64 code avoids SIMD: it is just
not particularly suitable for this application.
Event: EuroBSDcon 2024
Please review to ensure that this function fulfills the required constant-time
properties. @andrew and @cpercival have agreed to do a joint review of the code
during EuroBSDcon 2024.
We considered adding a wrapper that would set the DIT (data-independent
timing) bit before running the code and restore its prior state afterwards.
After discussion with @imp and others, we decided to leave this setting to a
future portable function (i.e. the caller is responsible for enabling DIT mode
if desired).
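
For illustration only, the caller-side save/enable/restore pattern might look
roughly like the sketch below. Everything in it is an assumption and not part
of this change: dit_enable() and dit_restore() are hypothetical names for the
future portable function, dit_state is a plain variable standing in for
PSTATE.DIT so the sketch runs on any machine, and ct_equal() is a generic
stand-in for the routine under review.

```c
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for PSTATE.DIT; real code would read and write the register. */
static bool dit_state = false;

/* Hypothetical helper: enable DIT and return the state it had before. */
static bool
dit_enable(void)
{
	bool prev = dit_state;

	dit_state = true;
	return (prev);
}

/* Hypothetical helper: restore a state returned by dit_enable(). */
static void
dit_restore(bool prev)
{
	dit_state = prev;
}

/*
 * Generic stand-in for the routine under review: accumulate the XOR of
 * every byte pair so the loop does not exit early on a mismatch.
 * Returns 1 if the buffers are equal, 0 otherwise.
 */
static int
ct_equal(const void *a, const void *b, size_t len)
{
	const unsigned char *p = a, *q = b;
	unsigned char acc = 0;

	for (size_t i = 0; i < len; i++)
		acc |= p[i] ^ q[i];
	return (acc == 0);
}

/* Caller-side pattern: save and enable DIT, do the work, restore. */
static int
compare_with_dit(const void *a, const void *b, size_t len)
{
	bool prev = dit_enable();
	int equal = ct_equal(a, b, len);

	dit_restore(prev);
	return (equal);
}
```

Restoring the saved state instead of unconditionally clearing the bit keeps the
pattern safe when the caller already runs with DIT enabled.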
For benchmarks see D46757.