The reimplementation is a bit cleaner than the original code,
although it is also slightly slower. This shouldn't matter too
much as we have asm code for the major platforms.
Optimised implementations are provided for amd64 and aarch64.
For amd64, there are three implementations: one for the baseline
instruction set, one using ANDN from BMI1, and one using AVX-512
(though it's not really vectorised). Here's the performance:

11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz:

    pre      17.0s (602 MB/s)
    generic  18.8s (545 MB/s)
    scalar   13.4s (764 MB/s)
    bmi1     12.0s (853 MB/s)
    avx512   10.6s (966 MB/s)

ARM Cortex-X1C (Windows 2023 Dev Kit perf core):

    pre      35.2s (291 MB/s)
    generic  36.4s (281 MB/s)
    scalar   34.5s (297 MB/s)

ARM Cortex-A78C (Windows 2023 Dev Kit efficiency core):

    pre      46.8s (219 MB/s)
    generic  47.3s (216 MB/s)
    scalar   44.5s (230 MB/s)

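As a rough illustration of why ANDN is useful here (this is a sketch,
not the code from this changeset): MD5's round function G, as defined
in RFC 1321, contains an and-not term, which BMI1's ANDN can compute
in a single instruction instead of a separate NOT and AND. The helper
names md5_g_ref, andn32 and md5_g_andn below are hypothetical, and
whether a compiler actually emits ANDN for andn32 depends on the
compiler and target flags.

```c
#include <stdint.h>

/* MD5 round-2 function G as written in RFC 1321. */
static uint32_t
md5_g_ref(uint32_t x, uint32_t y, uint32_t z)
{
	return ((x & z) | (y & ~z));
}

/*
 * Hypothetical helper with the operand order of BMI1 ANDN:
 * the first operand is inverted, then ANDed with the second.
 */
static uint32_t
andn32(uint32_t a, uint32_t b)
{
	return (~a & b);
}

/* G with the (y & ~z) term expressed as one and-not operation. */
static uint32_t
md5_g_andn(uint32_t x, uint32_t y, uint32_t z)
{
	return ((x & z) | andn32(z, y));
}
```

Both variants compute the same value; the second merely exposes the
and-not as one operation, which matches what the ANDN instruction
provides.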
This changeset will have to be reworked when D34497 lands.
I'm not sure how to apply the SIMD code to all uses of MD5.
This changeset anticipates D34498 and no longer provides the
transform and block symbols.
Obtained from: https://github.com/animetosho/md5-optimisation/