Conceptually very similar to timingsafe_bcmp(), but with comparison
logic inspired by Elijah Stone's fancy memcmp.  A baseline (SSE)
implementation was omitted this time as I was not able to get it to
perform adequately.  The best I got was 8% faster than the scalar
version for long inputs, and it was slower for short inputs.

Performance is solid, at about 10x the throughput of the generic C
implementation overall (geomean):

os: FreeBSD
arch: amd64
cpu: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
              │ memcmp.pre.out │          memcmp.amd64.out          │
              │     sec/op     │   sec/op     vs base               │
TsMemcmpShort     189.11µ ± 0%   55.85µ ± 0%  -70.47% (p=0.000 n=20)
TsMemcmpMid       146.47µ ± 0%   10.14µ ± 0%  -93.08% (p=0.000 n=20)
TsMemcmpLong     130.642µ ± 0%   6.608µ ± 0%  -94.94% (p=0.000 n=20)
geomean            153.5µ        15.52µ       -89.89%

              │ memcmp.pre.out │            memcmp.amd64.out            │
              │      B/s       │      B/s       vs base                 │
TsMemcmpShort     630.4Mi ± 0%   2134.4Mi ± 0%   +238.60% (p=0.000 n=20)
TsMemcmpMid       813.9Mi ± 0%  11761.9Mi ± 0%  +1345.11% (p=0.000 n=20)
TsMemcmpLong      912.5Mi ± 0%  18039.2Mi ± 0%  +1876.92% (p=0.000 n=20)
geomean           776.5Mi         7.499Gi        +888.99%

As with the timingsafe_bcmp() implementation from D41673, care has been
taken to use only instructions that appear on Intel's list of
instructions with data operand independent timing.
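
For reference, the contract being implemented can be sketched in
portable C.  This is only an illustration of a branch-free ordered
comparison, not the amd64 assembly from this change and not the generic
C implementation that was benchmarked; the name ct_memcmp_sketch() is
made up for this example, and a compiler is free to turn such code into
something that is no longer constant time.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical sketch: compare len bytes without branching on their
 * contents.  Returns -1, 0 or 1 like memcmp(), decided by the first
 * differing byte.  The loop always runs over all len bytes.
 */
int
ct_memcmp_sketch(const void *b1, const void *b2, size_t len)
{
	const unsigned char *p1 = b1, *p2 = b2;
	uint32_t lt_seen = 0, gt_seen = 0, done = 0;	/* each 0 or 1 */
	size_t i;

	for (i = 0; i < len; i++) {
		uint32_t a = p1[i], b = p2[i];
		/* a - b wraps around and sets bit 31 exactly if a < b */
		uint32_t lt = (a - b) >> 31;
		/* b - a wraps around and sets bit 31 exactly if a > b */
		uint32_t gt = (b - a) >> 31;
		/* 1 as long as no differing byte has been seen yet */
		uint32_t open = done ^ 1;

		lt_seen |= lt & open;
		gt_seen |= gt & open;
		done |= lt | gt;
	}

	return ((int)gt_seen - (int)lt_seen);
}
```

Only the first mismatch sets lt_seen or gt_seen; later bytes are still
read but masked out, so the amount of work depends on len alone.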
Sponsored by: The FreeBSD Foundation