
lib/libc/aarch64/string: add ASIMD-enhanced timingsafe_bcmp implementation
ClosedPublic

Authored by fuz on Sep 23 2024, 9:54 AM.

Details

Summary

This adds an implementation of the timingsafe_bcmp function for AArch64
patterned after the implementation in D41673. Performance is quite nice
(measured on a Cortex-X1 core in a Windows Dev Kit 2023; the benchmark also
includes an upcoming timingsafe_memcmp implementation):

              │ memcmp.A.pre.out │            memcmp.A.post.out             │
              │       B/s        │      B/s        vs base                  │
MemcmpShort        1.564Gi ±  1%    1.573Gi ±  1%          ~ (p=0.355 n=20)
MemcmpMid          3.118Gi ±  0%    3.118Gi ±  1%          ~ (p=0.512 n=20)
MemcmpLong         12.80Gi ±  1%    12.82Gi ±  0%          ~ (p=0.157 n=20)
BcmpShort          1.541Gi ± 14%    1.322Gi ± 14%          ~ (p=0.134 n=20)
BcmpMid            2.908Gi ±  1%    2.899Gi ±  1%          ~ (p=0.602 n=20)
BcmpLong           12.93Gi ±  1%    12.99Gi ±  1%          ~ (p=0.327 n=20)
TsBcmpShort        516.2Mi ±  0%    970.7Mi ±  3%    +88.06% (p=0.000 n=20)
TsBcmpMid          1.456Gi ±  2%    3.340Gi ±  1%   +129.37% (p=0.000 n=20)
TsBcmpLong         4.523Gi ±  0%   13.200Gi ±  1%   +191.83% (p=0.000 n=20)
TsMemcmpShort      284.4Mi ±  1%    841.5Mi ±  1%   +195.94% (p=0.000 n=20)
TsMemcmpMid        336.8Mi ±  0%   3177.0Mi ±  2%   +843.22% (p=0.000 n=20)
TsMemcmpLong       351.7Mi ±  0%   7350.1Mi ±  1%  +1990.09% (p=0.000 n=20)
geomean            1.639Gi          3.401Gi         +107.46%

Please review to ensure that this function fulfills the required constant-time
properties. @andrew and @cperciva have agreed to do a joint review of the code
during EuroBSDcon 2024.
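For context on the contract being reviewed, the portable scalar fallback for timingsafe_bcmp is essentially the following (renamed here to timingsafe_bcmp_ref to avoid clashing with libc's declaration; a reference sketch of the semantics, not the ASIMD code under review). The key property is that every byte is examined and no control flow depends on the data:

```c
#include <stddef.h>

/*
 * Reference sketch: accumulate byte differences with OR so that the
 * running time is independent of where (or whether) the buffers differ.
 * Returns 0 iff the first n bytes of b1 and b2 are equal.
 */
int
timingsafe_bcmp_ref(const void *b1, const void *b2, size_t n)
{
	const unsigned char *p1 = b1, *p2 = b2;
	int ret = 0;

	for (; n > 0; n--)
		ret |= *p1++ ^ *p2++;
	return (ret != 0);
}
```

The ASIMD version must preserve exactly this behavior: no early exit on the first mismatch, unlike plain bcmp.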

We considered adding a wrapper that would set the DIT (data-independent
timing) bit on entry and restore its prior state on return, but after
discussion with @imp and others we decided to leave this to a future
portable function (i.e. the caller is responsible for enabling DIT mode if
desired).

Event: EuroBSDcon 2024

Test Plan

Passes our test suite; note that the test suite does not verify constant-time
properties.

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

fuz requested review of this revision. Sep 23 2024, 9:54 AM
fuz created this revision.

Looks good, simple and nice. :-)

This revision is now accepted and ready to land. Oct 21 2024, 6:10 PM

A review with hat security, i.e. by @andrew and @cperciva, is still needed to proceed here.

lib/libc/aarch64/string/timingsafe_bcmp.S
72

Question because I don't know arm64 assembly: ldr q0, ... loads 16 bytes into v0?

90

s/64/32/ ?

fuz marked 2 inline comments as done. Oct 26 2024, 12:30 AM

Replied to inline comments.

lib/libc/aarch64/string/timingsafe_bcmp.S
72

Correct: q0 and v0 name the same 128-bit register, viewed as a quadword scalar and as a vector, respectively.

90

At this point, we know that x2 > 32. We subtract 64 to subtract 32 bytes for the current iteration, as well as 32 bytes for the next iteration. If the size becomes negative (i.e. x2 < 64) we know that there aren't 32 bytes left to complete the next iteration and have to bail into the tail handling code. At the beginning of the next loop, x2 is always 32 lower than the number of remaining bytes.
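To make that bookkeeping concrete, here is a hypothetical scalar model of the counter logic (invented names, not the actual assembly): `rem` plays the role of x2, the up-front subtraction of 64 pays for the current 32-byte step and checks whether a full next step remains, and adding 64 back after the loop recovers the true number of bytes left for the tail handler:

```c
#include <stddef.h>

/*
 * Hypothetical model of the main-loop counter trick described above.
 * Precondition: rem > 32. Inside the loop, rem is always 32 less than
 * the true number of unprocessed bytes. Returns the byte count left
 * for the tail code; *iters receives the number of 32-byte iterations.
 */
static long
main_loop_model(long rem, long *iters)
{
	long n = 0;

	rem -= 64;		/* current 32-byte step + lookahead for the next */
	while (rem >= 0) {
		n++;		/* "process" 32 bytes here */
		rem -= 32;	/* reserve the next 32-byte step */
	}
	*iters = n;
	return rem + 64;	/* undo the lookahead: true remaining bytes */
}
```

For example, with 96 input bytes the model runs two iterations (64 bytes) and leaves 32 for the tail; with 33 bytes the loop body never runs and all 33 go to the tail.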

lib/libc/aarch64/string/timingsafe_bcmp.S
90

Ah, I didn't realize that x2 was aiming to become negative and the final addition is to bring x0/x1 back to pointing into the buffer.

93

Does ldp advance the pointer? If not, how are we moving through the buffer as opposed to checking the same initial 32 bytes repeatedly?

fuz marked 4 inline comments as done.Oct 30 2024, 10:47 PM
fuz added inline comments.
lib/libc/aarch64/string/timingsafe_bcmp.S
93

Yes, [reg], #imm is post-indexed addressing. The effective address is [reg] and after memory access, reg is incremented by #imm. ARM64 also supports pre-increment (writeback) using a syntax like [reg, #imm]!, but that isn't used here.
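Post-indexed addressing is roughly C's load-then-advance pattern. A hypothetical illustration (unrelated to the actual routine), where each load reads from the current pointer and the pointer then advances, just as `ldp q0, q1, [x0], #32` loads from x0 and afterwards adds 32 to x0:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Sum nchunks consecutive 8-byte chunks. The pointer update after each
 * access mirrors post-indexed addressing: load from the current address,
 * then increment the base register.
 */
static uint64_t
sum_chunks(const unsigned char *p, size_t nchunks)
{
	uint64_t acc = 0;

	while (nchunks-- > 0) {
		uint64_t v;
		memcpy(&v, p, sizeof(v));	/* load from current address */
		acc += v;
		p += sizeof(v);			/* then advance, like [reg], #imm */
	}
	return acc;
}
```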

Took me a while to understand how arm64 assembly language works, but I'm now convinced that this is correct.