
Use a builtin where possible in msun
Closed, Public

Authored by andrew on Nov 2 2021, 11:51 AM.

Details

Summary

Some of the functions in msun can be implemented using a compiler
builtin function to generate a small number of instructions. Implement
this support in fma, fmax, fmin, and sqrt on arm64.

Care must be taken as the builtin can be implemented as a function
call on some architectures that lack direct support. In these cases
we need to use the original code path.
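
A minimal sketch of the pattern described above, using fmax as the example. The macro SKETCH_USE_BUILTIN_FMAX, the function name sketch_fmax, and the choice of __aarch64__ as the only enabling condition are assumptions for illustration and are not taken from the committed change. The fallback branch is a simplified stand-in for the original msun code path.

```c
/*
 * Hypothetical guard: only enable the builtin where the target is
 * assumed to have a native instruction for it. On other targets the
 * compiler may lower the builtin to a call to fmax() itself.
 */
#if defined(__aarch64__)
#define SKETCH_USE_BUILTIN_FMAX 1
#endif

double
sketch_fmax(double x, double y)
{
#ifdef SKETCH_USE_BUILTIN_FMAX
	/* Expected to be expanded inline on this target. */
	return (__builtin_fmax(x, y));
#else
	/* Simplified fallback: a NaN argument yields the other argument. */
	if (x != x)
		return (y);
	if (y != y)
		return (x);
	/*
	 * Signed zeros are not ordered here; the C standard allows
	 * either zero to be returned when comparing -0.0 with +0.0.
	 */
	return (x > y ? x : y);
#endif
}
```

With this shape, adding another architecture would only mean extending the preprocessor condition, while every other target keeps using the C code path.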

Test Plan

All the fma/fmaf tests pass (with unrelated fmal failures removed)

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

What about fmal?

Seems that the common implementations don't generate well-optimized code on any platform. RISC-V would benefit similarly from a change identical to this.

On amd64 we can't use a builtin, but the existing fma common implementation generates:

0000000000000000 <fma>:
       0: 55                            pushq   %rbp
       1: 48 89 e5                      movq    %rsp, %rbp
       4: 41 56                         pushq   %r14
       6: 53                            pushq   %rbx
       7: 48 83 ec 60                   subq    $96, %rsp
       b: 48 8b 05 00 00 00 00          movq    (%rip), %rax  # 12 <fma+0x12>
      12: 48 89 45 e8                   movq    %rax, -24(%rbp)
      16: 66 0f 57 db                   xorpd   %xmm3, %xmm3
      1a: 66 0f 2e c3                   ucomisd %xmm3, %xmm0
      1e: 0f 9b c0                      setnp   %al
      21: 0f 94 c1                      sete    %cl
      24: 84 c1                         testb   %al, %cl
      26: 75 0e                         jne     0x36 <fma+0x36>

      ...

     495: 30 ca                         xorb    %cl, %dl
     497: 75 21                         jne     0x4ba <fma+0x4ba>
     499: 66 0f 50 cb                   movmskpd        %xmm3, %ecx
     49d: 48 c1 e1 3f                   shlq    $63, %rcx
     4a1: 48 31 c1                      xorq    %rax, %rcx
     4a4: 48 c1 e9 3e                   shrq    $62, %rcx
     4a8: 83 e1 fe                      andl    $-2, %ecx
     4ab: 48 f7 d9                      negq    %rcx
     4ae: 48 01 c8                      addq    %rcx, %rax
     4b1: 48 83 c0 01                   addq    $1, %rax
     4b5: 66 48 0f 6e c0                movq    %rax, %xmm0
     4ba: 44 89 f7                      movl    %r14d, %edi
     4bd: e8 00 00 00 00                callq   0x4c2 <fma+0x4c2>
     4c2: e9 ab fb ff ff                jmp     0x72 <fma+0x72>
     4c7: e8 00 00 00 00                callq   0x4cc <fma+0x4cc>

While the simpler return ((x * y) + z); generates:

0000000000000000 <fma>:
       0: 55                            pushq   %rbp
       1: 48 89 e5                      movq    %rsp, %rbp
       4: f2 0f 59 c1                   mulsd   %xmm1, %xmm0
       8: f2 0f 58 c2                   addsd   %xmm2, %xmm0
       c: 5d                            popq    %rbp
       d: c3                            retq

Obviously this is out of scope for this change, but the problem looks bigger than just arm64. Trying to outsmart the compiler no longer makes sense here.

Could we wrap __builtin_fma* in __has_builtin and always use the builtins if available?

> What about fmal?
>
> ...
>
> While the simpler return ((x * y) + z); generates:
>
> [same disassembly as above]
>
> Obviously this is out of scope for this change, but the problem looks bigger than just arm64. Trying to outsmart the compiler no longer makes sense here.

The fma man page says these functions should incur only one rounding error, whereas ((x * y) + z) can incur two.
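
To make the difference concrete, here is a small sketch in single precision. The function names are hypothetical, and sketch_one_rounding is not how msun implements fmaf: it computes in double, which happens to be exact for the inputs suggested below, and then rounds once to float. The volatile store is there so the compiler cannot fuse the multiply and add on targets that contract floating-point expressions.

```c
/* Multiply, round to float, then add and round again. */
float
sketch_two_roundings(float x, float y, float z)
{
	volatile float p = x * y;	/* first rounding happens here */

	return (p + z);			/* second rounding */
}

/*
 * Emulate a single rounding by computing in double. A float product
 * needs at most 48 significand bits, so the multiply is exact in
 * double; the add is also exact for the inputs used in the note
 * below, but it is NOT exact for arbitrary inputs.
 */
float
sketch_one_rounding(float x, float y, float z)
{
	return ((float)((double)x * (double)y + (double)z));
}
```

With x = 1.0f + 0x1p-13f, y = 1.0f - 0x1p-13f and z = -1.0f, the exact product is 1 - 2^-26, which rounds to 1.0f in float. The two-rounding version therefore returns 0, while the single-rounding version returns -2^-26.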

> Could we wrap __builtin_fma* in __has_builtin and always use the builtins if available?

The compiler is free to implement the builtin as a function call, e.g. on arm64 __builtin_fmal will result in a function call to fmal. I'm not sure whether the compiler offers a way to check if a given builtin will be lowered to a function call.
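
A sketch of why __has_builtin alone is not enough, shown for the double-precision fma rather than fmal. All SKETCH_* macros and sketch_* functions are hypothetical names. The target allowlist (__aarch64__, or __FMA__ on x86 when FMA instructions are enabled) is an assumption about where the builtin is expanded inline, not something __has_builtin reports.

```c
/* Older compilers may lack __has_builtin; treat every builtin as unknown. */
#ifndef __has_builtin
#define __has_builtin(x) 0
#endif

/* Only says that the compiler recognises the builtin's name. */
#if __has_builtin(__builtin_fma)
#define SKETCH_FMA_BUILTIN_KNOWN 1
#else
#define SKETCH_FMA_BUILTIN_KNOWN 0
#endif

/*
 * Separate, hand-maintained condition for "expected to be inline".
 * The list of targets is an assumption made for this sketch.
 */
#if SKETCH_FMA_BUILTIN_KNOWN && (defined(__aarch64__) || defined(__FMA__))
#define SKETCH_FMA_BUILTIN_INLINE 1
#else
#define SKETCH_FMA_BUILTIN_INLINE 0
#endif

int
sketch_fma_builtin_known(void)
{
	return (SKETCH_FMA_BUILTIN_KNOWN);
}

int
sketch_fma_builtin_inline(void)
{
	return (SKETCH_FMA_BUILTIN_INLINE);
}
```

On a default amd64 build the first function would typically return 1 and the second 0, which is the case where using the builtin inside fma() would just generate a call back into fma().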

  • Move the builtin to the original C file
  • Add more functions
andrew retitled this revision from "Use a builtin to implement the arm64 fma/fmaf" to "Use a builtin where possible in msun". Nov 3 2021, 1:20 PM
andrew edited the summary of this revision.

Fix the fminf/fmaxf checks

This seems like a reasonable approach to me and will make it simple to do the same for RISC-V or others.

Do we know that GCC has appropriate builtins also?

It does, although we need to build the sqrt functions with -fno-math-errno so that inputs less than -0.0 are handled correctly.

This revision was not accepted when it landed; it landed in state Needs Review. Nov 19 2021, 11:56 AM
This revision was automatically updated to reflect the committed changes.