Details

Reviewers

None

Group Reviewers

Commits

rGdd565d99901a: Use a builtin where possible in msun
rGb2969efae83f: Use a builtin where possible in msun
rGb2e843161dc3: Use a builtin where possible in msun

Summary

Some of the functions in msun can be implemented with a compiler
builtin that expands to a small number of instructions. Implement
this for fma, fmax, fmin, and sqrt on arm64.

Care must be taken because the compiler may lower the builtin to a
function call on architectures that lack direct support. In those
cases we need to keep the original code path.
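The pattern described in the summary might look roughly like the sketch below. The macro EXAMPLE_USE_BUILTIN_FMAX and the function name example_fmax are hypothetical, chosen for illustration and not taken from the diff, and the fallback is a simplified stand-in for the existing msun code path.

```c
#include <math.h>

/*
 * EXAMPLE_USE_BUILTIN_FMAX is a hypothetical opt-in macro: the build files
 * for an architecture would define it only where the compiler is known to
 * expand the builtin inline instead of emitting a call.
 */
double
example_fmax(double x, double y)
{
#ifdef EXAMPLE_USE_BUILTIN_FMAX
	return (__builtin_fmax(x, y));
#else
	/* Simplified portable fallback, standing in for the original code. */
	if (isnan(x))
		return (y);
	if (isnan(y))
		return (x);
	/* Prefer +0.0 over -0.0 when both arguments are zero. */
	if (x == 0.0 && y == 0.0)
		return (signbit(x) ? y : x);
	return (x > y ? x : y);
#endif
}
```

Making the builtin opt-in per architecture, rather than detecting it automatically, keeps the original code as the default wherever the builtin would just turn back into a library call.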

Test Plan

All the fma/fmaf tests pass (with the unrelated fmal failures removed).

Diff Detail

Repository

rG FreeBSD src repository

Lint

Lint Not Applicable

Unit

Tests Not Applicable

Event Timeline

andrew created this revision. Nov 2 2021, 11:51 AM

Herald added subscribers: emaste, imp. Nov 2 2021, 11:51 AM

andrew requested review of this revision. Nov 2 2021, 11:51 AM

Harbormaster completed remote builds in B42517: Diff 97857. Nov 2 2021, 11:51 AM

What about fmal?

It seems that the common implementations don't generate well-optimized code on any platform. RISC-V would benefit from a change identical to this one.

On amd64 we can't use a builtin, but the existing fma common implementation generates:

0000000000000000 <fma>:
       0: 55                            pushq   %rbp
       1: 48 89 e5                      movq    %rsp, %rbp
       4: 41 56                         pushq   %r14
       6: 53                            pushq   %rbx
       7: 48 83 ec 60                   subq    $96, %rsp
       b: 48 8b 05 00 00 00 00          movq    (%rip), %rax  # 12 <fma+0x12>
      12: 48 89 45 e8                   movq    %rax, -24(%rbp)
      16: 66 0f 57 db                   xorpd   %xmm3, %xmm3
      1a: 66 0f 2e c3                   ucomisd %xmm3, %xmm0
      1e: 0f 9b c0                      setnp   %al
      21: 0f 94 c1                      sete    %cl
      24: 84 c1                         testb   %al, %cl
      26: 75 0e                         jne     0x36 <fma+0x36>

      ...

     495: 30 ca                         xorb    %cl, %dl
     497: 75 21                         jne     0x4ba <fma+0x4ba>
     499: 66 0f 50 cb                   movmskpd        %xmm3, %ecx
     49d: 48 c1 e1 3f                   shlq    $63, %rcx
     4a1: 48 31 c1                      xorq    %rax, %rcx
     4a4: 48 c1 e9 3e                   shrq    $62, %rcx
     4a8: 83 e1 fe                      andl    $-2, %ecx
     4ab: 48 f7 d9                      negq    %rcx
     4ae: 48 01 c8                      addq    %rcx, %rax
     4b1: 48 83 c0 01                   addq    $1, %rax
     4b5: 66 48 0f 6e c0                movq    %rax, %xmm0
     4ba: 44 89 f7                      movl    %r14d, %edi
     4bd: e8 00 00 00 00                callq   0x4c2 <fma+0x4c2>
     4c2: e9 ab fb ff ff                jmp     0x72 <fma+0x72>
     4c7: e8 00 00 00 00                callq   0x4cc <fma+0x4cc>

While the simpler return ((x * y) + z); generates:

0000000000000000 <fma>:
       0: 55                            pushq   %rbp
       1: 48 89 e5                      movq    %rsp, %rbp
       4: f2 0f 59 c1                   mulsd   %xmm1, %xmm0
       8: f2 0f 58 c2                   addsd   %xmm2, %xmm0
       c: 5d                            popq    %rbp
       d: c3                            retq

Obviously this is out of scope for this change, but the problem looks bigger than just arm64. Trying to outsmart the compiler no longer makes sense here.

Could we wrap __builtin_fma* in __has_builtin and always use the builtins if available?

In D32801#740220, @mhorne wrote:

What about fmal?

...

While the simpler return ((x * y) + z); generates:
0000000000000000 <fma>:
       0: 55                            pushq   %rbp
       1: 48 89 e5                      movq    %rsp, %rbp
       4: f2 0f 59 c1                   mulsd   %xmm1, %xmm0
       8: f2 0f 58 c2                   addsd   %xmm2, %xmm0
       c: 5d                            popq    %rbp
       d: c3                            retq
Obviously this is out of scope for this change, but the problem looks bigger than just arm64. Trying to outsmart the compiler no longer makes sense here.

The man page for fma says the result should have only a single rounding error, whereas ((x * y) + z) could have two.

In D32801#740221, @emaste wrote:

Could we wrap __builtin_fma* in __has_builtin and always use the builtins if available?

The compiler is free to implement the builtin as a function call, e.g. on arm64 __builtin_fmal results in a call to fmal. I'm not sure whether the compiler provides anything we can check to tell if a builtin will become a function call.
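For reference, a __has_builtin check along the lines suggested could be sketched as below; the EXAMPLE_HAVE_BUILTIN_FMAL macro and the helper function are hypothetical names. The check only reports whether the compiler recognizes the builtin, not whether a call to it will be expanded inline, so it does not by itself address the concern above.

```c
/* Compilers without __has_builtin get a conservative "no". */
#ifndef __has_builtin
#define __has_builtin(x) 0
#endif

#if __has_builtin(__builtin_fmal)
#define EXAMPLE_HAVE_BUILTIN_FMAL 1
#else
#define EXAMPLE_HAVE_BUILTIN_FMAL 0
#endif

/* Returns 1 if the compiler knows __builtin_fmal, 0 otherwise. */
int
example_have_builtin_fmal(void)
{
	return (EXAMPLE_HAVE_BUILTIN_FMAL);
}
```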

Move the builtin to the original C file
Add more functions

Harbormaster completed remote builds in B42541: Diff 97925. Nov 3 2021, 1:19 PM

andrew retitled this revision from "Use a builtin to implement the arm64 fma/fmaf" to "Use a builtin where possible in msun". Nov 3 2021, 1:20 PM

andrew edited the summary of this revision.

Fix the fminf/fmaxf checks

Harbormaster completed remote builds in B42542: Diff 97926. Nov 3 2021, 1:24 PM

This seems like a reasonable approach to me and will make it simple to do the same for RISC-V or others.

Do we know that GCC has appropriate builtins also?

It does, although we need to build the sqrt functions with -fno-math-errno to handle the < -0.0 case correctly.

Set -fno-math-errno

Harbormaster completed remote builds in B42562: Diff 97951. Nov 3 2021, 3:32 PM

ping

This revision was not accepted when it landed; it landed in state Needs Review. Nov 19 2021, 11:56 AM

Closed by commit rGb2e843161dc3: Use a builtin where possible in msun (authored by andrew).

This revision was automatically updated to reflect the committed changes.

andrew added a commit: rGb2e843161dc3: Use a builtin where possible in msun.

andrew added a commit: rGb2969efae83f: Use a builtin where possible in msun. Dec 14 2021, 11:08 AM

dim added a commit: rGdd565d99901a: Use a builtin where possible in msun. Aug 13 2023, 8:43 AM

Use a builtin where possible in msun
Closed, Public

Revision Contents
Changeset List

Diff 98749

lib/msun/Makefile

lib/msun/aarch64/Makefile.inc

lib/msun/src/e_sqrt.c

lib/msun/src/e_sqrtf.c

lib/msun/src/s_fma.c

lib/msun/src/s_fmaf.c

lib/msun/src/s_fmax.c

lib/msun/src/s_fmaxf.c

lib/msun/src/s_fmin.c

lib/msun/src/s_fminf.c
