Page MenuHomeFreeBSD

Consistently provide ffs/fls using builtins
ClosedPublic

Authored by mhorne on Jun 21 2023, 4:39 PM.
Tags
None
Referenced Files
Unknown Object (File)
Fri, Jan 24, 5:49 PM
Unknown Object (File)
Fri, Jan 24, 5:18 PM
Unknown Object (File)
Sat, Jan 18, 5:57 PM
Unknown Object (File)
Tue, Jan 14, 3:53 AM
Unknown Object (File)
Fri, Jan 10, 1:18 AM
Unknown Object (File)
Mon, Jan 6, 6:09 PM
Unknown Object (File)
Dec 14 2024, 10:17 PM
Unknown Object (File)
Nov 27 2024, 1:20 AM
Subscribers

Details

Summary

Use of compiler builtin ffs/ctz functions will result in optimized
instruction sequences when possible, and fall back to calling a function
provided by the compiler runtime library. We have slowly shifted our
platforms to take advantage of these builtins in 60645781d613 (arm64),
1c76d3a9fbef (arm), 9e319462a03a (powerpc, partial), D40594 (riscv,
proposed).

Some platforms still rely on the libkern implementations of these
functions provided by libkern, namely riscv, powerpc (ffs*, flsll), and
i386 (ffsll and flsll). These routines are slow, as they perform a
linear search for the bit in question. Even on platforms lacking
dedicated bit-search instructions, such as riscv, the compiler library
will provide better-optimized routines, e.g. by using binary search.

Consolidate the existing builtin implementations in sys/libkern.h, but
with the ability for a specific machine-dependent implementation to be
provided for each function by machine/cpufunc.h. amd64 and i386 make use
of this to provide fls* using bsrl/bsrq instructions.

One wart in all of this is the existing HAVE_INLINE_F*** macros, which
we use in a few places to conditionally avoid the slow libkern routines.
These aren't easily removed in one commit. For now, provide these
defines unconditionally, but marked for removal after subsequent
cleanup.

Removal of the now unused libkern routines will follow in a separate
commit.

Test Plan

Passes tinderbox.

I have not done any performance testing. Speculatively, the affected platforms (riscv, powerpc, i386) should perform equivalently or better in the few affected cases (grep for HAVE_INLINE_F).

Kernel stats for tier-1 platforms:

amd64 before:
      text      data       bss        dec         hex   filename
  20539806   1616925   4668928   26825659   0x19953bb   kernel.full

amd64 after:
      text      data       bss        dec         hex   filename
  20539726   1616925   4668928   26825579   0x199536b   kernel.full

  Small decrease in .text (Not bad, but why?).

-----------------------------------------------------------------------

arm64 before:
      text      data       bss        dec        hex   filename
  12163601   1390320   2895872   16449793   0xfb0101   kernel.full

arm64 after:
      text      data       bss        dec        hex   filename
  12163601   1390320   2895872   16449793   0xfb0101   kernel.full

  No change (expected).

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

This revision is now accepted and ready to land.Jun 21 2023, 5:23 PM

the size thing needs to get explained first

sys/amd64/include/cpufunc.h
129–130

why are any of these still here

When might this change get committed?

In D40698#926406, @mjg wrote:

the size thing needs to get explained first

It seems to be the result of slightly different codegen before and after the change. I tried to produce a minimal example in godbolt, but could not. Switching from the bare macro to inline function (with casts) must make some small semantic difference to the compiler, but not in all scenarios.

It is interesting that after switching from clang 15 to 16, the .text size actually grew slightly after this change. In any case it is a negligible impact either way.

clang 15 before:
      text      data       bss        dec         hex   filename
  20558716   1620945   4664960   26844621   0x1999dcd   kernel.full

clang 15 after:
      text      data       bss        dec         hex   filename
  20558652   1620945   4664960   26844557   0x1999d8d   kernel.full (-91 bytes)

clang 16 before:
      text      data       bss        dec         hex   filename
  20512708   1621025   4664832   26798565   0x198e9e5   kernel.full

clang 16 after:
      text      data       bss        dec         hex   filename
  20512724   1621025   4664832   26798581   0x198e9f5   kernel.full (+16 bytes)

When might this change get committed?

I aim to do that this week. I am about to update the review, hold tight.

Remove the hand-written implementations from amd64/i386.

This revision now requires review to proceed.Jul 4 2023, 6:29 PM
This revision is now accepted and ready to land.Jul 4 2023, 7:16 PM
This revision was automatically updated to reflect the committed changes.