
(draft) i386 atomics: Implement 64-bit loading with SSE2
Needs Review · Public

Authored by olce on Dec 12 2024, 10:25 PM.
Tags
None

Details

Reviewers
markj
kib
emaste
Summary

For discussion. There are two parts in the change: kernel and userland.

The kernel part is not functionally necessary; it is a proof of concept of using SSE2, which may or may not be faster than the current CMPXCHG8B implementation but does not share its drawback of faulting on read-only mappings. It has been tested in a VM.

The userland part is more straightforward (for reasons indicated in the commit message). The produced assembly code has been verified to be correct. I did not create an atomic_load_acq_64() wrapper yet, since no pure userland implementation is possible for processors older than the Pentium/i586. The intent is to rename this userland atomic_load_acq_64_sse2() to atomic_load_acq_64() once i386 support is dropped from the kernel, at which point we are guaranteed to run on at least SSE2-capable processors.

Current commit message:
The current (kernel) implementation, for i586 processors or higher, uses
CMPXCHG8B (with LOCK prefix) which also may write to the destination
(the same value as was read). While this write is invisible to the
C abstract machine, it causes a #GP(0) on read-only mappings, which is
very surprising for an atomic load operation (no other architecture has
this peculiarity).

For i586 processors, an alternative could be to use FILD, but that requires saving the FPU state. Given that we are very likely to stop supporting the i386 architecture in the kernel in 15.0, and that in-kernel uses have been adapted to avoid the above problem, we did not change atomic_load_acq_64_i586().

We still propose an alternative for processors supporting SSE2, using
MOVQ with XMM registers, as the new atomic_load_acq_64_sse2(). It needs
to clear CR0_TS but avoids saving the full FPU/SSE/AVX state by
temporarily storing into memory the original content of the XMM scratch
register it requires. Provided fiddling with CR0 is not too costly,
this variant, especially on older processors, should be visibly faster
than using CMPXCHG8B with LOCK prefix.

Plug atomic_load_acq_64_sse2() into atomic_load_acq_64 when CPUID_SSE2
is set. Add 'npx.h' to the list of i386 headers copied on amd64, to
avoid a compilation error in 32-bit procstat's ZFS support (which needs
to include headers with _KERNEL defined).

As we are going to continue supporting i386 userland on amd64 and
because all amd64 processors support SSE2, it is now possible to provide
a userland implementation for atomic_load_acq_64()/atomic_load_64(). It
is much simpler than the kernel variant, since saving the FPU/SSE/AVX
state is taken care of by the kernel as needed, but it is also likely
slower if the calling thread does not already use the FPU/SSE/AVX
instruction sets (or just SSE, if XSAVEOPT can track modifications to
that state alone).

Diff Detail

Repository
rG FreeBSD src repository

Event Timeline

olce requested review of this revision. Dec 12 2024, 10:25 PM
olce retitled this revision from i386 atomics: Implement 64-bit loading with SSE2 to (draft) i386 atomics: Implement 64-bit loading with SSE2. Dec 12 2024, 10:33 PM
olce edited the summary of this revision.

I believe this is done in the wrong order. The i386 kernel should first be removed from the tree; then the userspace portion of this review can be pushed, together with the removal of the non-SSE load implementation (and a similar implementation for store should be done as well).

Since you already landed the const change, I suggest still removing the kernel part, replacing it with KASSERT(0, ("not implemented")); for now.