This makes use of the alternate address space support in both GCC and
clang to access per-CPU data as accesses relative to GS:. The
original motivation for this is that it quiets verbose warnings from
GCC 12. However, this version is also much easier to read and
allows the compiler to generate better code (e.g. the compiler can
use a GS: memory operand directly in other instructions such as IMUL
and CMP rather than always MOVing to a temporary register).
The one caveat is that the current approach is very inefficient at -O0
since the compiler expects to load the 0 base offset from a global
variable instead of assuming it is 0 (even with the const).