Page MenuHomeFreeBSD

Wrap signal trampoline into vdso
ClosedPublic

Authored by kib on Nov 12 2021, 2:23 PM.
Tags
None
Referenced Files
F107180970: D32960.id99403.diff
Sat, Jan 11, 8:28 AM
Unknown Object (File)
Wed, Jan 8, 9:37 PM
Unknown Object (File)
Wed, Jan 8, 6:36 PM
Unknown Object (File)
Fri, Jan 3, 7:26 AM
Unknown Object (File)
Thu, Jan 2, 12:25 AM
Unknown Object (File)
Tue, Dec 31, 10:23 PM
Unknown Object (File)
Sun, Dec 29, 12:21 AM
Unknown Object (File)
Thu, Dec 26, 8:20 PM

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
sys/tools/amd64_vdso.sh
52 ↗(On Diff #98683)

wc -c < elf-vdso.so.1 won't print the file name, if you care about having to have the awk

sys/tools/vdso_wrap.S
56 ↗(On Diff #98683)

Can just make this a normal variable rather than the slightly gross "address is the size" approach that is mandated by objcopy (which doesn't work on RISC-V without -fPIE, and the same might be true for AArch64, due to using PC-relative addressing that assumes everything is +/- 1 GiB). Means the SV_DSO_SIG check when reading sv_szsigcode goes away and it can really be an extern int _binary_elf_vdso_so_1_size. Another benefit of the template asm approach that didn't occur to me.

sys/kern/kern_sharedpage.c
324 ↗(On Diff #98683)

I wouldn't be surprised if some of the Linuxulator VDSO code can then reuse this, but that's someone else's problem to figure out :)

sys/tools/vdso_wrap.S
56 ↗(On Diff #98683)

IMO the sigcodesz thing is gross, not the address as value. The later is the standard ELF way to get symbol value. I do not believe that PC-relative addressing is relevant there, all arches must be able to take address of something.

sys/tools/vdso_wrap.S
56 ↗(On Diff #98683)

The point is the code model explicitly assumes that code and data all fit into a 2GiB region, and thus a 32-bit PC-relative offset is sufficient to access any variable. Since the kernel is linked at a high address (but not high enough to be a small negative) and _binary_elf_vdso_so_1_size is between 0 and 2048, it is not within 2^32 bytes of PC, and thus you will get relocation out of range errors when linking on such architectures.

sys/tools/vdso_wrap.S
56 ↗(On Diff #98683)

(Weak symbols are special cased to always use the GOT on AArch64 when building with such a code model, so that weak undef symbols can resolve to 0 and not cause range errors)

sys/tools/vdso_wrap.S
56 ↗(On Diff #98683)

PC is irrelevant there. It is actually resolved at the static linking time, there is no code involved.

FWIW, amd64 is also linked with 'kernel memory model', where code is assumed to be in [TOP-2G, TOP) region (TOP=0xfffff...ffff). Are you sure that amd64 and riscv kernel models assume that both text _and_ data are in 2G region? It would be too limiting, I expect that only text has this micro-optimization restriction.

sys/tools/vdso_wrap.S
56 ↗(On Diff #98683)

Oh good point, I wasn't thinking about the fact that it's in a static initialiser so just a normal R_32/64 relocation.

Yes, that's the code model we compile with for AArch64 and RISC-V (linked at 0xffff000000000000 and 0xffffffc000000000 respectively, which are the lowest higher half addresses that are guaranteed to be supported by their MMUs), it covers both code and data and is the "normal" one that's not "assume everything is in the first 2 GiB". There is no kernel code model outside of x86. On AArch64 you get tiny (everything is within 1 MiB), small (everything is within 4 GiB, not sure why the GCC documentation sys 4, seems like it should be 2, but this is the default and the one I'm talking about) and large which is horribly inefficient (64-bit immediate materialisation everywhere). On RISC-V you get medlow (everything fits in a signed 32-bit integer) and medany (everything is within 2 GiB), there is currently no large (but -fPIC/-fPIE gets you a GOT which lets data grow beyond that, so long as text and the GOT can occupy a single 2 GiB region).

Add knobs to disable vdso prelinking.
Fix bug with missed vdso causing NULL deref.

libexec/rtld-elf/rtld.c
2876 ↗(On Diff #98989)

We could default stack_flags to PF_R | PF_W for preload objs (and we could avoid PT_GNU_STACK in the VDSO if desired)

(i.e., there's no point in introducing a new X-by-default stack case that requires PT_GNU_STACK to turn it off. If in the future we do need a RWX stack specified by a preload obj for some reason we can use PT_GNU_STACK to set it)

2904 ↗(On Diff #98989)

Is there any reason we'd want to check this and fail if we find any? I suspect no.

kib marked 13 inline comments as done.Dec 1 2021, 12:27 AM
kib added inline comments.
libexec/rtld-elf/rtld.c
2876 ↗(On Diff #98989)

In fact, it so happens that default stack mode for kpreload is 0. Right after I read your comment, I wanted to change it to either PF_R|_W or PF_R|_W|_X, but now I think this is actually not bad. I added a comment to make this more explicit.

2904 ↗(On Diff #98989)

Yes, we want to check this, and updated patch contains the check, but not there. But IMO doing it in rtld is too late and either check or the need for relocation bricks the system.

I added the checks to the vdso build scripts.

sys/tools/amd64_vdso.sh
52 ↗(On Diff #98683)

I prefer awk

kib marked 3 inline comments as done.

Assert that there is no relocs in vdso, at the build stage.
Add some comments in load_kpreload(), about stack_mode and relocs.

So will we need to update qemu-bsd-user for this new signaling method?

In D32960#750447, @imp wrote:

So will we need to update qemu-bsd-user for this new signaling method?

There is no new signalling method, kernel still sets up trampoline, and deliver signal by pushing exactly the same sigframe, and jumping to the same signal trampoline code fragment.

What this change does, it the wrapping of the signal trampoline into real shared object. New rtld can handle both new and old kernels, for old rtld everything looks like an old kernel. If mapping the trampoline is the duty of qemu, then yes, eventually you would want to also provide dso instead of bare signal trampoline. To definitively answer your question, I need to know more how u-qemu handles signal delivery.

Also note that vdso is only implemented for amd64 kernel for now.

Fix compilation without COMPAT_43.

sys/conf/kern.pre.mk
49 ↗(On Diff #99360)

Do we need to add elfdump to _elftctools

kib marked an inline comment as done.

Add elfdump to _elftctools

sys/conf/vdso_amd64.ldscript
38–39 ↗(On Diff #99360)

Wrong flags for PT_DYNAMIC and PT_GNU_EH_FRAME

Type           Offset             VirtAddr           PhysAddr
               FileSiz            MemSiz              Flg    Align
LOAD           0x0000000000000000 0x0000000000000000 0x0000000000000000
               0x00000000000002e6 0x00000000000002e6  R E    0x10
DYNAMIC        0x0000000000000218 0x0000000000000218 0x0000000000000218
               0x00000000000000b0 0x00000000000000b0  R E    0x8
GNU_EH_FRAME   0x0000000000000194 0x0000000000000194 0x0000000000000194
               0x0000000000000014 0x0000000000000014  R E    0x4
sys/conf/vdso_amd64.ldscript
38–39 ↗(On Diff #99360)

Could you please clarify? What specifically is wrong?

sys/conf/vdso_amd64.ldscript
38–39 ↗(On Diff #99360)

Oh, nvm, I am mistaken. Maybe worth a (reminder) comment about everything being in the same page.

kib marked 2 inline comments as done.Dec 2 2021, 7:27 PM
kib added inline comments.
sys/conf/vdso_amd64.ldscript
38–39 ↗(On Diff #99360)

There is some discussion in the sys/tools/amd64_vdso.sh comment. Are you proposing to expand the text? It would become repeating the linker script.

emaste added inline comments.
sys/amd64/amd64/sigtramp.S
47 ↗(On Diff #99370)

I did not verify these

sys/amd64/ia32/ia32_sigtramp.S
51 ↗(On Diff #99370)

did not verify these

sys/kern/imgact_elf.c
133 ↗(On Diff #99370)

Any reason not to make this 1 everywhere (in the near future)?

This revision is now accepted and ready to land.Dec 2 2021, 8:39 PM

Have you verified that elfdump can be built as a bootstrap tool from non-FreeBSD hosts? You should be able to find out by pushing the commit to a GitHub fork of freebsd-src and letting the on-push action run, which will test both Linux and macOS hosts can build a kernel.

Have you verified that elfdump can be built as a bootstrap tool from non-FreeBSD hosts? You should be able to find out by pushing the commit to a GitHub fork of freebsd-src and letting the on-push action run, which will test both Linux and macOS hosts can build a kernel.

I included this change in some WIP that I pushed to GitHub. MacOS failed in bootstrap bmake which does not appear related. The Ubuntu Clang 9 and Clang 10 jobs were successful, although they hide build output so I cannot confirm that elfdump was indeed built/used.

--------------------------------------------------------------
>>> Kernel build for GENERIC started on Thu Dec  2 21:53:52 UTC 2021
--------------------------------------------------------------
===> GENERIC

--------------------------------------------------------------
>>> stage 1: configuring the kernel
--------------------------------------------------------------
Kernel build directory is /home/runner/work/freebsd/build/home/runner/work/freebsd/freebsd/amd64.amd64/sys/GENERIC
Don't forget to do ``make cleandepend && make depend''
0.11user 0.05system 0:00.16elapsed 101%CPU (0avgtext+0avgdata 47752maxresident)k
0inputs+8outputs (0major+21039minor)pagefaults 0swaps

--------------------------------------------------------------
>>> stage 2.3: build tools
--------------------------------------------------------------
0.10user 0.04system 0:00.14elapsed 104%CPU (0avgtext+0avgdata 47548maxresident)k
0inputs+8outputs (0major+15217minor)pagefaults 0swaps

--------------------------------------------------------------
>>> stage 3.1: building everything
--------------------------------------------------------------
linking kernel.full
ctfmerge -L VERSION -g -o kernel.full ...
      text      data       bss        dec         hex   filename
  23356772   1857165   4428672   29642609   0x1c44f71   kernel.full
713.10user 53.08system 6:40.14elapsed 191%CPU (0avgtext+0avgdata 816252maxresident)k
59752inputs+2278192outputs (44098major+15938065minor)pagefaults 0swaps
--------------------------------------------------------------
>>> Kernel build for GENERIC completed on Thu Dec  2 22:00:32 UTC 2021
--------------------------------------------------------------
>>> Kernel(s)  GENERIC built in 400 seconds, ncpu: 2, make -j2
--------------------------------------------------------------

Have you verified that elfdump can be built as a bootstrap tool from non-FreeBSD hosts? You should be able to find out by pushing the commit to a GitHub fork of freebsd-src and letting the on-push action run, which will test both Linux and macOS hosts can build a kernel.

I included this change in some WIP that I pushed to GitHub. MacOS failed in bootstrap bmake which does not appear related. The Ubuntu Clang 9 and Clang 10 jobs were successful, although they hide build output so I cannot confirm that elfdump was indeed built/used.

--------------------------------------------------------------
>>> Kernel build for GENERIC started on Thu Dec  2 21:53:52 UTC 2021
--------------------------------------------------------------
===> GENERIC

--------------------------------------------------------------
>>> stage 1: configuring the kernel
--------------------------------------------------------------
Kernel build directory is /home/runner/work/freebsd/build/home/runner/work/freebsd/freebsd/amd64.amd64/sys/GENERIC
Don't forget to do ``make cleandepend && make depend''
0.11user 0.05system 0:00.16elapsed 101%CPU (0avgtext+0avgdata 47752maxresident)k
0inputs+8outputs (0major+21039minor)pagefaults 0swaps

--------------------------------------------------------------
>>> stage 2.3: build tools
--------------------------------------------------------------
0.10user 0.04system 0:00.14elapsed 104%CPU (0avgtext+0avgdata 47548maxresident)k
0inputs+8outputs (0major+15217minor)pagefaults 0swaps

--------------------------------------------------------------
>>> stage 3.1: building everything
--------------------------------------------------------------
linking kernel.full
ctfmerge -L VERSION -g -o kernel.full ...
      text      data       bss        dec         hex   filename
  23356772   1857165   4428672   29642609   0x1c44f71   kernel.full
713.10user 53.08system 6:40.14elapsed 191%CPU (0avgtext+0avgdata 816252maxresident)k
59752inputs+2278192outputs (44098major+15938065minor)pagefaults 0swaps
--------------------------------------------------------------
>>> Kernel build for GENERIC completed on Thu Dec  2 22:00:32 UTC 2021
--------------------------------------------------------------
>>> Kernel(s)  GENERIC built in 400 seconds, ncpu: 2, make -j2
--------------------------------------------------------------

Thanks. The macOS job on freebsd-src indeed looks slightly temperamental, has the occasional bmake test suite failure, not sure what's up with that. If it builds on Linux then it's probably fine and we can deal with a macOS failure post-commit on the off chance there's something special there.

kib marked 4 inline comments as done.Dec 2 2021, 10:40 PM
kib added inline comments.
sys/amd64/amd64/sigtramp.S
47 ↗(On Diff #99370)

You mean, you did not checked the layout of the frame, or you did not verified that unwinding around signal frame work?

First fact is more or less about matching register name and corresponding UC_XXX offset name. Second is tough, indeed. For instance, gdb, lldb, non-gnu libunwind match signal handler by comparing code sequence with the known one, so this change seems to be nop for them. On the other hand, in-tree unwinder from libgcc_s was able to catch an exception thrown from a signal handler context.

sys/kern/imgact_elf.c
133 ↗(On Diff #99370)

I definitely do not have plans to work on it. It is up to the arch maintainer to do the necessary copy/paste and testing.

In principle, I will prepare the arch patch if somebody going to test that.

kib marked 2 inline comments as done.

Fixes for a.out and freebsd4 signal delivery.

This revision now requires review to proceed.Dec 3 2021, 10:56 AM
emaste added inline comments.
sys/kern/imgact_aout.c
148

I guess this comment is not really applicable anymore (we're not going to add new arch a.out support)

sys/kern/imgact_elf.c
133 ↗(On Diff #99370)

I just mean that for the kernel to provide AT_KPRELOAD __elfN(vdso) must be nonzero and sv_vdso_base must also be nonzero so there's no problem setting it to 1.

I suppose there is some documentary value in having it report 0 to the user if not supported by the arch.

This revision is now accepted and ready to land.Dec 3 2021, 2:44 PM
kib marked 2 inline comments as done.Dec 3 2021, 2:58 PM
kib added inline comments.
sys/kern/imgact_aout.c
148

It is same as it was in the time the line was written. I want something to plug if people add imgact_aout.c to their files.<arch>.

If you suggest to change the message to something else, I do not object.

sys/kern/imgact_elf.c
133 ↗(On Diff #99370)

Might be, it is useful to hide the sysctl at all from not supported arches. I will do this.

sys/tools/vdso_wrap.S
4 ↗(On Diff #99394)

@jhb this was derived from a header carrying a ECATS sponsorship note. The This software was developed by SRI International and the University of Cambridge Computer Laboratory statement seems confusing in this context, should we remove that?

sys/kern/imgact_aout.c
148

Yeah just something I noticed in passing and not related to your work.

sys/kern/imgact_elf.c
133 ↗(On Diff #99370)

OK, I think that is better.

kib marked 2 inline comments as done.

Only expose kern.elfXX.vdso on amd64.

This revision now requires review to proceed.Dec 3 2021, 3:01 PM
sys/amd64/amd64/sigtramp.S
63 ↗(On Diff #99403)

Are there CFI annotations for segment registers, fsbase, gsbase, and eflags? (Those are all present in ucontext_t as well and a recent change I have for GDB pulls all of those out of signal frames.)

sys/amd64/ia32/ia32_sigtramp.S
58 ↗(On Diff #99403)

Similar question about segment registers, fsbase/gsbase (GDB uses those for TLS), and flags.

sys/tools/vdso_wrap.S
36 ↗(On Diff #99403)

No need for this if you aren't planning to merge to 12.

4 ↗(On Diff #99394)

From crtbrand.S or the like? (Or maybe the firmware stub?) There's no useful shared content between those files and this. I would just give this a FF copyright from the get go.

sys/tools/vdso_wrap.S
4 ↗(On Diff #99394)

From the firmware stub

kib marked 8 inline comments as done.Dec 4 2021, 4:15 AM
kib added inline comments.
sys/amd64/amd64/sigtramp.S
63 ↗(On Diff #99403)

All listed registers have DWARF register mapping number assigned to them in psABI, so there is a way to specify that in final object. But trying to actually do that resulted in the mess.

For instance, i386 psABI specifies 93 as %fs.base, but binutils interpret it as %k0. Clang IAS does know about all segment names, but objdump from binutils report all encoded registers as invalid then. And so on. The only semi-consistent pseudo-registers seems to be fs/gs.base on x86_64, but even there clang managed to fail, it does not recognize the registers name, and I have to use numeric value.

On the other hand, I am sure that %eflags are not useful for unwinders, and doubt that segment registers or fs/gsbase are useful.

I think really doing that would require some iterations with toolchain maintainers, for now I commented out cfi instructions for segments/flags everywhere, and bases on 32bit.

sys/tools/vdso_wrap.S
4 ↗(On Diff #99394)

The code was copied from firmware.S, and the only changes made were the variables rename. I did removed that copyright as suggested, but feel free to request a revert.

kib marked 2 inline comments as done.

Add cfi offsets for amd64 fs/gs bases. Most other registers are also annotated but neutralized due to toolchain mess.
Change error message for a.out !ia32 arch.
Remove ECATS copyright from vdso_wrap.S.

Minor tweaks for 64bit vdso cfi annotations. Add a comment explaining toolchain mess.

sys/amd64/amd64/sigtramp.S
69 ↗(On Diff #99434)

Fixed locally.

sys/amd64/amd64/sigtramp.S
63 ↗(On Diff #99403)

This is mostly for debuggers and when inspecting in core dumps. That said, GDB will keep using its own unwinder that will unwind all those registers ok even with out CFI (so long as we don't change the sigcode so the sigframe unwinder keeps matching).

%eflags is possibly useful when debugging a core dump where a thread had caught a signal and you want to see the register state at the time of the faulting instruction, but that is definitely an obscure case.

GDB uses fsbase/gsbase for TLS on amd64/i386, but those likely don't change between the signal handler and the signal frame so missing those is less important.

My other interest is around using CFI for kernel exception frames as that removes the need for custom unwinders in kernel debuggers, but when I last tried that I found I couldn't annotate all of the registers (for some reason I couldn't annotate %eflags at the time. and probably it is what you ran into which is that there is a DWARF register number but IAS doesn't support it).