
Add rseq(2)
Needs Review (Public)

Authored by kib on Oct 15 2021, 12:19 PM.
Tags
None

Details

Summary

Man page used as a reference: https://kib.kiev.ua/kib/rseq.pdf

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

kib requested review of this revision. Oct 15 2021, 12:19 PM

Do we need a man page for this?

In D32505#733432, @imp wrote:

Do we need a man page for this?

Eventually, for sure. I cannot write the manual right now. The Linux man page is linked in the review summary.

@brooks, can you take a look at this from a CheriABI perspective? The current version uses the Linux convention of assuming that uint64_t is a sensible type for memory addresses. We probably can't do that in CheriABI because it would allow you to register a jump address that would make other code jump to your handler (or, conversely, it may prevent you from setting a range in some situations where there's a PCC change in between the rseq setup and the destination).

We have a couple of additional use cases for a slightly more general mechanism that it might be interesting to consider in the implementation of this:

  • For allocator hardening, we'd like to ensure that signals delivered while executing in the allocator don't expose internal allocator state, so we'd like a mechanism somewhat closer to Windows structured exception handling that allows signals to be redirected based on a specific IP range.
  • For userspace cooperative threading (as we're doing in Verona), we wanted to use rseq on Linux for dynamically scaling threads. Ideally, we'd like a lightweight mechanism that lets us jump back to a specific address (with enough state stored on the stack to resume) if we have come back from preemption, so that if we have a long-running task we can create a new thread to handle other work but in the steady state we can be in one-thread-per-core world.
  • For allocator hardening, we'd like to ensure that signals delivered while executing in the allocator don't expose internal allocator state, so we'd like a mechanism somewhat closer to Windows structured exception handling that allows signals to be redirected based on a specific IP range.

Why should this be done in the kernel? Moreover, I believe that Windows does not do it in the kernel either. Last time I looked (I admit it was a very long time ago), they had a single upcall from the kernel to userspace for all that stuff. It is usermode's duty to interpret the signal source, find the corresponding entry in the exception ranges table, and do the unwind.

Moreover, I think that there is a strong reason why the kernel should not do that. You probably need to distinguish between sync and async signals, and further classify them based on si_code before even looking at the unwinding, so that only the events you are prepared for, like GC barriers or whatever you know about, start your specific actions. The kernel should not know about all those details.
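
The userspace approach described above can be sketched roughly as follows. This is a minimal illustration, not code from the patch or from any runtime: the `exc_range` table, `exc_lookup`, and `exc_dispatch` are hypothetical names, and accepting only `SIGSEGV` with `SEGV_ACCERR` is an arbitrary stand-in for "the events you are prepared for".

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical table entry: a fault in [start, end) unwinds to landing_pad. */
struct exc_range {
	uintptr_t start;
	uintptr_t end;
	uintptr_t landing_pad;
};

/* Return the landing pad whose range covers pc, or 0 if none does. */
static uintptr_t
exc_lookup(const struct exc_range *tbl, size_t n, uintptr_t pc)
{
	for (size_t i = 0; i < n; i++) {
		assert(tbl[i].start < tbl[i].end);
		if (pc >= tbl[i].start && pc < tbl[i].end)
			return (tbl[i].landing_pad);
	}
	return (0);
}

/*
 * Classify the signal first, then consult the table.  Anything other
 * than SIGSEGV/SEGV_ACCERR (an assumed example policy) is not handled.
 */
static uintptr_t
exc_dispatch(const struct exc_range *tbl, size_t n, int signo, int si_code,
    uintptr_t pc)
{
	if (signo != SIGSEGV || si_code != SEGV_ACCERR)
		return (0);
	return (exc_lookup(tbl, n, pc));
}
```

A real handler would take `signo`, `si_code`, and the faulting PC from the `siginfo_t` and `ucontext_t` passed to an `SA_SIGINFO` handler; they are plain parameters here so the policy stays visible.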

  • For userspace cooperative threading (as we're doing in Verona), we wanted to use rseq on Linux for dynamically scaling threads. Ideally, we'd like a lightweight mechanism that lets us jump back to a specific address (with enough state stored on the stack to resume) if we have come back from preemption, so that if we have a long-running task we can create a new thread to handle other work but in the steady state we can be in one-thread-per-core world.

Isn't this already handled by the interface?

Or, if you describe KSE, that was removed from kernel.

Thanks for working on this. It would be great if a man page for this addition could also be integrated in this change.

In D32505#733659, @kib wrote:
  • For allocator hardening, we'd like to ensure that signals delivered while executing in the allocator don't expose internal allocator state, so we'd like a mechanism somewhat closer to Windows structured exception handling that allows signals to be redirected based on a specific IP range.

Why should this be done in the kernel? Moreover, I believe that Windows does not do it in the kernel either. Last time I looked (I admit it was a very long time ago), they had a single upcall from the kernel to userspace for all that stuff. It is usermode's duty to interpret the signal source, find the corresponding entry in the exception ranges table, and do the unwind.

Moreover, I think that there is a strong reason why the kernel should not do that. You probably need to distinguish between sync and async signals, and further classify them based on si_code before even looking at the unwinding, so that only the events you are prepared for, like GC barriers or whatever you know about, start your specific actions. The kernel should not know about all those details.

Because, in a CHERI system, we either have to completely restrict who can install signal handlers to a trusted compartment and proxy everything (which doesn't work well in situations of mutual distrust), or we need to be able to configure signal delivery on a per-compartment basis. If untrusted code (or, at least, code outside of the memory safety TCB) can configure a signal handler, invoke the allocator compartment (via a malloc / free call), and trigger a signal, then it can see the contents of the allocator's register file. This will include capabilities to the entire heap, breaking memory safety.

  • For userspace cooperative threading (as we're doing in Verona), we wanted to use rseq on Linux for dynamically scaling threads. Ideally, we'd like a lightweight mechanism that lets us jump back to a specific address (with enough state stored on the stack to resume) if we have come back from preemption, so that if we have a long-running task we can create a new thread to handle other work but in the steady state we can be in one-thread-per-core world.

Isn't this already handled by the interface?

Or, if you describe KSE, that was removed from kernel.

KSE is not what we want. We still want to manage a pool of pthreads and inherit all of the pthread semantics (including the thread-local storage ABI), but we want a lightweight mechanism for detecting long-running tasks. We can do that by scheduling SIGALRM, but that is far from ideal because we don't have a lightweight mechanism for rescheduling it when everything is playing nicely. We want the fast path (i.e. all behaviours that we schedule in the Verona runtime have short, bounded behaviour) to be fast and the slow path to exist for fallback.

In D32505#733659, @kib wrote:
  • For allocator hardening, we'd like to ensure that signals delivered while executing in the allocator don't expose internal allocator state, so we'd like a mechanism somewhat closer to Windows structured exception handling that allows signals to be redirected based on a specific IP range.

Why should this be done in the kernel? Moreover, I believe that Windows does not do it in the kernel either. Last time I looked (I admit it was a very long time ago), they had a single upcall from the kernel to userspace for all that stuff. It is usermode's duty to interpret the signal source, find the corresponding entry in the exception ranges table, and do the unwind.

Moreover, I think that there is a strong reason why the kernel should not do that. You probably need to distinguish between sync and async signals, and further classify them based on si_code before even looking at the unwinding, so that only the events you are prepared for, like GC barriers or whatever you know about, start your specific actions. The kernel should not know about all those details.

Because, in a CHERI system, we either have to completely restrict who can install signal handlers to a trusted compartment and proxy everything (which doesn't work well in situations of mutual distrust), or we need to be able to configure signal delivery on a per-compartment basis. If untrusted code (or, at least, code outside of the memory safety TCB) can configure a signal handler, invoke the allocator compartment (via a malloc / free call), and trigger a signal, then it can see the contents of the allocator's register file. This will include capabilities to the entire heap, breaking memory safety.

Let's limit the discussion to rseq(2) and not to some future hypothetical design needed for CheriBSD (which is not FreeBSD).

  • For userspace cooperative threading (as we're doing in Verona), we wanted to use rseq on Linux for dynamically scaling threads. Ideally, we'd like a lightweight mechanism that lets us jump back to a specific address (with enough state stored on the stack to resume) if we have come back from preemption, so that if we have a long-running task we can create a new thread to handle other work but in the steady state we can be in one-thread-per-core world.

Isn't this already handled by the interface?

Or, if you describe KSE, that was removed from kernel.

KSE is not what we want. We still want to manage a pool of pthreads and inherit all of the pthread semantics (including the thread-local storage ABI), but we want a lightweight mechanism for detecting long-running tasks. We can do that by scheduling SIGALRM, but that is far from ideal because we don't have a lightweight mechanism for rescheduling it when everything is playing nicely. We want the fast path (i.e. all behaviours that we schedule in the Verona runtime have short, bounded behaviour) to be fast and the slow path to exist for fallback.

So again, I do not quite understand why you can't use rseq as is for what you describe. It lets you detect context switches by the kernel, if this is what you want.

In D32505#733678, @kib wrote:

Let's limit the discussion to rseq(2) and not to some future hypothetical design needed for CheriBSD (which is not FreeBSD).

I asked @brooks to review so that we don't introduce an API that we will need to break when the CHERI support is upstreamed.

So again, I do not quite understand why you can't use rseq as is for what you describe. It lets you detect context switches by the kernel, if this is what you want.

At least two reasons:

  • In the Linux implementation at least, issuing a syscall causes a SIGSEGV to be delivered.
  • The old PC is not stored anywhere so you cannot jump back to the point that the kernel would have resumed you to. This is fine for short idempotent sequences but it can't be used as a general-purpose resume handler.
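
The second point can be made concrete with a small model of the resume-time fixup. The field layout below follows Linux's `struct rseq_cs`; the names `cs_desc` and `cs_resume_ip` are hypothetical, and this is a userspace sketch of the rule, not the kernel code from either implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same fields as Linux's struct rseq_cs; renamed to mark it as a sketch. */
struct cs_desc {
	uint32_t version;
	uint32_t flags;
	uint64_t start_ip;
	uint64_t post_commit_offset;
	uint64_t abort_ip;
};

/*
 * Model of the fixup on return from preemption: an ip inside
 * [start_ip, start_ip + post_commit_offset) is replaced by abort_ip.
 * The interrupted ip is returned to nobody and stored nowhere.
 */
static uint64_t
cs_resume_ip(const struct cs_desc *cs, uint64_t ip)
{
	if (cs == NULL)
		return (ip);
	assert(cs->version == 0);
	/* Unsigned wraparound also rejects ip < start_ip. */
	if (ip - cs->start_ip < cs->post_commit_offset)
		return (cs->abort_ip);
	return (ip);
}
```

Since only `abort_ip` comes back, the abort handler cannot tell how far the sequence got, which is why it suits short idempotent sequences rather than general resumption.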

In D32505#733678, @kib wrote:

Let's limit the discussion to rseq(2) and not to some future hypothetical design needed for CheriBSD (which is not FreeBSD).

I asked @brooks to review so that we don't introduce an API that we will need to break when the CHERI support is upstreamed.

I can change the rseq_abi pointer to be uintptr_t, but then the syscall would need a compat wrapper. Then rseqlen should perhaps be changed to size_t as well.
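
To illustrate why pointer-sized arguments imply a compat wrapper, here is a rough sketch. Both structures and the conversion function are hypothetical and are not taken from the patch; they only show that a 32-bit process passes narrower values that must be widened before the native code path sees them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical native argument layout: widths follow the process ABI. */
struct rseq_args_native {
	uintptr_t rseq_abi;
	size_t	  rseqlen;
};

/* Hypothetical layout as seen from a 32-bit process. */
struct rseq_args_32 {
	uint32_t rseq_abi;
	uint32_t rseqlen;
};

static_assert(sizeof(uintptr_t) >= sizeof(uint32_t),
    "native pointers must hold a 32-bit address");

/* Widen 32-bit arguments to the native layout (zero extension). */
static struct rseq_args_native
rseq_args_from_32(const struct rseq_args_32 *a32)
{
	struct rseq_args_native n;

	n.rseq_abi = (uintptr_t)a32->rseq_abi;
	n.rseqlen = (size_t)a32->rseqlen;
	return (n);
}
```

With a fixed `uint64_t` field no such conversion is needed, which is the convenience of the Linux convention. The cost raised earlier in the thread is that a plain integer cannot carry a CHERI capability.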

So again, I do not quite understand why you can't use rseq as is for what you describe. It lets you detect context switches by the kernel, if this is what you want.

At least two reasons:

  • In the Linux implementation at least, issuing a syscall causes a SIGSEGV to be delivered.

This implementation does not issue SIGSEGV. On the other hand, if there was a context switch during syscall execution, then moving control to the abort point is not very useful.
But clearly the intent of rseq is to not allow large restartable sequences. So maybe I should add SIGSEGV after a syscall as well; I will think about it.

BTW, do you know whether they deliver SIGSEGV after the syscall, or instead of the syscall?

  • The old PC is not stored anywhere so you cannot jump back to the point that the kernel would have resumed you to. This is fine for short idempotent sequences but it can't be used as a general-purpose resume handler.

Again, this is clearly outside the design space of rseq.

Again, this is clearly outside the design space of rseq.

What is the goal of this review? I don't see any changes to the Linux ABI layer. I would have no problems with this design if it were exposed only via the Linux syscall interface, for compatibility with Linux software that has managed to use this to some performance benefit. But when we have evaluated rseq, we have found one of two things:

  • We can get better performance via a different mechanism. For example, snmalloc's message-passing design outperforms SuperMalloc's per-CPU second-level caching design (which could benefit from rseq for popping things from the per-CPU freelists)
  • The mechanism is not sufficiently general and does not support our use case.

As such, we have not found any compelling use case for it. If we are adding a new FreeBSD system call (not a Linux-compat system call) then I would like it to be designed based on experience with attempting to use rseq, in particular:

  • It should have compelling use cases, at least one of them with an implementation of the consumer.
  • It should not break things for any upcoming hardware.

The current implementation does not, as far as I can tell, meet either of those design goals. The Linux ABI design is explicitly documented as not being intended for use by normal programs and requires libc to provide an API for multiplexing it. So if the goal is to make it easier to port Linux software that depends on rseq to FreeBSD, it would be easy to create something source-compatible with whatever glibc does for this API on top of a more general mechanism.

Refresh and rebase.
Add signature checking for abort handlers.
Implement MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ.
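
For readers unfamiliar with abort-handler signature checking, the Linux convention is that the 32-bit signature passed at registration must appear in the four bytes immediately before `abort_ip`. The sketch below models that check over an ordinary byte buffer. It assumes this revision follows the same convention, and `abort_sig_ok` is a hypothetical name, not a function from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * text models a code region; abort_off is the abort handler's offset
 * within it.  Returns 1 if the four bytes before the handler equal sig.
 */
static int
abort_sig_ok(const uint8_t *text, size_t text_len, size_t abort_off,
    uint32_t sig)
{
	uint32_t found;

	assert(text != NULL);
	if (abort_off < sizeof(found) || abort_off >= text_len)
		return (0);
	memcpy(&found, text + abort_off - sizeof(found), sizeof(found));
	return (found == sig);
}
```

The check is meant to make it harder to redirect a thread to an arbitrary address by corrupting the critical-section descriptor, since only locations preceded by the registered signature are accepted as abort targets.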

Bugfixes.
Ensure that rseq.cs is cleared on abort.

It's still not clear what the purpose of this is. It's not added to the Linux ABI. If we're adding a new FreeBSD syscall, there should be some design review or at least motivating use cases. Linux's rseq is mostly useless (far less useful than a lightweight userspace interrupt delivery mechanism or a resume-from-context-switch handler) and is *only* vaguely useful on Linux in combination with the fact that Linux has a lightweight get-CPU system call implemented in the VDSO that is cheaper than a CPUID (which is a serialising instruction and generally costs more than most of what is saved by doing per-CPU instead of per-thread things).

I would still like to see a design document and some design review before this is added. This feature in Linux has been quite controversial. I can see a case for copying a Linux feature that is widely used, or for adding a feature in the Linuxulator that is needed by certain workloads, but this doesn't seem to meet either of those requirements (not widely used, not being added to the Linux ABI layer here, only as a native syscall).

The goal is to have tcmalloc natively working with full capacity and without further patching. This is why I started working on sched_getaffinity() compat, and why I still handle the rebases. The rseq() API was recently re-added to glibc, and I will make some further changes to facilitate source compatibility there. Anyway, the prerequisite is D32360, which is ready (at least for review).

The Linuxulator discussion is irrelevant: if the Linuxulator people want to have rseq(2), they can (easily) create the layer over the native substrate.

The goal is to have tcmalloc natively working with full capacity and without further patching.

Any memory allocator will have OS-specific abstractions, so expecting one to work without FreeBSD-specific patches is a very strange requirement. We could easily provide an rseq compat library (probably even a header-only library) over a well-designed kernel mechanism, without adding a badly designed kernel API to FreeBSD.