
[RFC/Proposal] Mechanism for in-kernel AT_FDCWD substitution with provided FD for oblivious sandboxing with Capsicum
Abandoned, Public

Authored by val_packett.cool on Feb 2 2023, 7:27 AM.

Details

Reviewers
brooks
Group Reviewers
capsicum
Summary

This is currently just a request for comments on the general idea and not for code review!

So this is the proposal I promised in D37945

Background

Oblivious sandboxing with Capsicum has been a very attractive idea for me. The capability-based sandbox is very strict, and does not show up in any places where state is outside of the process (yeah I'm way too obsessed with clean mount and jls output haha). A few years ago I experimented with improving libpreopen, the LD_PRELOAD library that converts AT_FDCWD to file descriptors according to a path map, and with making a launcher around it. I managed to get GUI applications running, including with GPU access, but the LD_PRELOAD way of doing things was quite frustrating. *All* the relevant syscall wrappers have to be hooked, which is tedious work with somewhat fragile results. Raw syscall usage (hello golang I hate you for this) isn't hooked. If only there were a way to modify the one point where all these syscalls converge, the filesystem lookup…

…yeah, that's in the kernel. So my first experiment was "kpreopen", basically a port of the whole libpreopen idea into the kernel, placed in the namei function, where all (???) the syscalls that look up paths converge. It was a cool experiment! But I realized there is a much bigger limitation with the whole path map idea in general…

…that is: if the program itself were to use openat on these paths, it would unavoidably confuse itself into a weird split-brain view of the file namespace! Say we have preopened /tmp/one as /usr/local and /tmp/two as /usr/local/bin, then:

openat(AT_FDCWD, "/usr/local/bin/wtf") → /tmp/two/wtf
openat(openat(AT_FDCWD, "/usr/local"), "bin/wtf") → /tmp/one/bin/wtf

Oops. This is absolutely not acceptable. This is cursed broken behavior. BTW, the eBPF version of the same kind of preopening with a path map would have that issue too.
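
To make the split-brain case concrete, here is a minimal user-space sketch. It is not libpreopen or kpreopen code: the `map_entry` struct and the `map_resolve` and `dir_resolve` helpers are hypothetical names, and longest-prefix matching is an assumption about how such a path map would behave. It only models the two lookups above as string operations.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical path map entry: a virtual prefix and the real directory behind it. */
struct map_entry {
        const char *virt;
        const char *real;
};

/* The two preopens from the example above. */
static const struct map_entry path_map[] = {
        { "/usr/local",     "/tmp/one" },
        { "/usr/local/bin", "/tmp/two" },
};

/*
 * Resolve an absolute path through the map using the longest matching
 * prefix (an assumption for this sketch). Returns 0 on success, -1 if no
 * entry matches or the output buffer is too small.
 */
static int
map_resolve(const char *path, char *out, size_t outlen)
{
        const struct map_entry *best = NULL;
        size_t bestlen = 0;

        assert(path != NULL && out != NULL);
        for (size_t i = 0; i < sizeof(path_map) / sizeof(path_map[0]); i++) {
                size_t n = strlen(path_map[i].virt);
                /* Only match on a path component boundary. */
                if (strncmp(path, path_map[i].virt, n) == 0 &&
                    (path[n] == '/' || path[n] == '\0') && n > bestlen) {
                        best = &path_map[i];
                        bestlen = n;
                }
        }
        if (best == NULL)
                return -1;
        int r = snprintf(out, outlen, "%s%s", best->real, path + bestlen);
        return (r < 0 || (size_t)r >= outlen) ? -1 : 0;
}

/*
 * Resolve a relative path against a directory that was itself opened
 * through the map. The descriptor only knows its real directory, so the
 * map is never consulted again.
 */
static int
dir_resolve(const char *realdir, const char *rel, char *out, size_t outlen)
{
        int r = snprintf(out, outlen, "%s/%s", realdir, rel);
        return (r < 0 || (size_t)r >= outlen) ? -1 : 0;
}
```

Resolving "/usr/local/bin/wtf" with `map_resolve` lands under /tmp/two, while resolving "/usr/local" first and then calling `dir_resolve` with "bin/wtf" lands under /tmp/one: two different files for what the program believes is one path.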

Proposal

So: screw it, we have to throw the whole path map thing away. We actually have to build the filesystem hierarchy in the VFS (with nullfs mounts, sysutils/fusefs-sandboxfs, etc.), but then what?

Then we just preopen that root, and we have *one* file descriptor for the kernel to substitute for AT_FDCWD!

Or two, because there's also the current directory thing... I've experimented with this because I'm currently running FreeBSD in a headless VM, and the thing I was experimenting with sandboxing was the Helix text editor, which uses the cwd functionality. It's kinda messy in the prototype: I've just reused fchdir as the syscall for setting the cwd descriptor. But GUI apps, for example, very rarely use cwd, so we can avoid that whole mess for now like libpreopen has been doing. I think I'll remove the cwd/fchdir stuff in the next version of the prototype.

To sum up, the idea is:

  • there is some call (in the prototype, procctl(PROC_FDCWD_CTL)) that sets a file descriptor for the "fake root"
  • namei will substitute that descriptor for AT_FDCWD, voila
  • now we can add CAPENABLED to various legacy non-*at syscalls that go through namei anyway (DANGER DANGER NEED TESTING)

The most basic sandbox launcher is like this:

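#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>      /* asprintf */
#include <stdlib.h>
#include <string.h>
#include <sys/procctl.h>
#include <sys/capsicum.h>
extern char **environ;

int main(int argc, char **argv) {
        /* Everything that needs a global path must be opened before cap_enter(). */
        int lfd = open("/libexec/ld-elf.so.1", O_RDONLY | O_CLOEXEC); /* run-time linker */
        int pfd = open("/usr/local/bin/hx", O_RDONLY);      /* target binary, kept open across exec */
        int dfd = open("/tmp/sbox", O_DIRECTORY | O_PATH);  /* the "fake root" */
        if (lfd < 0 || pfd < 0 || dfd < 0)
                return 1;
        /* PROC_FDCWD_CTL only exists in this prototype. */
        procctl(P_PID, getpid(), PROC_FDCWD_CTL, &dfd);
        cap_enter();
        /* Build: sbox:hx -f <pfd> -- <original argv...> NULL */
        char **cargv = calloc(argc + 5, sizeof(void *));
        if (cargv == NULL)
                return 1;
        asprintf(&cargv[0], "sbox:hx");
        cargv[1] = "-f"; asprintf(&cargv[2], "%d", pfd);
        cargv[3] = "--"; memcpy((void *)&cargv[4], argv, argc * sizeof(void *));
        /* Let the run-time linker load the binary from pfd. */
        fexecve(lfd, cargv, environ);
        return 1; /* only reached if fexecve() failed */
}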

Questions and Answers

Why even bother with Capsicum then, why not just use jails?

Why not indeed. Perhaps if the end goal is just shoehorning legacy applications into a sandbox, it's not really worth it.
However, this opens the door to developing capability-aware applications that use existing, non-capability-aware dependencies like GUI toolkits, and so to making software gradually more capability-aware.

Are you sure this code is safe?

Well, the core mechanism seems obvious, simple and safe, but allowing extra syscalls in capability mode is rather scary and will need lots of serious testing! Absolutely not sure about that stuff! And we don't need *all* the syscalls I've enabled here in the prototype!

What would we do about other namespaces like network?

For now one can start with LD_PRELOAD hooking + IPC things like Casper etc., but I think eventually I'd like to make "address space descriptors" that would be kernel-space capabilities for opening various kinds of sockets (e.g. restricted to only some addresses), so one could just connectat() on that capability to IP addresses with TCP or UDP. And then, just like the filesystem thing, a way to substitute one of those capabilities for AT_FDCWD :)

Diff Detail

Repository
rG FreeBSD src repository

Event Timeline

val_packett.cool updated this revision to Diff 116305.

added patch context

Are you sure this patch is correctly generated?
I know that you said you want the review of the general idea, but still..

Yes. Not sure what could seem incorrectly generated..

Hmm, sorry, last time it showed me that you had added almost the whole sys/kern/kern_procctl.c. I guess it was some glitch in Phabricator.

sys/kern/vfs_lookup.c
363

Oops: due to this (most likely this part) O_RESOLVE_BENEATH without cap mode started doing absolute lookups… hmm

Hah, I've been working on something similar, although from a somewhat different, CHERI-related, angle :)

I wonder if, instead of "fake root", you made it so that it uses the actual root vnode for the process, the one that's changed by chroot(2)? That would require implementing fchroot(2), like NetBSD did, and that's assuming you're ok with the NO_NEW_PRIVS flag set, i.e. the SUID bits being ignored, because otherwise chroot(2)/fchroot(2) would require root privileges. cap_enter() would then have to set the process' root vp to NULL, or to some dead vnode, when called.

As for cwd - tbh I don't think you can have anything Unix-like without cwd; everything command-line strongly depends on it. But then I don't see why we can't allow fchdir(2) in Capsicum mode either, as long as cap_enter(2) zeroes it.

Thanks for the excellent suggestions! Makes sense. One thing I just realized with fchroot is that… we then would need to add cap_rights for the root, and fchroot would copy them from the given fd, and they would apply to everything opened under the chroot.

FWIW, I've been playing with this idea on and off, and I have some patches, some of them not even entirely broken :) In particular I have fchroot(2) working: https://reviews.freebsd.org/D41564

Thanks @trasz, I'll experiment with building on top of fchroot. I'll post new proposals as separate revisions and leave this closed as-is for historical reference :)