This is currently just a request for comments on the general idea and not for code review!
So this is the proposal I promised in D37945…
Background
Oblivious sandboxing with Capsicum has been a very attractive idea for me. The capability-based sandbox is very strict, and does not show up anywhere state lives outside of the process (yeah I'm way too obsessed with clean mount and jls output haha). A few years ago I experimented with improving libpreopen, the LD_PRELOAD library that converts AT_FDCWD to file descriptors according to a path map, and with building a launcher around it. I managed to get GUI applications running, including with GPU access, but the LD_PRELOAD way of doing things was quite frustrating. *All* the relevant syscall wrappers have to be hooked, which is tedious work with somewhat fragile results. Raw syscall usage (hello golang I hate you for this) isn't hooked at all. If only there were a way to modify the one point all these syscalls converge on, the filesystem lookup…
…yeah, that's in the kernel. So my first experiment was "kpreopen", basically a port of the whole libpreopen idea into the kernel, simply located in the namei function where all (???) the syscalls that look up paths converge. It was a cool experiment! But I realized there is a much bigger limitation with the whole path map idea in general…
…that is: if the program itself were to use openat on these paths, it would unavoidably confuse itself into a weird split-brain view of the file namespace! Say we have preopened /tmp/one as /usr/local and /tmp/two as /usr/local/bin, then:

  openat(AT_FDCWD, "/usr/local/bin/wtf") → /tmp/two/wtf
  openat(openat(AT_FDCWD, "/usr/local"), "bin/wtf") → /tmp/one/bin/wtf
Oops. This is absolutely not acceptable. This is cursed broken behavior. BTW, the eBPF version of the same kind of preopening with a path map would have that issue too.
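To make the split-brain problem concrete, here is a small userspace simulation of a path map with longest-prefix matching. It is only a sketch: the `map_entry` structure and the `resolve_absolute` / `resolve_via_dir` helpers are hypothetical names made up for illustration, not code from libpreopen or kpreopen, and no real files are touched.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical path map entry: sandbox-visible prefix -> real backing directory. */
struct map_entry {
	const char *visible;
	const char *backing;
};

/* The same mapping as in the example above. */
static const struct map_entry path_map[] = {
	{ "/usr/local",     "/tmp/one" },
	{ "/usr/local/bin", "/tmp/two" },
};

/* Find the longest visible prefix that matches on a path component boundary. */
static const struct map_entry *
lookup(const char *path)
{
	const struct map_entry *best = NULL;
	size_t best_len = 0;

	for (size_t i = 0; i < sizeof(path_map) / sizeof(path_map[0]); i++) {
		size_t len = strlen(path_map[i].visible);

		if (strncmp(path, path_map[i].visible, len) != 0)
			continue;
		if (path[len] != '/' && path[len] != '\0')
			continue;
		if (len > best_len) {
			best = &path_map[i];
			best_len = len;
		}
	}
	return (best);
}

/* What a path-map lookup of an absolute path (AT_FDCWD case) ends up hitting. */
static int
resolve_absolute(const char *path, char *out, size_t outlen)
{
	const struct map_entry *e = lookup(path);

	assert(out != NULL && outlen > 0);
	if (e == NULL)
		return (-1);
	snprintf(out, outlen, "%s%s", e->backing, path + strlen(e->visible));
	return (0);
}

/*
 * What happens when the program opens a directory first and then looks up
 * a relative path below it: the map is consulted only for the directory,
 * the relative part is resolved in the real backing directory.
 */
static int
resolve_via_dir(const char *dir, const char *rel, char *out, size_t outlen)
{
	char base[256];

	if (resolve_absolute(dir, base, sizeof(base)) != 0)
		return (-1);
	snprintf(out, outlen, "%s/%s", base, rel);
	return (0);
}
```

Calling `resolve_absolute("/usr/local/bin/wtf", ...)` and `resolve_via_dir("/usr/local", "bin/wtf", ...)` yields two different backing paths for what the program believes is one file, which is exactly the inconsistency described above.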
Proposal
So: screw it, we have to throw the whole path map thing away. We actually have to build the filesystem hierarchy in vfs (with nullfs mounts, sysutils/fusefs-sandboxfs, etc.) but then what?
Then we just preopen that root, and we have *one* file descriptor for the kernel to substitute for AT_FDCWD!
Or two, because there's also the current directory thing. I've experimented with this because I'm currently running FreeBSD in a headless VM, and the thing I was trying to sandbox was the Helix text editor, which uses the cwd functionality. It's kinda messy in the prototype: I've just reused fchdir as the syscall for setting the cwd descriptor. But e.g. GUI apps very rarely use cwd, so we can avoid that whole mess for now, like libpreopen has been doing. I think I'll remove the cwd/fchdir stuff in the next version of the prototype.
To sum up, the idea is:
- there is some call (in the prototype, procctl(PROC_FDCWD_CTL)) that sets a file descriptor for the "fake root"
- namei will substitute that descriptor for AT_FDCWD, voila
- now we can add CAPENABLED to various legacy non-*at syscalls that go through namei anyway (DANGER DANGER NEED TESTING)
The most basic sandbox launcher is like this:
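  #include <sys/procctl.h>
  #include <sys/capsicum.h>
  #include <fcntl.h>
  #include <stdio.h>	/* asprintf */
  #include <stdlib.h>
  #include <string.h>
  #include <unistd.h>

  extern char **environ;

  int main(int argc, char **argv) {
  	/* The runtime linker is executed directly; close-on-exec is fine for it. */
  	int lfd = open("/libexec/ld-elf.so.1", O_RDONLY | O_CLOEXEC);
  	/* The target binary must stay open across exec so rtld can load it via -f. */
  	int pfd = open("/usr/local/bin/hx", O_RDONLY);
  	/* The prepared hierarchy that becomes the "fake root". */
  	int dfd = open("/tmp/sbox", O_DIRECTORY | O_PATH);
  	if (lfd < 0 || pfd < 0 || dfd < 0)
  		return 1;
  	/* PROC_FDCWD_CTL only exists in this prototype. */
  	if (procctl(P_PID, getpid(), PROC_FDCWD_CTL, &dfd) != 0)
  		return 1;
  	if (cap_enter() != 0)
  		return 1;
  	/* argv for rtld: "sbox:hx" -f <pfd> -- <original argv...> NULL */
  	char **cargv = calloc(argc + 5, sizeof(void *));
  	if (cargv == NULL)
  		return 1;
  	asprintf(&cargv[0], "sbox:hx");
  	cargv[1] = "-f";
  	asprintf(&cargv[2], "%d", pfd);
  	cargv[3] = "--";
  	memcpy((void *)&cargv[4], argv, argc * sizeof(void *));
  	fexecve(lfd, cargv, environ);
  	return 1;	/* only reached if fexecve failed */
  }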
Questions and Answers
Why even bother with Capsicum then, why not just use jails?
Why not indeed. Perhaps if the end goal is just shoehorning legacy applications into a sandbox, it's not really worth it.
However, this opens the door for developing capability-aware applications that use existing, non-capability-aware dependencies like GUI toolkits. For making software gradually more capability-aware.
Are you sure this code is safe?
Well, the core mechanism seems obvious, simple and safe, but allowing extra syscalls in capability mode is rather scary and will need lots of serious testing! Absolutely not sure about that stuff! And we don't need *all* the syscalls I've enabled here in the prototype!
What would we do about other namespaces like network?
For now one can start with LD_PRELOAD hooking + IPC things like Casper etc., but I think eventually I'd like to make "address space descriptors" that would be kernel-space capabilities for opening various kinds of sockets (e.g. restricted to only some addresses), so one could just connectat() on that capability to IP addresses with TCP or UDP. And then, just like the filesystem thing, add a way to substitute one of those capabilities for AT_FDCWD :)