Capsicum vs the Pathnames, a PoC
Needs Review · Public

Authored by trasz on Mar 15 2024, 12:48 PM.

Details

Reviewers
brooks
val_packett.cool
jonathan
Group Reviewers
capsicum
Summary

This is a proof of concept implementation of some changes to how Capsicum
handles path names. It's in some ways similar to D38351 by Val Packett,
but implemented quite differently. The primary motivation is to make it possible
to execute binaries in capability mode from the start, without having to trust them.

The way this works now is that absolute path lookups are prohibited,
and relative lookups are only allowed with an explicitly provided directory
descriptor.

The way it works with the patch is that both are allowed, but only
if the process - or its ancestor - called fchdir(2) and fchroot(2)
to set the descriptors the (newly allowed) lookups are relative to.
Calling cap_enter(2) clears both descriptors again.

There is a (pretty terrible, and obviously temporary) hack
to the chroot(8) utility to run binaries in capability mode "by hand":

$ chroot -Cdn 5 /bin/sh 5< /

Regarding the Capsicum security model, I believe the lookup change doesn't change it.
The directory descriptors for lookups still need to be provided by the process,
like before; it's just that now it can ask the kernel to use them for absolute
and relative lookups instead of having to explicitly pass them to APIs like openat(2).

Sponsored by: Innovate UK

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 56625
Build 53513: arc lint + arc unit

Event Timeline

I worry somewhat about interactions with dlopen, which was previously disabled in capability mode by virtue of breaking open(2). It's true that fdlopen existed, but that's a somewhat different beast and I suspect its users are more likely to be audited.

I kind of want to disallow fchroot in capability mode and have a cap_enter2 that takes a root fd and a flags argument that includes flags to disable this functionality, but that also feels like it adds complexity.

sys/kern/kern_mib.c
102

Not obviously related to the rest of this patch. Seems generally fine though.

sys/kern/syscalls.master
146

At first glance, I find myself wanting a separate flag from SYF_CAPENABLED so we can potentially deny these syscalls in syscallenter if both curdir and root are ecapmodevp. I'm not sure this is actually a good idea, but it's easier to make the annotations in syscalls.master different now.

160

Leakage from D44372?

usr.bin/procstat/procstat_files.c
148

Old binaries will still use this so might as well keep it until we're ready to completely remove API support.

Can you describe the dlopen threat model a bit more? My assumption is, a typical Capsicum-aware app wouldn't be setting the rootdir/curdir at all. Or, if it does, it could call cap_enter(2) again before calling dlopen(3), clearing those vnodes.

Also, do you think it makes sense to split off fchroot(2) and get that bit committed first?

sys/kern/kern_mib.c
102

Yeah, bmake refuses to work without it. I suppose it should be fixed in bmake and not here though; this often contains personal information (builder's hostname and login).

sys/kern/syscalls.master
146

I was thinking about something similar - a separate flag to be used by an explicit sysctl to switch back to the old semantics in case of security bug. It did make the patch quite a bit larger though.

160

Yup.

usr.bin/procstat/procstat_files.c
148

Makes sense.

Can you describe the dlopen threat model a bit more? My assumption is, a typical Capsicum-aware app wouldn't be setting the rootdir/curdir at all. Or, if it does, it could call cap_enter(2) again before calling dlopen(3), clearing those vnodes.

My worry is dlopen calls the developer is unaware of (or not thinking about) suddenly working. For example things that look like iconv or nss that didn't used to work and now could be coerced to work. I'm not sure how serious an issue this is.

Also, do you think it makes sense to split off fchroot(2) and get that bit committed first?

It probably does make sense to commit separately. This review is pretty large.

sys/kern/kern_mib.c
102

Hmm, could we make it a SYSCTL_PROC and output something more reserved in capability mode?

@trasz : thanks for sending this review request. My general feeling is that I'm leery of relaxing the in-kernel security model, not just because of the potential for opening things we don't mean to open, but also because it complicates the model for those who are trying to understand it. "No global namespaces", while limiting, is a clearer rule than "no global namespaces unless you or your ancestor has previously called fchroot(2), unless-unless something has also called cap_enter(2) again to clear that magic vnode".

I wonder if, instead of changing the in-kernel model, this might be better addressed through interposition, either using LD_PRELOAD-ed wrappers that convert open(2) to openat(2) (relative to a pre-set "root" FD) or within libc itself?

In fact, maybe I should connect you with a student of mine who is playing with LD_PRELOAD-ed wrappers in order to run unmodified installer scripts like RVM and (hopefully soon) rustup...

I wonder if, instead of changing the in-kernel model, this might be better addressed through interposition, either using LD_PRELOAD-ed wrappers that convert open(2) to openat(2) (relative to a pre-set "root" FD) or within libc itself?

This whole thing originates with me being too frustrated with how janky and non-robust LD_PRELOAD-based hackery was feeling! :) The proposal I submitted for discussion was a kernel-based very literal equivalent to libpreopen, which due to being in the kernel could just do the substitution in the *one* central place in the codebase where all FS lookups go through, instead of having to hook *every* entry point that eventually ends up doing an FS lookup in the kernel, which is just not viable.

But sure, doing this(*) in the actual libc itself, while still a lot more tedious than in the kernel for the aforementioned "one central place" advantage the kernel has, definitely has the potential to be robust! Normal operation, no "special" interposition at runtime + as the libc is part of the base system, making sure this thing works with new syscall additions and changes would be part of normal maintenance.

(*) about "this" though i.e. what do we even want: I've been admiring the direction WASI is going towards with WebAssembly Component Model… Where e.g. "the file system" is a defined interface which can have multiple implementations, and when launching a process you could compose the environment in whatever arrangement you want: plug the "real file system" component into the "main program", or instead plug a "virtual in-memory file system" or a "remote file system" one, or one that combines many of those…

*and*, at the same time, I've been really longing for some way of constructing a virtual filesystem tree for a sandboxed process subtree (jailed or this-kind-of-thing-substituted-capsicumized) that would NOT involve "persistent" (/ outwardly visible) state like tmpfs's with nullfs mounts that are present in the global VFS namespace and visible in the mount output and so on. Call me shallow and superficial but I really hate all that stuff "sticking out" like that, it offends me aesthetically xD Linux filesystem namespaces at least let you "tuck it all away" and (importantly!) tie it directly to the lifecycle of that process subtree so you won't ever end up with "leftovers", but this Component Model style composability feels far superior honestly.

I'm not sure how to implement that well in the down-to-earth Unix/C world though, with file descriptors being single numbers handled purely by the kernel right now, and all that… I guess nsswitch is a precedent for pluggable stuff in the libc, but that's an easy case as it doesn't have anything like object handles. One idea for handling VFS—rather round-trippy in the single process case but would allow for incredible flexibility in terms of structuring how the environment ends up at runtime—is to make a sort of "VFS-in-userspace" system. Imagine this: a syscall to create a producer-consumer pair of capability descriptors. The consumer can be duplicated, inherited, passed over sockets, used with *at operations by itself, inserted into this fchroot(); and all the operations on it (and on virtual descriptors created by it) become requests to the producer side, which that side must handle via kqueue or something.

@trasz : thanks for sending this review request. My general feeling is that I'm leery of relaxing the in-kernel security model, not just because of the potential for opening things we don't mean to open, but also because it complicates the model for those who are trying to understand it. "No global namespaces", while limiting, is a clearer rule than "no global namespaces unless you or your ancestor has previously called fchroot(2), unless-unless something has also called cap_enter(2) again to clear that magic vnode".

Ah, but my whole point here is (that I believe) it _doesn't_ change the security model :)

Perhaps we understand the term "global namespace" differently. To me, this doesn't do anything with a global namespace - it's about the kernel doing the mapping instead of libc, like Val described. One difference is that this mapping is inherited from the parent; with mapping in userspace you'd inherit them as ordinary file descriptors. You're not supposed to stash the system's or jails' actual root file descriptor there; I imagine that typically it would either be a premade, read-only system image, or something synthetic.

Or perhaps it's my explanation above, which describes the implementation rather than the way to use it. For the users, the mental model would be "instead of explicitly passing file descriptor to openat(2) every time you can pre-set it using fchroot(2) and fchdir(2)".

I wonder if, instead of changing the in-kernel model, this might be better addressed through interposition, either using LD_PRELOAD-ed wrappers that convert open(2) to openat(2) (relative to a pre-set "root" FD) or within libc itself?

I can see a few problems there. First, without inheriting cwd and rootfd (of some kind) from the parent you can't have something that resembles a Unix shell. Second, when starting a new process you need to somehow find rtld, then the shared libraries. Sure, it can be done, but with the above in the kernel you don't need to. And finally you have static binaries and weird runtimes, like golang.

In fact, maybe I should connect you with a student of mine who is playing with LD_PRELOAD-ed wrappers in order to run unmodified installer scripts like RVM and (hopefully soon) rustup...

Yes please :)

Reading the proposal, I sense that this would make capsicumization of command-line programs which convert argv[] entries to capabilities (i.e., which process a list of files) much easier. Rather than having to use cap_fileargs (which is expensive and has some functional thorns, and requires some refactoring to pass the casper channel around) or refactor everything to use openat() (not always practical), the program can instead

  • open the root dir and working dir,
  • limit rights on those dir fds,
  • call cap_enter()
  • call fchdir() and fchroot() with the aforementioned dirfds

Then, the program should be able to process argv[] entries without any further modification, which would make life much easier. This leaves open the possibility that the sandboxed process would be able to open files not listed in the argv[], but assuming that rights are appropriately limited on the dirfds, I think this is probably an acceptable tradeoff for many programs.

It would be quite nice if one could do this without requiring privileges. In particular, if one is just using fchroot() to make a dirfd visible to capsicum without actually changing the root directory (i.e., the root vnode doesn't change), can we get away without requiring that?

That's precisely the idea - and it explains it much better than my description above. And yes, fchroot(2) can be used unprivileged the same way chroot(2) can: just set PROC_NO_NEW_PRIVS_ENABLE; for now you also need to set security.bsd.unprivileged_chroot=1.

One nit with the current PoC is that you can't limit rights on those directory fds. Our lookup code tries hard to avoid having to track rights during lookup, and I've left it that way, so for now fchdir(2) and fchroot(2) require the default full rights on directories passed to them.

That's precisely the idea - and it explains it much better than my description above. And yes, fchroot(2) can be used unprivileged the same way chroot(2) can: just set PROC_NO_NEW_PRIVS_ENABLE; for now you also need to set security.bsd.unprivileged_chroot=1.

Indeed, but if we start sandboxing command-line applications this way, they'll presumably be broken if someone sets security.bsd.unprivileged_chroot=0, which seems rather fragile. So, I'm wondering if calling fchroot() without actually changing the root vnode can be a special operation which doesn't require privilege checking.

One nit with the current PoC is that you can't limit rights on those directory fds. Our lookup code tries hard to avoid having to track rights during lookup, and I've left it that way, so for now fchdir(2) and fchroot(2) require the default full rights on directories passed to them.

Hmm, that's rather unfortunate. By "tries hard" you mean that having non-default rights causes namei() to take a slow path?

Indeed, but if we start sandboxing command-line applications this way, they'll presumably be broken if someone sets security.bsd.unprivileged_chroot=0, which seems rather fragile. So, I'm wondering if calling fchroot() without actually changing the root vnode can be a special operation which doesn't require privilege checking.

Not sure if I follow. How would fchroot(2) work without changing the root vnode?

As for the sysctl - honestly, the sysctl was supposed to be a "chicken bit" to disable it in case there was a security problem with it. It should be set by default, and removed some time after. It's just I never got to doing it.

Hmm, that's rather unfortunate. By "tries hard" you mean that having non-default rights causes namei() to take a slow path?

It's not really a slow path, it's just it avoids allocating and passing rights unless required. I have no idea what performance impact is there, if any.

Not sure if I follow. How would fchroot(2) work without changing the root vnode?

In the use-case I imagined originally, code would effectively do this:

dfd = open("/", O_DIRECTORY);
cap_rights_limit(dfd, ...);
cap_enter();
fchroot(dfd);
/* now applications can use open("/foo/bar"), subject to rights on `dfd`. */

Here, the root vnode isn't actually changing.

It's not really a slow path, it's just it avoids allocating and passing rights unless required. I have no idea what performance impact is there, if any.

will-it-scale would be a good tool to get a quick evaluation of that cost.

Here, the root vnode isn't actually changing.

Ah, you mean reusing the real, system-wide root directory? But then we'd need a way to "filter" it, so that the app can't access /home. (I wonder if we could have a new right in rights(4) for mount point crossing?)

I was assuming the root would typically be set to a read-only system image, mounted via tarfs or fuse, or perhaps a read-only null mount from zroot/ROOT/default. Thinking more like assets in game engine terms than the actual system-wide root.

will-it-scale would be a good tool to get a quick evaluation of that cost.

Hm, you're right, I could compare the lookup times for a descriptor with full rights(4) vs one that's limited.