Page MenuHomeFreeBSD

namei: Add the abilty for the ABI to specify an alternate root path
ClosedPublic

Authored by dchagin on Mar 6 2023, 7:58 PM.
Referenced Files
F101910721: D38933.diff
Tue, Nov 5, 12:37 PM
Unknown Object (File)
Mon, Nov 4, 1:46 AM
Unknown Object (File)
Wed, Oct 23, 6:49 AM
Unknown Object (File)
Mon, Oct 21, 5:38 PM
Unknown Object (File)
Sun, Oct 20, 10:09 AM
Unknown Object (File)
Sun, Oct 20, 10:09 AM
Unknown Object (File)
Sun, Oct 20, 10:09 AM
Unknown Object (File)
Sun, Oct 20, 10:09 AM

Details

Summary

For now a non-native ABI (i.e., Linux) uses the kern_alternate_path()
facility to dynamically reroot lookups. First, an attempt is made to
lookup the file in /compat/linux/original-path. If that fails, the
lookup is done in /original-path. Thats requires a bit of code in
every ABI syscall implementation where path name translation is needed.
Also our kern_alternate_path() does not properly lookups absolute symlinks
in second attempt, i.e., does not append /compat/linux part to the resolved
link.
The change is intended to avoid this by specifiyng the ABI root directory
for namei(), using one call to pwd_exec() during exec-time into the ABI.
In that case namei() will dynamically reroot lookups as mentioned above.

PR: 72920
MFC after: 1 month

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 50173
Build 47065: arc lint + arc unit

Event Timeline

remove linux_alternate_interp, it cold be next

sys/kern/imgact_elf.c
827

this can be avoided if namei() clean ni_vp on error path

I suspect this is all done in the wrong place/moment. What you need is to specially handle root, not the symlink. Then kern_alternate_path() can go away as well I suspect.

sys/compat/linux/linux_util.c
116

Why do you need ncp? Why not move cp content further up and then directly put linux_emul_path into the right place?

In D38933#886562, @kib wrote:

I suspect this is all done in the wrong place/moment. What you need is to specially handle root, not the symlink. Then kern_alternate_path() can go away as well I suspect.

Are you mean by 'specially handle root' change root aka chroot()? I suspect, you are mean something else.
In my POV kern_alternate_path() is a big issue, it does more unexpected harm than helpful, however, retirement it is a POLA. Or we must have a way to 'mount" some files from /etc into this root.

At least it would be good to change root for the exec() time to guarantee that Linux rtld will not look at non-native directories.

In D38933#886562, @kib wrote:

I suspect this is all done in the wrong place/moment. What you need is to specially handle root, not the symlink. Then kern_alternate_path() can go away as well I suspect.

Are you mean by 'specially handle root' change root aka chroot()? I suspect, you are mean something else.
In my POV kern_alternate_path() is a big issue, it does more unexpected harm than helpful, however, retirement it is a POLA. Or we must have a way to 'mount" some files from /etc into this root.

At least it would be good to change root for the exec() time to guarantee that Linux rtld will not look at non-native directories.

No, I mean a place where the "/" is resolved in namei(). The retry with alternative path should happen somewhere around places near the calls to namei_handle_root().

Redesign, I don't like the interaction between the pwd_exec() and the ABI
due to v_usecount should be incremented in the ABI path, but seems there no
other way? I prefer to not use sx instead of rmlock to call pwd_exec() with sx held

Here is a comparison between the stock kernel and the modified kernel: https://people.freebsd.org/~dchagin/adir/

sys/kern/kern_descrip.c
4010 ↗(On Diff #119214)
4035 ↗(On Diff #119214)
sys/sys/filedesc.h
94 ↗(On Diff #119214)
  1. Back to the char linux_emul_path[], this simplifies the code, adding one namei() call to execve() path.
  1. Linprocfs changes moved to a separate commit

Something does not add up whatsoever with your bench results -- how is this patch supposed to improve scalability for open/close/unlink? Similarly it can't be faster than stock code for regular lookups, but this one is perhaps measurement error.

That aside the entire idea of Linux binaries doing 2 lookups was incredibly dodgy from the get go and I don't think this is helping the fundamental problem, albeit it may be it makes it less iffy.

The added branchfest is definitely not nice, especially the restart clause.

I strongly suspect the right way is to have linux binaries auto chrooted to /compat/linux or whatever you are looking up against and then have nullfs mounts inside for /home, /tmp and whatever else which makes sense to share. This avoids any suspicious lookups like failing to find a file in Linux because it is missing when it should not and trying to pick up the FreeBSD one. This also avoids adding any complexity to the kernel.

Even if going this route, I think the functionality can be added without pessimizing existing code. Note that for example vfs_lookup is already a standalone routine.

In D38933#897702, @mjg wrote:

Something does not add up whatsoever with your bench results -- how is this patch supposed to improve scalability for open/close/unlink?

This patch is not supposed to improve scalability, I used will-it-scale to check that I do not broke hot path.

That aside the entire idea of Linux binaries doing 2 lookups was incredibly dodgy from the get go and I don't think this is helping the fundamental problem, albeit it may be it makes it less iffy.

The added branchfest is definitely not nice, especially the restart clause.

I strongly suspect the right way is to have linux binaries auto chrooted to /compat/linux or whatever you are looking up against and then have nullfs mounts inside for /home, /tmp and whatever else which makes sense to share. This avoids any suspicious lookups like failing to find a file in Linux because it is missing when it should not and trying to pick up the FreeBSD one. This also avoids adding any complexity to the kernel.

Even if going this route, I think the functionality can be added without pessimizing existing code. Note that for example vfs_lookup is already a standalone routine.

Well, I mostly agree with all of you statement. Except some sort of pessimization, I tried to minimize touching of the hot path for native binaries.
This patch adds only two compares (pwd->pwd_rdir != pwd->pwd_adir) on error path of namei() for native. I don't think it's worth the cost.

Implementing things you are talking about requires rewriting our Linux emulation ports infrastructure as they installs files into the base not in the ABI directory.

In D38933#897702, @mjg wrote:

Something does not add up whatsoever with your bench results -- how is this patch supposed to improve scalability for open/close/unlink?

This patch is not supposed to improve scalability, I used will-it-scale to check that I do not broke hot path.

I am saying that according to the graph it did improve, markedly so, but this can't be true and consequently the bench is bogus.

That aside the entire idea of Linux binaries doing 2 lookups was incredibly dodgy from the get go and I don't think this is helping the fundamental problem, albeit it may be it makes it less iffy.

The added branchfest is definitely not nice, especially the restart clause.

I strongly suspect the right way is to have linux binaries auto chrooted to /compat/linux or whatever you are looking up against and then have nullfs mounts inside for /home, /tmp and whatever else which makes sense to share. This avoids any suspicious lookups like failing to find a file in Linux because it is missing when it should not and trying to pick up the FreeBSD one. This also avoids adding any complexity to the kernel.

Even if going this route, I think the functionality can be added without pessimizing existing code. Note that for example vfs_lookup is already a standalone routine.

Well, I mostly agree with all of you statement. Except some sort of pessimization, I tried to minimize touching of the hot path for native binaries.
This patch adds only two compares (pwd->pwd_rdir != pwd->pwd_adir) on error path of namei() for native. I don't think it's worth the cost.

It adds a branch to set things up and another one for failed lookups. clang probably also pessimized namei entry, which already is quite bad. Perhaps I should note there several single-threaded slowdowns remaining, most of them branches and this goes counter to whacking them.

I just realized your restart label is early enough that it virtually counts as a separate namei call, included repeated copy of the buffer et al. If paying that cost, you can *avoid* modifying any of it anyway and instead call namei with a faked pwd -- it would have all the usual vnodes *and* the one for /compat/linux sneaked in as root. Should this fail to produce the result, you use the real pwd. et voila, changing namei avoided.

Another note is that should this kind of double lookup need to be optimized, the way to do it for lockless lookup would be to patch the case which finds a negative entry to check if perhaps another variant is needed. This would avoid *any* work for lookups which do succeed. Anyhow so far it looks like you should be fine rolling with *calling* namei twice.

Implementing things you are talking about requires rewriting our Linux emulation ports infrastructure as they installs files into the base not in the ABI directory.

The current state is a mess and this sounds like an opportunity to sort it out, if feasible.

I was so startled by the supposed scalability diff I did not take a proper look at the other results.

700k ops/s is abysmall, perhaps your cpu is shafted with meltdown or you are running a debug kernel or fast path lookup is disabled in your case

What's more important is that it flattens for scalability, which would not happen for separate file case. I just verified it scales almost linearly up to 52 workers (i don't have more cores on my test box).

tl;dr the test is defo wrong

In D38933#897702, @mjg wrote:

I strongly suspect the right way is to have linux binaries auto chrooted to /compat/linux or whatever you are looking up against and then have nullfs mounts inside for /home, /tmp and whatever else which makes sense to share. This avoids any suspicious lookups like failing to find a file in Linux because it is missing when it should not and trying to pick up the FreeBSD one. This also avoids adding any complexity to the kernel.

This functionality (double lookup in the ugly current form) was added exactly to avoid requiring users doing what you described above.

sys/kern/kern_descrip.c
4035 ↗(On Diff #119913)

After re-reading, _unexec is the weird name. Could you merge pwd_exec with pwd_unexec, indicating unexec case by vp == NULL?

In D38933#897720, @mjg wrote:
In D38933#897702, @mjg wrote:

Something does not add up whatsoever with your bench results -- how is this patch supposed to improve scalability for open/close/unlink?

This patch is not supposed to improve scalability, I used will-it-scale to check that I do not broke hot path.

I am saying that according to the graph it did improve, markedly so, but this can't be true and consequently the bench is bogus.

That aside the entire idea of Linux binaries doing 2 lookups was incredibly dodgy from the get go and I don't think this is helping the fundamental problem, albeit it may be it makes it less iffy.

The added branchfest is definitely not nice, especially the restart clause.

I strongly suspect the right way is to have linux binaries auto chrooted to /compat/linux or whatever you are looking up against and then have nullfs mounts inside for /home, /tmp and whatever else which makes sense to share. This avoids any suspicious lookups like failing to find a file in Linux because it is missing when it should not and trying to pick up the FreeBSD one. This also avoids adding any complexity to the kernel.

Even if going this route, I think the functionality can be added without pessimizing existing code. Note that for example vfs_lookup is already a standalone routine.

Well, I mostly agree with all of you statement. Except some sort of pessimization, I tried to minimize touching of the hot path for native binaries.
This patch adds only two compares (pwd->pwd_rdir != pwd->pwd_adir) on error path of namei() for native. I don't think it's worth the cost.

It adds a branch to set things up and another one for failed lookups. clang probably also pessimized namei entry, which already is quite bad. Perhaps I should note there several single-threaded slowdowns remaining, most of them branches and this goes counter to whacking them.

it's adds only two comparison and conditional jumping for the native binary (namei does not restart for native)

I just realized your restart label is early enough that it virtually counts as a separate namei call, included repeated copy of the buffer et al. If paying that cost, you can *avoid* modifying any of it anyway and instead call namei with a faked pwd -- it would have all the usual vnodes *and* the one for /compat/linux sneaked in as root. Should this fail to produce the result, you use the real pwd. et voila, changing namei avoided.

you propose to preserve the ugly kern_alternate_path way ?

Another note is that should this kind of double lookup need to be optimized, the way to do it for lockless lookup would be to patch the case which finds a negative entry to check if perhaps another variant is needed. This would avoid *any* work for lookups which do succeed. Anyhow so far it looks like you should be fine rolling with *calling* namei twice.

Interesting, if you don't mind, could you please tell me where to start there, for lockless lookup part?

Implementing things you are talking about requires rewriting our Linux emulation ports infrastructure as they installs files into the base not in the ABI directory.

The current state is a mess and this sounds like an opportunity to sort it out, if feasible.

Our ports are still centos-7 (EOL in 2024), there is no one on the horizon who is willing to do anything with that

You can implement namei_altroot(struct nameidata *nd, struct vnode *altroot) (or whatever the name) where altroot is guaranteed v_usecount > 0. Then it can handle faking pwd for the first pass without polluting any consumers.

you could store the vnode in that linux-specific struct

also note this avoids growing struct pwd which right now is 32 bytes, fitting very nicely uma

sys/kern/kern_descrip.c
4028 ↗(On Diff #119938)

But why? Cannot the process be chrooted?

4038 ↗(On Diff #119938)

Same.

sys/kern/vfs_lookup.c
691

Should this check be more specific, e.g. only if the file was not found, instead of some other errors?

sys/kern/kern_descrip.c
4028 ↗(On Diff #119938)

I think I understand what you are concerned with. You want to prevent the chroot escape? Then perhaps, if the process is chrooted, it should get pwd_adir set to NULL.

BTW, what about jailed processes?

sys/kern/kern_descrip.c
4028 ↗(On Diff #119938)

I want to minimize the amount of checks for setup and restart namei() for the native ABI, so that it would be sufficient a comparison of pwd_adir vs pwd_rdir. So the pwd_adir should be changed if the process is jailed or if not chrooted. Also, chrooted ABI process should acts like native process from namei() perspective, ie not restarts name().

This condition could be checked in pwd_exec() or on a ABI side. The first is not effective due to pwd_alloc(), so I moved it to the ABI - linux_pwd_onexec().

Therefore asserts was used to garantie properly usage of pwd_exec(). Removed now.

Jails fixed. However, jexec $jail chroot /compat/$abi /bin/bash is not supposed to work in my POV, or it should?

sys/kern/kern_descrip.c
4028 ↗(On Diff #119938)

So there are two issues.

  1. We need to ensure that there is no jail or chroot escape. The pw_adir must point either to jail/chroot root, or to the new /compat/linux, after the op.
  2. Before the patch, /compat/linux was evaluated for each lookup. In particular, after the chroot, if the new chroot has its own /linux/compat, it worked, Also, if you changed /compat/linux, it also worked immediately.
In D38933#897702, @mjg wrote:

I strongly suspect the right way is to have linux binaries auto chrooted to /compat/linux or whatever you are looking up against and then have nullfs mounts inside for /home, /tmp and whatever else which makes sense to share. This avoids any suspicious lookups like failing to find a file in Linux because it is missing when it should not and trying to pick up the FreeBSD one. This also avoids adding any complexity to the kernel.

FWIW, https://reviews.freebsd.org/D25501

In D38933#897702, @mjg wrote:

I strongly suspect the right way is to have linux binaries auto chrooted to /compat/linux

No, please, lets don't do that. It'd be impossible to run some Linux text editor that accesses your $HOME.

rewored, allow emul_path in chroot

sys/kern/kern_descrip.c
4028 ↗(On Diff #119938)

So there are two issues.

  1. We need to ensure that there is no jail or chroot escape. The pw_adir must point either to jail/chroot root, or to the new /compat/linux, after the op.

Sure, pwd_adir is initialized in the pwd_chroot() or pwd_chroot_chdir() unconditionally, so it fully consistent with what you have written.
And to call pwd_exec() namei() is used to lookup adir vnode, so it cant escape jail or chroot. And this is a problem for #2 )))

  1. Before the patch, /compat/linux was evaluated for each lookup. In particular, after the chroot, if the new chroot has its own /linux/compat, it worked,

Reworked, look at linux_pwd_onexec(), please, now if emul_path exists in chroot or jail it is used. Thank you

Also, if you changed /compat/linux, it also worked immediately.

After the patch the running process will continue execution with proper environment. I like this behaviour.

So there is still a case. Imagine that Linux process is chrooted into a subtree with its own '/compat/linux'. It does not start using this new adir. Might be it needs a namei() in chroot (or rather it should be sysent method?).

Otherwise, it looks fine to me, mostly. It would be easier to proceed if you split this patch into _many_ pieces. For instance, you could move all Linux changes into separate commit. Same for the removal of sv_imgact_try. Not sure about core namei() changes.

dchagin retitled this revision from vfs: Allow ABI to translate symlinks according to the ABI prefix to namei: Add the abilty for the ABI to specify an alternate root path.May 14 2023, 6:06 PM
dchagin edited the summary of this revision. (Show Details)
In D38933#910452, @kib wrote:

So there is still a case. Imagine that Linux process is chrooted into a subtree with its own '/compat/linux'. It does not start using this new adir. Might be it needs a namei() in chroot (or rather it should be sysent method?).

ugh, now this code in https://reviews.freebsd.org/D40090

Im sorry, can’t imagine how it can, linux_pwd_exec calls namei() unconditionally, and returns namei error only if not in chroot (jail).
I’d say more, now the one can do jexec chroot /compat/ubuntu or jexec /compat/ubuntu/bin/bash, i.e., fully isolate Linuxulator.

Otherwise, it looks fine to me, mostly. It would be easier to proceed if you split this patch into _many_ pieces. For instance, you could move all Linux changes into separate commit. Same for the removal of sv_imgact_try. Not sure about core namei() changes.

done,

kib added inline comments.
sys/kern/kern_descrip.c
4013 ↗(On Diff #121935)

Name the 'vp' parameter more vividly, to indicate that this is the alternate root.
Might be, pwd_exec() should be also called more expressive.

This revision is now accepted and ready to land.May 24 2023, 5:54 AM