Details

Reviewers

kib
tijl
manu

Group Reviewers

Linux Emulation

Commits

rG3d2fec7db856: namei: Add the abilty for the ABI to specify an alternate root path

Summary

For now a non-native ABI (i.e., Linux) uses the kern_alternate_path()
facility to dynamically reroot lookups. First, an attempt is made to
look up the file in /compat/linux/original-path. If that fails, the
lookup is done in /original-path. This requires a bit of code in
every ABI syscall implementation where path name translation is needed.
Also, our kern_alternate_path() does not properly look up absolute symlinks
in the second attempt, i.e., it does not prepend the /compat/linux part to
the resolved link.
The change is intended to avoid this by specifying the ABI root directory
for namei(), using one call to pwd_exec() at exec time into the ABI.
In that case namei() will dynamically reroot lookups as mentioned above.
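
The two-pass lookup and the absolute-symlink case described above can be illustrated with a small userland model. This is only a sketch under stated assumptions, not the kernel code from this revision: fake_fs, fs_find(), resolve_in_root() and abi_lookup() are hypothetical names, the file system is a fixed table, and only absolute symlink targets are modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the file system; target != NULL marks a symlink. */
struct fs_entry {
	const char *path;
	const char *target;
};

static const struct fs_entry fake_fs[] = {
	{ "/compat/linux/lib/libc.so.6", NULL },
	{ "/compat/linux/usr/lib/libc.so", "/lib/libc.so.6" },
	{ "/etc/passwd", NULL },
};

static const struct fs_entry *
fs_find(const char *path)
{
	size_t i;

	for (i = 0; i < sizeof(fake_fs) / sizeof(fake_fs[0]); i++)
		if (strcmp(fake_fs[i].path, path) == 0)
			return (&fake_fs[i]);
	return (NULL);
}

/* Resolve 'path' treating 'root' as "/"; symlink targets stay under 'root'. */
static int
resolve_in_root(const char *root, const char *path, char *out, size_t outlen)
{
	const struct fs_entry *e;
	int hops, n;

	for (hops = 0; hops < 8; hops++) {
		n = snprintf(out, outlen, "%s%s", root, path);
		if (n < 0 || (size_t)n >= outlen)
			return (-1);
		e = fs_find(out);
		if (e == NULL)
			return (-1);	/* not found under this root */
		if (e->target == NULL)
			return (0);	/* regular entry, 'out' holds the result */
		path = e->target;	/* follow the link, still rerooted */
	}
	return (-1);			/* too many links */
}

/* First pass under the ABI root, second pass under the native root. */
static int
abi_lookup(const char *abi_root, const char *path, char *out, size_t outlen)
{
	assert(path[0] == '/');
	if (resolve_in_root(abi_root, path, out, outlen) == 0)
		return (0);
	return (resolve_in_root("", path, out, outlen));
}
```

In this model, looking up /usr/lib/libc.so with the ABI root /compat/linux follows the absolute link and ends at /compat/linux/lib/libc.so.6 rather than at a native /lib path, while /etc/passwd falls through to the native root.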

PR: 72920
MFC after: 1 month

Diff Detail

Repository

rG FreeBSD src repository

Lint

Lint Passed

Unit

No Test Coverage

Build Status

Buildable 50173
Build 47065: arc lint + arc unit

Event Timeline

dchagin created this revision. Mar 6 2023, 7:58 PM

Herald added a reviewer: manu. Mar 6 2023, 7:58 PM

Herald added subscribers: emaste, andrew, imp.

dchagin requested review of this revision. Mar 6 2023, 7:58 PM

Harbormaster completed remote builds in B50173: Diff 118420. Mar 6 2023, 7:58 PM

dchagin edited reviewers, added: Linux Emulation, kib, tijl; removed: manu. Mar 6 2023, 8:03 PM

dchagin added a project: Linux Emulation.

Herald added a reviewer: manu. Mar 6 2023, 8:03 PM

Remove linux_alternate_interp; it could be next.

Harbormaster completed remote builds in B50174: Diff 118421. Mar 6 2023, 8:08 PM

dchagin added inline comments. Mar 6 2023, 8:16 PM

sys/kern/imgact_elf.c
827	This could be avoided if namei() cleaned ni_vp on the error path.

I suspect this is all done in the wrong place/moment. What you need is to specially handle root, not the symlink. Then kern_alternate_path() can go away as well I suspect.

sys/compat/linux/linux_util.c
116	Why do you need ncp? Why not move cp content further up and then directly put linux_emul_path into the right place?

In D38933#886562, @kib wrote:

I suspect this is all done in the wrong place/moment. What you need is to specially handle root, not the symlink. Then kern_alternate_path() can go away as well I suspect.

Do you mean by 'specially handle root' changing the root, aka chroot()? I suspect you mean something else.
In my POV kern_alternate_path() is a big issue: it does more unexpected harm than good. However, retiring it is a POLA violation. Or we must have a way to 'mount' some files from /etc into this root.

At least it would be good to change root for the exec() time to guarantee that Linux rtld will not look at non-native directories.

In D38933#886684, @dchagin wrote:

In D38933#886562, @kib wrote:

I suspect this is all done in the wrong place/moment. What you need is to specially handle root, not the symlink. Then kern_alternate_path() can go away as well I suspect.

Do you mean by 'specially handle root' changing the root, aka chroot()? I suspect you mean something else.
In my POV kern_alternate_path() is a big issue: it does more unexpected harm than good. However, retiring it is a POLA violation. Or we must have a way to 'mount' some files from /etc into this root.

At least it would be good to change root for the exec() time to guarantee that Linux rtld will not look at non-native directories.

No, I mean a place where the "/" is resolved in namei(). The retry with alternative path should happen somewhere around places near the calls to namei_handle_root().

arrowd added a subscriber: arrowd. Mar 8 2023, 7:21 AM

Redesign. I don't like the interaction between pwd_exec() and the ABI,
because v_usecount has to be incremented in the ABI path, but it seems there
is no other way? I prefer not to use sx instead of rmlock just to call pwd_exec() with sx held.

Here is a comparison between the stock kernel and the modified kernel: https://people.freebsd.org/~dchagin/adir/

Herald added a subscriber: jhibbits. Mar 21 2023, 9:24 PM

Harbormaster completed remote builds in B50503: Diff 119214. Mar 21 2023, 9:24 PM

grahamperrin added a subscriber: grahamperrin. Mar 22 2023, 11:03 PM

kib added inline comments. Mar 24 2023, 12:43 AM

sys/kern/kern_descrip.c
4010 ↗	(On Diff #119214)
4035 ↗	(On Diff #119214)
sys/sys/filedesc.h
94 ↗	(On Diff #119214)

done

Harbormaster completed remote builds in B50577: Diff 119434. Mar 25 2023, 8:16 AM

Back to the char linux_emul_path[], this simplifies the code, adding one namei() call to execve() path.

Linprocfs changes moved to a separate commit

Harbormaster completed remote builds in B50778: Diff 119913. Apr 5 2023, 5:36 PM

dchagin mentioned this in D39438: linprocfs: Rework according to the new struct pwd facility. Apr 5 2023, 5:49 PM

dchagin added a child revision: D39438: linprocfs: Rework according to the new struct pwd facility. Apr 5 2023, 5:50 PM

dchagin mentioned this in D35544: linux(4): Implement xattr syscalls. Apr 5 2023, 5:55 PM

dchagin added a child revision: D35543: vfs: Export exattr methods to reuse by Linuxulator. Apr 5 2023, 5:55 PM

dchagin added a child revision: D35544: linux(4): Implement xattr syscalls.

Something does not add up whatsoever with your bench results -- how is this patch supposed to improve scalability for open/close/unlink? Similarly it can't be faster than stock code for regular lookups, but this one is perhaps measurement error.

That aside the entire idea of Linux binaries doing 2 lookups was incredibly dodgy from the get go and I don't think this is helping the fundamental problem, albeit it may be it makes it less iffy.

The added branchfest is definitely not nice, especially the restart clause.

I strongly suspect the right way is to have linux binaries auto chrooted to /compat/linux or whatever you are looking up against and then have nullfs mounts inside for /home, /tmp and whatever else which makes sense to share. This avoids any suspicious lookups like failing to find a file in Linux because it is missing when it should not and trying to pick up the FreeBSD one. This also avoids adding any complexity to the kernel.

Even if going this route, I think the functionality can be added without pessimizing existing code. Note that for example vfs_lookup is already a standalone routine.

In D38933#897702, @mjg wrote:

Something does not add up whatsoever with your bench results -- how is this patch supposed to improve scalability for open/close/unlink?

This patch is not supposed to improve scalability; I used will-it-scale to check that I did not break the hot path.

That aside the entire idea of Linux binaries doing 2 lookups was incredibly dodgy from the get go and I don't think this is helping the fundamental problem, albeit it may be it makes it less iffy.

The added branchfest is definitely not nice, especially the restart clause.

I strongly suspect the right way is to have linux binaries auto chrooted to /compat/linux or whatever you are looking up against and then have nullfs mounts inside for /home, /tmp and whatever else which makes sense to share. This avoids any suspicious lookups like failing to find a file in Linux because it is missing when it should not and trying to pick up the FreeBSD one. This also avoids adding any complexity to the kernel.

Even if going this route, I think the functionality can be added without pessimizing existing code. Note that for example vfs_lookup is already a standalone routine.

Well, I mostly agree with all of your statements, except for the pessimization part: I tried to minimize touching the hot path for native binaries.
This patch adds only two compares (pwd->pwd_rdir != pwd->pwd_adir) on the error path of namei() for native. I don't think it's worth the cost.
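
The comparison mentioned above can be modelled in isolation. The following is a minimal sketch, assuming that a native process has pwd_adir equal to pwd_rdir and that a lookup is retried at most once; struct pwd_model, struct vnode_model and should_restart_model() are hypothetical names for illustration and do not reproduce the code under review.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced models of the kernel structures. */
struct vnode_model {
	int unused;
};

struct pwd_model {
	struct vnode_model *pwd_rdir;	/* process root directory */
	struct vnode_model *pwd_adir;	/* ABI root; same as pwd_rdir if native */
};

/*
 * Decide whether a failed lookup should be retried against the other root.
 * 'error' is an errno value such as ENOENT; 'restarted' is true once the
 * lookup has already been retried.
 */
static bool
should_restart_model(const struct pwd_model *pwd, int error, bool restarted)
{
	assert(pwd != NULL);
	if (error == 0 || restarted)
		return (false);
	/* For a native process the two pointers are equal: never restart. */
	return (pwd->pwd_adir != pwd->pwd_rdir);
}
```

For a native process the function reduces to a pointer comparison that fails, so a failed lookup pays for the comparison and nothing else in this model.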

Implementing the things you are talking about requires rewriting our Linux emulation ports infrastructure, as the ports install files into the base, not into the ABI directory.

dchagin mentioned this in D35543: vfs: Export exattr methods to reuse by Linuxulator. Apr 5 2023, 7:09 PM

In D38933#897717, @dchagin wrote:

In D38933#897702, @mjg wrote:

Something does not add up whatsoever with your bench results -- how is this patch supposed to improve scalability for open/close/unlink?

This patch is not supposed to improve scalability; I used will-it-scale to check that I did not break the hot path.

I am saying that according to the graph it did improve, markedly so, but this can't be true and consequently the bench is bogus.

That aside the entire idea of Linux binaries doing 2 lookups was incredibly dodgy from the get go and I don't think this is helping the fundamental problem, albeit it may be it makes it less iffy.

The added branchfest is definitely not nice, especially the restart clause.

I strongly suspect the right way is to have linux binaries auto chrooted to /compat/linux or whatever you are looking up against and then have nullfs mounts inside for /home, /tmp and whatever else which makes sense to share. This avoids any suspicious lookups like failing to find a file in Linux because it is missing when it should not and trying to pick up the FreeBSD one. This also avoids adding any complexity to the kernel.

Even if going this route, I think the functionality can be added without pessimizing existing code. Note that for example vfs_lookup is already a standalone routine.

Well, I mostly agree with all of your statements, except for the pessimization part: I tried to minimize touching the hot path for native binaries.
This patch adds only two compares (pwd->pwd_rdir != pwd->pwd_adir) on the error path of namei() for native. I don't think it's worth the cost.

It adds a branch to set things up and another one for failed lookups. clang probably also pessimized namei entry, which already is quite bad. Perhaps I should note there are several single-threaded slowdowns remaining, most of them branches, and this goes counter to whacking them.

I just realized your restart label is early enough that it virtually counts as a separate namei call, including a repeated copy of the buffer et al. If paying that cost, you can *avoid* modifying any of it anyway and instead call namei with a faked pwd -- it would have all the usual vnodes *and* the one for /compat/linux sneaked in as root. Should this fail to produce the result, you use the real pwd. et voila, changing namei avoided.

Another note is that should this kind of double lookup need to be optimized, the way to do it for lockless lookup would be to patch the case which finds a negative entry to check if perhaps another variant is needed. This would avoid *any* work for lookups which do succeed. Anyhow so far it looks like you should be fine rolling with *calling* namei twice.

Implementing the things you are talking about requires rewriting our Linux emulation ports infrastructure, as the ports install files into the base, not into the ABI directory.

The current state is a mess and this sounds like an opportunity to sort it out, if feasible.

I was so startled by the supposed scalability diff I did not take a proper look at the other results.

700k ops/s is abysmal; perhaps your CPU is shafted with Meltdown, or you are running a debug kernel, or fast path lookup is disabled in your case.

What's more important is that it flattens for scalability, which would not happen for separate file case. I just verified it scales almost linearly up to 52 workers (i don't have more cores on my test box).

tl;dr the test is defo wrong

In D38933#897702, @mjg wrote:

I strongly suspect the right way is to have linux binaries auto chrooted to /compat/linux or whatever you are looking up against and then have nullfs mounts inside for /home, /tmp and whatever else which makes sense to share. This avoids any suspicious lookups like failing to find a file in Linux because it is missing when it should not and trying to pick up the FreeBSD one. This also avoids adding any complexity to the kernel.

This functionality (double lookup in the ugly current form) was added exactly to avoid requiring users doing what you described above.

sys/kern/kern_descrip.c
4035 ↗	(On Diff #119913)	After re-reading, _unexec is the weird name. Could you merge pwd_exec with pwd_unexec, indicating unexec case by vp == NULL?

done

Harbormaster completed remote builds in B50789: Diff 119938. Apr 6 2023, 11:46 AM

In D38933#897720, @mjg wrote:

In D38933#897717, @dchagin wrote:

In D38933#897702, @mjg wrote:

Something does not add up whatsoever with your bench results -- how is this patch supposed to improve scalability for open/close/unlink?

This patch is not supposed to improve scalability; I used will-it-scale to check that I did not break the hot path.

I am saying that according to the graph it did improve, markedly so, but this can't be true and consequently the bench is bogus.

That aside the entire idea of Linux binaries doing 2 lookups was incredibly dodgy from the get go and I don't think this is helping the fundamental problem, albeit it may be it makes it less iffy.

The added branchfest is definitely not nice, especially the restart clause.

I strongly suspect the right way is to have linux binaries auto chrooted to /compat/linux or whatever you are looking up against and then have nullfs mounts inside for /home, /tmp and whatever else which makes sense to share. This avoids any suspicious lookups like failing to find a file in Linux because it is missing when it should not and trying to pick up the FreeBSD one. This also avoids adding any complexity to the kernel.

Even if going this route, I think the functionality can be added without pessimizing existing code. Note that for example vfs_lookup is already a standalone routine.

Well, I mostly agree with all of your statements, except for the pessimization part: I tried to minimize touching the hot path for native binaries.
This patch adds only two compares (pwd->pwd_rdir != pwd->pwd_adir) on the error path of namei() for native. I don't think it's worth the cost.

It adds a branch to set things up and another one for failed lookups. clang probably also pessimized namei entry, which already is quite bad. Perhaps I should note there are several single-threaded slowdowns remaining, most of them branches, and this goes counter to whacking them.

It adds only two comparisons and conditional jumps for the native binary (namei does not restart for native).

I just realized your restart label is early enough that it virtually counts as a separate namei call, including a repeated copy of the buffer et al. If paying that cost, you can *avoid* modifying any of it anyway and instead call namei with a faked pwd -- it would have all the usual vnodes *and* the one for /compat/linux sneaked in as root. Should this fail to produce the result, you use the real pwd. et voila, changing namei avoided.

Do you propose to preserve the ugly kern_alternate_path way?

Another note is that should this kind of double lookup need to be optimized, the way to do it for lockless lookup would be to patch the case which finds a negative entry to check if perhaps another variant is needed. This would avoid *any* work for lookups which do succeed. Anyhow so far it looks like you should be fine rolling with *calling* namei twice.

Interesting, if you don't mind, could you please tell me where to start there, for lockless lookup part?

Implementing the things you are talking about requires rewriting our Linux emulation ports infrastructure, as the ports install files into the base, not into the ABI directory.

The current state is a mess and this sounds like an opportunity to sort it out, if feasible.

Our ports are still centos-7 (EOL in 2024), there is no one on the horizon who is willing to do anything with that

You can implement namei_altroot(struct nameidata *nd, struct vnode *altroot) (or whatever the name) where altroot is guaranteed v_usecount > 0. Then it can handle faking pwd for the first pass without polluting any consumers.

you could store the vnode in that linux-specific struct

also note this avoids growing struct pwd which right now is 32 bytes, fitting very nicely uma
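
The namei_altroot() idea above can be sketched as a wrapper that builds a temporary copy of the pwd with the alternate root swapped in, and falls back to the real pwd when the first pass fails. This is only an illustration under assumptions: the structures are reduced models, the lookup itself is passed in as a callback, falling back only on ENOENT is an assumption made here, and namei_altroot_model() and toy_lookup() are hypothetical names.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, reduced models of the kernel structures. */
struct vnode_model {
	const char *name;
	int usecount;
};

struct pwd_model {
	struct vnode_model *pwd_cdir;	/* current directory */
	struct vnode_model *pwd_rdir;	/* root directory */
};

typedef int (*lookup_fn)(const struct pwd_model *pwd, const char *path);

/* First pass with a faked pwd rooted at 'altroot', second with the real one. */
static int
namei_altroot_model(lookup_fn lookup, const struct pwd_model *real,
    struct vnode_model *altroot, const char *path)
{
	struct pwd_model fake;
	int error;

	assert(altroot->usecount > 0);	/* caller keeps the vnode referenced */
	fake = *real;			/* same vnodes as the real pwd ... */
	fake.pwd_rdir = altroot;	/* ... but the alternate root as "/" */
	error = lookup(&fake, path);
	if (error != ENOENT)
		return (error);
	return (lookup(real, path));	/* the real pwd is left untouched */
}

/* Toy lookup for demonstration: one file exists under each root. */
static int
toy_lookup(const struct pwd_model *pwd, const char *path)
{
	if (strcmp(pwd->pwd_rdir->name, "linux") == 0 &&
	    strcmp(path, "/bin/sh") == 0)
		return (0);
	if (strcmp(pwd->pwd_rdir->name, "native") == 0 &&
	    strcmp(path, "/etc/passwd") == 0)
		return (0);
	return (ENOENT);
}
```

Because the faked pwd lives on the caller's stack, the model neither grows struct pwd nor changes what other consumers of the real pwd see.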

kib added inline comments. Apr 7 2023, 4:42 AM

sys/kern/kern_descrip.c
4028 ↗	(On Diff #119938)	But why? Cannot the process be chrooted?
4038 ↗	(On Diff #119938)	Same.
sys/kern/vfs_lookup.c
691	Should this check be more specific, e.g. only if the file was not found, instead of some other errors?

kib added inline comments. Apr 7 2023, 4:49 AM

sys/kern/kern_descrip.c
4028 ↗	(On Diff #119938)	I think I understand what you are concerned with. You want to prevent the chroot escape? Then perhaps, if the process is chrooted, it should get pwd_adir set to NULL. BTW, what about jailed processes?

Done

Harbormaster completed remote builds in B50805: Diff 119981. Apr 8 2023, 9:16 AM

dchagin added inline comments. Apr 8 2023, 9:17 AM

sys/kern/kern_descrip.c
4028 ↗	(On Diff #119938)	I want to minimize the number of checks for setting up and restarting namei() for the native ABI, so that a comparison of pwd_adir vs pwd_rdir would be sufficient. So pwd_adir should be changed if the process is jailed or if not chrooted. Also, a chrooted ABI process should act like a native process from the namei() perspective, i.e., it should not restart namei(). This condition could be checked in pwd_exec() or on the ABI side. The first is not efficient due to pwd_alloc(), so I moved it to the ABI, into linux_pwd_onexec(). That is why asserts were used to guarantee proper usage of pwd_exec(). Removed now. Jails fixed. However, jexec $jail chroot /compat/$abi /bin/bash is not supposed to work in my POV, or should it?

kib added inline comments. Apr 9 2023, 9:53 PM

sys/kern/kern_descrip.c
4028 ↗	(On Diff #119938)	So there are two issues. We need to ensure that there is no jail or chroot escape. The pwd_adir must point either to the jail/chroot root, or to the new /compat/linux, after the op. Before the patch, /compat/linux was evaluated for each lookup. In particular, after the chroot, if the new chroot has its own /compat/linux, it worked. Also, if you changed /compat/linux, it also worked immediately.

In D38933#897702, @mjg wrote:

I strongly suspect the right way is to have linux binaries auto chrooted to /compat/linux or whatever you are looking up against and then have nullfs mounts inside for /home, /tmp and whatever else which makes sense to share. This avoids any suspicious lookups like failing to find a file in Linux because it is missing when it should not and trying to pick up the FreeBSD one. This also avoids adding any complexity to the kernel.

FWIW, https://reviews.freebsd.org/D25501

In D38933#897702, @mjg wrote:

I strongly suspect the right way is to have linux binaries auto chrooted to /compat/linux

No, please, let's not do that. It'd be impossible to run some Linux text editor that accesses your $HOME.

Reworked, allow emul_path in chroot

Harbormaster completed remote builds in B50977: Diff 120514. Apr 17 2023, 10:28 PM

dchagin added inline comments. Apr 17 2023, 10:30 PM

sys/kern/kern_descrip.c
4028 ↗	(On Diff #119938)	So there are two issues. We need to ensure that there is no jail or chroot escape. The pwd_adir must point either to the jail/chroot root, or to the new /compat/linux, after the op.

Sure, pwd_adir is initialized in pwd_chroot() or pwd_chroot_chdir() unconditionally, so it is fully consistent with what you have written. And to call pwd_exec(), namei() is used to look up the adir vnode, so it can't escape the jail or chroot. And this is a problem for #2 )))

Before the patch, /compat/linux was evaluated for each lookup. In particular, after the chroot, if the new chroot has its own /compat/linux, it worked.

Reworked, look at linux_pwd_onexec(), please; now if emul_path exists in the chroot or jail, it is used. Thank you

Also, if you changed /compat/linux, it also worked immediately.

After the patch the running process will continue execution with the proper environment. I like this behaviour.

PR: 72920

Herald added a subscriber: riscv. Apr 26 2023, 8:45 PM

rebase to main

Harbormaster completed remote builds in B51251: Diff 121184. Apr 28 2023, 9:02 AM

So there is still a case. Imagine that Linux process is chrooted into a subtree with its own '/compat/linux'. It does not start using this new adir. Might be it needs a namei() in chroot (or rather it should be sysent method?).

Otherwise, it looks fine to me, mostly. It would be easier to proceed if you split this patch into _many_ pieces. For instance, you could move all Linux changes into separate commit. Same for the removal of sv_imgact_try. Not sure about core namei() changes.

split

Harbormaster completed remote builds in B51483: Diff 121935. May 14 2023, 5:55 PM

dchagin retitled this revision from vfs: Allow ABI to translate symlinks according to the ABI prefix to namei: Add the abilty for the ABI to specify an alternate root path. May 14 2023, 6:06 PM

dchagin edited the summary of this revision. (Show Details)

dchagin added child revisions: D40090: linux(4): Use pwd_exec() to tell namei() about ABI root path, D40092: sysentvec: Retire sv_imgact_try as unneeded anymore, D40091: Brandinfo: Retire emul_path as unneeded anymore, D40093: vfs: Retire kern_alternate_path() as unused anymore.

In D38933#910452, @kib wrote:

So there is still a case. Imagine that Linux process is chrooted into a subtree with its own '/compat/linux'. It does not start using this new adir. Might be it needs a namei() in chroot (or rather it should be sysent method?).

ugh, now this code in https://reviews.freebsd.org/D40090

I'm sorry, I can't imagine how it can; linux_pwd_exec calls namei() unconditionally, and returns the namei error only if not in a chroot (jail).
I'd say more: now one can do jexec chroot /compat/ubuntu or jexec /compat/ubuntu/bin/bash, i.e., fully isolate the Linuxulator.

Otherwise, it looks fine to me, mostly. It would be easier to proceed if you split this patch into _many_ pieces. For instance, you could move all Linux changes into separate commit. Same for the removal of sv_imgact_try. Not sure about core namei() changes.

done,

kib accepted this revision. May 24 2023, 5:54 AM

kib added inline comments.

sys/kern/kern_descrip.c
4013 ↗	(On Diff #121935)	Name the 'vp' parameter more vividly, to indicate that this is the alternate root. Perhaps pwd_exec() should also be given a more expressive name.

This revision is now accepted and ready to land. May 24 2023, 5:54 AM

Closed by commit rG3d2fec7db856: namei: Add the abilty for the ABI to specify an alternate root path (authored by dchagin). May 29 2023, 8:20 AM

This revision was automatically updated to reflect the committed changes.

dchagin added a commit: rG3d2fec7db856: namei: Add the abilty for the ABI to specify an alternate root path.

trasz mentioned this in D25501: Autochroot prototype. Jun 7 2023, 12:28 PM

Revision Contents
Changeset List

Diff 118420

sys/amd64/linux/linux_sysvec.c

sys/amd64/linux32/linux32_sysvec.c

sys/arm64/linux/linux_sysvec.c

sys/compat/linux/linux_util.h

sys/compat/linux/linux_util.c

sys/i386/linux/linux_sysvec.c

sys/kern/imgact_elf.c

sys/kern/vfs_lookup.c

sys/sys/namei.h

sys/sys/sysent.h
