Page MenuHomeFreeBSD

Add MNT_EXJAIL to differentiate between exports in prisons vs prison0
ClosedPublic

Authored by rmacklem on Jan 21 2023, 10:04 PM.
Tags
None
Referenced Files
Unknown Object (File)
Thu, Nov 7, 2:44 PM
Unknown Object (File)
Tue, Nov 5, 10:21 AM
Unknown Object (File)
Wed, Oct 16, 7:54 AM
Unknown Object (File)
Oct 4 2024, 11:11 PM
Unknown Object (File)
Oct 3 2024, 1:22 AM
Unknown Object (File)
Oct 2 2024, 6:36 PM
Unknown Object (File)
Oct 2 2024, 6:09 AM
Unknown Object (File)
Oct 1 2024, 1:43 PM
Subscribers

Details

Summary

mountd(8) basically does the following:

getmntinfo()
for each mount
      delete_exports

using nmount(2) to do the creation/deletion
of individual exports.

This works ok inside a prison, since getmntinfo()
only returns mounts visible within the prison and
the fspath argument to nmount(2) only works for
mounts within the prison's directory tree.

However, for prison0, this is a problem, since
getmntinfo() returns all mounts, including ones
in prisons.

By using a single bit in mnt_flag to indicate if a
export was done from within a prison, the addition
and deletion of exports done within a prison can
be avoided by mountd(8) when runniing in prison0
and vice versa.

This allows mountd(8) to be run both within prison(s)
and in prison0.

A more complex patch would be required to record
exactly which prison did the exports and I do not
believe that level of detail is needed at this time.

Test Plan

Tested in a system patched to allow mountd(8) to
run in vnet prisons.

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

Added include of proc.h so it will build
for non-VIMAGE kernels.

This works ok inside a prison, since getmntinfo() only returns mounts visible within the prison

Isn't this only true when the enforce_statfs jail parameter is set to 1? If it's set to 0, for example, the paths returned by getmntinfo() will be prefixed by the jail root path, which I suspect would confuse mountd. Does your work somehow handle this possibility?

A more complex patch would be required to record exactly which prison did the exports and I do not believe that level of detail is needed at this time.

The reasoning is that a credential would hold on to a reference to the prison, and that could prevent the prison from being destroyed?

A problem with using a single bit is that there's nothing preventing multiple jails from sharing mount points. I think it's reasonable to expect legitimate uses of jailed nfsd to simply avoid this possibility, but are there any security implications when NFSD-enabled jails share a mountpoint?

W.r.t. enforce_statfs...

enforce_statfs == 2 (the default)
--> works fine, but only allows the jail's root fs to be exported
enforce_statfs == 1
--> should allow file systems mounted within the jail to be

exported (although I have not yet tested this, since my test
setup does not have a convenient file system to mount inside
of a jail)

enforce_statfs == 0
--> Will cause confusion. For my testing, the mountd(8) running

in the jail ends up failing to do exports and just logs the
failures via syslog.
--> Is it possible that this could result in an export outside of
      a jail to be broken? I do not think so, since nmount(2)'s
      fspath is looked up relative to the jail's root, but I am not
      100% sure?

First, this needs to be documented. Probably adding a sentence or
two to the description of "allow.nfsd" in jail.8.

Second, I think the best thing is to come up with a patch for kern_jail.c that
prohibits the use of "allow.nfsd" with "enforce_statfs==0".
--> The only use for this case might be running mountd(8) only in

prison0, but having nfsd running in jail(s) with enforce_statfs==0.
- I do think it is better to just not allow this than trying to support
   the above odd case.

What do others think?

On jails sharing file systems...
prison_check_nfsd() makes sure that the prison's root is an fs mount
point. As such, there are only two ways multiple jails can share the
same fs.
1 - Use exactly the same root fs. (I have trouble seeing why anyone

would do this?)

2 - Multiple jails mount the same fs within the jails. Also seems strange

to me, but it only breaks exports if multiple jails try to export it.

I think documenting "don't do this" is sufficient. It the patch were expanded
to identify exactly which jail (by making the ident 64bits instead of 1 bit,
all it would do is result in error messages being logged by mountd(8) for
badly configured cases. (Both #1 and #2 above seem so weird that I just
do not see it worth the effor

W.r.t. enforce_statfs...

enforce_statfs == 2 (the default)
--> works fine, but only allows the jail's root fs to be exported
enforce_statfs == 1
--> should allow file systems mounted within the jail to be

exported (although I have not yet tested this, since my test
setup does not have a convenient file system to mount inside
of a jail)

enforce_statfs == 0
--> Will cause confusion. For my testing, the mountd(8) running

in the jail ends up failing to do exports and just logs the
failures via syslog.
--> Is it possible that this could result in an export outside of
      a jail to be broken? I do not think so, since nmount(2)'s
      fspath is looked up relative to the jail's root, but I am not
      100% sure?

First, this needs to be documented. Probably adding a sentence or
two to the description of "allow.nfsd" in jail.8.

Second, I think the best thing is to come up with a patch for kern_jail.c that
prohibits the use of "allow.nfsd" with "enforce_statfs==0".
--> The only use for this case might be running mountd(8) only in

prison0, but having nfsd running in jail(s) with enforce_statfs==0.
- I do think it is better to just not allow this than trying to support
   the above odd case.

What do others think?

That sounds reasonable to me. It's better to be restrictive and limit the set of cases that we handle. If a legitimate use case arises later we can perhaps find a way to relax this rule.

On jails sharing file systems...
prison_check_nfsd() makes sure that the prison's root is an fs mount
point. As such, there are only two ways multiple jails can share the
same fs.
1 - Use exactly the same root fs. (I have trouble seeing why anyone

would do this?)

2 - Multiple jails mount the same fs within the jails. Also seems strange

to me, but it only breaks exports if multiple jails try to export it.

I think documenting "don't do this" is sufficient. It the patch were expanded
to identify exactly which jail (by making the ident 64bits instead of 1 bit,
all it would do is result in error messages being logged by mountd(8) for
badly configured cases. (Both #1 and #2 above seem so weird that I just
do not see it worth the effor

I can't think of any instances of this offhand, but I'm not sure it's that weird. I can imagine having a set of network services which are isolated from each other using jails but still share a read-only root filesystem perhaps. I'm not sure that it makes a lot of sense but people are creative.

So long as we don't think there are security bugs lurking there (or if there are, then they're identified and documented), then I think "don't do that" is perfectly reasonable.

rmacklem added a reviewer: jamie.

This version uses a 64bit ident so that the jails
can be differentiated. It would handle the case
of multiple prisons (not prison0) trying to export
the same file system.

  • It adds a 64bit field to both "struct prison" and "struct mount".

I'll let others decide if this is preferred over the 1bit
version.

I noticed that there are old versions of "struct prison"
in sys/jail.h. Is this necessary when "struct prison" is
revised?

I noticed that there are old versions of "struct prison"
in sys/jail.h. Is this necessary when "struct prison" is
revised?

Those are only for the user interface (struct jail to set, struct xprison to get). The modern user interface isn't based on per-jail structs at all, so those are now frozen in time.

sys/sys/jail.h
209

This would fit better after pr_hostid, since it's also a 64-bit int. Unfortunately, neither of the pr_spare arrays are suitable for a long int.

I don't especially like the name pr_ident, which isn't readily differentiated from pr_id. Perhaps pr_permid? In a perfect world, I wouldn't have allowed prison IDs to be reused, but it's too late to stuff that horse back into the barn. Probably - I have considered actually changing the policy, but it would have POLA problems and would need a slow rollout with proper deprecation warnings.

There's already a struct prison pointer in mnt_cred. Shouldn't that do? A removed prison would already necessitate clearing/changing that cred, so there's no problem of address reuse.

Rename pr_ident to pr_permid and put it after
pr_hostid, as suggested by jamie@.
(Although I will note that "unsigned long" is
only 64bits on some arches. It is one of my pet
peives w.r.t. ANSI C.)

W.r.t. "what about mnt_cred?". Most of the
complexity in D37741, D38144 is needed because
mountd(8) running in a prison needs to export
a file system mounted outside of the prison
(but visible within the prison's directory tree,
such as the prison's root fs).
As such, the credentials used for the mount are
not the credentials used when nmount(2) is
used by mountd(8) to update exports on the
mount point.

sys/kern/vfs_export.c
332

It seems like removing a jail that has exported any filesystems renders those exports permanent. prison0 can't do it because the jail_permid is different, and no jail with that permid exists (or ever will again).

I had planned on adding a check for jail is alive,
but it slipped through the cracks. Thanks Jamie@
for catching this.

This version of the patch uses a function I called
prison_isalive_permid() to check if the prison
identified by the permid is still alive.
This way, vfs_export() will allow deletion/update
of the exports after the prison that did them is
dying.

I also set LK_NOWITNESS on mnt_explock, since
the code generates a harmless LOR (avoided by
not doing the call to prison_isalive_permid() for
the case where allprisons is already held (during
cleanup of a jail when it is dying).

The extra argument to vfs_export() does this by
allowing the checks for same prison to be overridden.

rmacklem added inline comments.
sys/kern/vfs_export.c
332

prison_check_alive should allow the exports to
be deleted/updated after the prison is dying.

The alive check raises a question: what if instead of a loop in vfs_export that checks prisons, how about a loop in prison_cleanup that checks exports? It could go through the mount list and any exports that belong to it can be refiled for the parent.

The advantage is it obviates the need for pr_permid, since at that point, a prison pointer would be a valid indicator. The disadvantage is there will always be mount points to check on jail removal, but there won't always be jails to check on exporting. And for that matter, in system where you're using VNET NFS, I would always expect more mount points than jails.

sys/kern/vfs_export.c
302

Why the choice on prison_check_alive? Is there a time we wouldn't want it?

W.r.t. jamie@'s comment....

Exports get updated at any time via vfs_export().
Basically whenever a sysadmin updates /etc/exports
and sends a SIGHUP to mountd. (Or kills/restarts
mountd.)

The prison_cleanup() case is an outlier. I thought of
trying to clear out exports done by a prison when
that prison terminates, but the locking related to
the mountlist, etc. is not simple.
It goes something like...

  • Find a mount in the list that has exports for the jail.
  • vfs_busy that mount point
  • drop the mountlist mutex so you can grab the

lockmgr lock called mnt_explock and get rid of the exports

  • vfs_unbusy the mount point
  • grab the mountlist mutex and look for another one

done in a loop until none found.

Complicated and possibly racey. (What happens when
processes like mountd change exports while the above
is in progress. I suppose you can say it will eventually
complete once all of the ones done by the dying prison
have been deleted, but?? I know of an NFS server that
exports more than 89,000 file systems, although obviously
not in jails. However, all of those 89,000 file systems are
in the mount list and if you traverse that list many times
searching for the ones to delete...

Just doing the checks in vfs_export() and allowing the
exports to be changed when the prison is no longer alive
is just soo much easier.

mnt_exjail/pr_permid is also used by the nfsd kernel code
to decide if an export is usable by an nfsd thread
ie. Different prison means not exported for this nfsd.
I suppose if the exports were all deleted when the prison
is dying, pr_id would be sufficient for this, but as noted above,
I don't want to try and delete all the exports done by a
prison when it dies.

sys/kern/vfs_export.c
302

Well, for this patch, I was calling vfs_export() in
nfsrv_cleanup() { called via osd when the prison
is dying }. As such, the allprisons lock was already held.

It turns out that the LOR this causes is not harmless
(A mountd process running elsewhere could deadlock
with a "jail -r".) As such, I am now testing a different
patch which doesn;t call vfs_export(). This means that
the additional argument is no longer needed.

I will be uploading an updated patch once I have it tested.

I was wrong. The LOR was significant and could cause
a deadlock. To avoid the LOR, this version of the patch
makes vfs_free_addrlist() global so it can be used to
clean up the NFSv4 root exports.
This avoids the LOR caused by calling vfs_export() from
the ods method, where the allprisons lock was already held.

The extra argument to vfs_export() was no longer needed,
so that was removed.

I also added a new (e) lock definition to the struct mount
comment, that indicates mnt_export and mnt_exjail are
protected by mnt_explock.

sys/kern/vfs_export.c
292

Perhaps this should clear the netc_anon pointer for good measure, to avoid a dangling pointer?

324

Do you actually need the special case for permid == 1?

329

Can this assertion be triggered by usermode simply by trying to un-export an unexported mount point?

362–368

Should the error paths below be clearing mnt_export before returning? Otherwise we can end up with a mount point that has mnt_export != NULL && mnt_exjail == 0.

Made a couple of changes based on markj@'s comments.

For some freekin reason, markj@'s comments are no
longer inline and I can't figure out how to get them back in.
Here's the replies I would have typed inline:

  • Now sets netc_anon = NULL, although "nep" gets

free'd right after the call.

  • Special case of permid == 1 is not needed, but since permid == 1 refers to prison0, it avoids the prison_isalive_permid(), which locks/unlocks allprisons sx.
  • Last two are really the same problem. The simple fix was to set

mnt_exjail as soon as nep is malloc()'d. This way, if nep != NULL,
mnt_exjail != 0. (Alternately, the code could use MNT_EXPORTED instead
of "nep" checked against NULL, but that would require MNT_ILOCK()s
around the checks.

I forget you can have the case when nep != NULL, but there is no
addrlist hanging off it. (That happens when there are "default"
export options that cover all addresses.)

For some freekin reason, markj@'s comments are no
longer inline and I can't figure out how to get them back in.

Sorry, no idea what that's about. Maybe the number of comments on the review went over some threshold.

Here's the replies I would have typed inline:

  • Now sets netc_anon = NULL, although "nep" gets

free'd right after the call.

  • Special case of permid == 1 is not needed, but since permid == 1 refers to prison0, it avoids the prison_isalive_permid(), which locks/unlocks allprisons sx.

I think that should be handled within prison_isalive_permid() then.

  • Last two are really the same problem. The simple fix was to set

mnt_exjail as soon as nep is malloc()'d. This way, if nep != NULL,
mnt_exjail != 0. (Alternately, the code could use MNT_EXPORTED instead
of "nep" checked against NULL, but that would require MNT_ILOCK()s
around the checks.

I forget you can have the case when nep != NULL, but there is no
addrlist hanging off it. (That happens when there are "default"
export options that cover all addresses.)

So now, when trying to set MNT_EXPORTED, if MNT_EXPUBLIC is set and vfs_setpublicfs() fails, isn't it incorrect to leave nep != NULL? Or maybe it's ok because MNT_EXPORTED hasn't been set yet? I'm not sure. There's a lot of different state to keep track of. :)

sys/kern/vfs_export.c.new
385 ↗(On Diff #116241)

Is this assignment still needed? It doesn't look like it.

I replied to the comments without updating the patch,
just in case. After I hear if you think if freeing *nep
would make it more readable, I will update the patch.

I will move the check for permid == 1 and permid == 0
into prison_isalive_permid().

sys/kern/vfs_export.c.new
384 ↗(On Diff #116241)

For this case (and for the vfs_setpublicfs() above)
failing for a newly allocated *nep, all that happens
is the structure is all 0s, which means nothing exported
yet.

If you think it will make the code more readable, I can
add code to free a newly allocated *nep for this case.
Do you think that would be better?

385 ↗(On Diff #116241)

Yes, for the case where the prison is no
longer alive and is now being exported
by a different prison.

With the prison_isalive_permid() tweak, I think this is ok.

sys/kern/vfs_export.c.new
384 ↗(On Diff #116241)

If leaving it allocated doesn't have any problematic side effects, then I think it's ok to leave it. I slightly prefer to do the extra cleanup, even if it's unnecessary, since that just leaves us with fewer cases to reason about, but I don't feel strongly about it.

385 ↗(On Diff #116241)

I see now, thanks.

This revision is now accepted and ready to land.Feb 2 2023, 5:17 PM

Just fyi, I did add code that free'd new and set mnt_export = NULL
for the two error cases, if nep is a newly allocated one with no
export information in it.

I'm rather confused by this patch, I think it only perpetuates the old anti-idiom of storing ids instead of target objs, which I suspect showed up as a replacement for even older idea of storing pointers and checking generation IDs.

Anyhow, the key I don't like here is the added linear scan in prison_isalive_permid, especially since it looks trivially avoidable.

As the mnt_exjail thing is now u64, you may as well store mnt_excred instead. Access could be protected with vfs_busy/unbusy or even mnt_mtx.

then this:

jail_permid = curthread->td_ucred->cr_prison->pr_permid;
lockmgr(&mp->mnt_explock, LK_EXCLUSIVE, NULL);
nep = mp->mnt_export;
prison_alive = prison_isalive_permid(mp->mnt_exjail);

becomes this:

lockmgr(&mp->mnt_explock, LK_EXCLUSIVE, NULL);
nep = mp->mnt_export;
mtx_lock(&mp->mnt_mtx);
matchingprison = curthread->td_ucred->cr_prison == mp->mnt_excred->cr_prison;
mtx_unlock(&mp->mnt_mtx);

And how does storing cr_prison in mnt_excred work after
the prison is removed?

If prison removal requires a search through the mountlist,
I think that could be a lot worse than the linear search
through the prison list.

I know of an NFS server with over 89,000 file systems exported,
which means they are all in the mountlist.

Just to clarify this, since what I said in the last comment
was not well worded.

If there are references to a cred for a prison left in mnt_excred,
then "jail -r" will get stuck in dying state, as I understand prisons.

The code in kern_jail.c that does "jail -r" plus other things
is way too convoluted for me to try and figure out where
a run through mountlist, getting rid of all those references
can safely be done. It must be at a point where the "jail -r"
will not fail leaving the jail running, but before it will get stuck
in "dying" state, due to the cred reference. (I assume that "jail -r"
is a rare event, so scanning the mountlist for these will not be
an issue even when there are 89,000 of them.)

If jamie@ or yourself can figure out where/how to derefernce
the mnt_excred credentials during "jail -r", then I have no problem
switching mnt_exjail from being a 64bit id to being a "struct ucred *".

Note that calls to vfs_export() only happen when mountd is started
up and for exports that have been changed in /etc/exports when
mountd is given a SIGHUP, so the linear scan through the jail list
should not happen that frequently.

Oh, I remember jamie@ mentioning the jail cleanup
method.

Would the call to the PR_METHOD_REMOVE function
be the right place to get rid of the credential references
in the mountlist?

If so, I can now see how that could work.
Before I had mistakenly thought I would need to hold
the mnt_explock to get rid of mnt_excred credential
references, but I can see that can be done with just
mutexes held, so it is straightforward, if doing it in
the PR_METHOD_REMOVE function is the correct time?

Oh, I remember jamie@ mentioning the jail cleanup
method.

Would the call to the PR_METHOD_REMOVE function
be the right place to get rid of the credential references
in the mountlist?

Right, almost. As you mentioned in D38371, directly in prison_cleanup() is probably better.

Although it no longer allows this to be abandoned,
I consider it abaondoned and the commit has been
reverted.

I think D38371 is a better solution.