Details

Reviewers

markj
kib
asomers

Commits

rG4bbbd5875d32: vfs_mount.c: Allow mountd(8) to do exports in a vnet prison

Summary

To run mountd in a vnet prison, three checks in vfs_domount()
and vfs_domount_update() related to doing exports needed
to be changed.

I did all three in a minimal way, only changing the checks for
the specific case of a process (typically mountd) doing exports
within a vnet prison.
(For example, I suspect VFS_MOUNT() never needs to be called
when doing exports, but I only disabled the call for the "in jail"
case. It is highly unlikely, but conceivable, that an "out of source tree"
file system or application doing exports might need the VFS_MOUNT()
call when doing exports.)

All three cases are commented and #ifdef'd VNET_NFSD.

Test Plan

Running mountd both inside and outside of a jail
when built with VNET_NFSD has been done and
appears to work correctly for ZFS and UFS exports.
(I have not had the chance to test exports of other
file system types yet, but since there does not
appear to be references to MNT_EXPORTED in
the VFS_mount() call for any in tree file systems,
I doubt there will be a problem. I also doubt that
anyone will need to export other file systems than
UFS or ZFS from a vnet prison.)

Diff Detail

Repository

rG FreeBSD src repository

Lint

Lint Not Applicable

Unit

Tests Not Applicable

Event Timeline

rmacklem created this revision.Dec 18 2022, 9:13 PM

Herald added a subscriber: imp. · View Herald TranscriptDec 18 2022, 9:13 PM

rmacklem requested review of this revision.Dec 18 2022, 9:13 PM

VNET_NFSD is not defined as a config option. I see it is already used in kern_jail.c. Perhaps it should be added to conf/options?

But I do not quite understand why do you need it at all. The change in behavior is already gated by prison_check_nfsd(), so even if you are worried about regressions, user needs to perform the manual action to even get at this code path. IMO the #ifdef only adds a (minor) inconvenience to users, by forcing them to add another option to custom configs.

Well, this patch is just one small part of the patch in
D37519 which, if you look, is rather large.

The main reason for VNET_NFSD is that D37519 without
this will bloat the vnet structure by adding a lot of vnet
"variables" for all VIMAGE builds (including GENERIC).

When adding vnet support to mountd/nfsd was discussed
on freebsd-current, there was concern w.r.t. this bloat.

So, since few will actually want to run mountd/nfsd in
a vnet jail, I coded it so the globals only get added to
the vnet if VNET_NFSD is defined as a build option.
(If you take a look at the new file called nfsdvnet.h
in D37519, you'll see that it basically turns VNET()
and friends into no-ops unless VNET_NFSD is
defined.)
--> Also, if you always make the globals a part of

the vnet if VIMAGE is defined, nfsd.ko won't
load, since it complains the "vnet is out of space".
(I recall you are one of the people that likes to
 load nfsd.ko instead of building a kernel with
 options NFSD.)

--> Because of this, VNET_NFSD also requires VIMAGE

and NFSD. This is checked at compile time in
D37519.

D37519 does define VNET_NFSD as a global kernel
option, but I was not intending on adding that until
all of D37519 is committed, since it will only work then.

I am trying to get the small parts of D37519 that touch
code outside of the krpc/nfsd reviewed separately, so
they can go into main first.
Although you are welcome to review all of D37519, I
doubt you'll want to look at the many:

variable-->VNET(variable)

changes, etc.

I just updated the diff in D37519 so that the arrays
and large structures in the vnet are malloc'd
(the vnet variable becomes just a pointer to them).
With that change, the size of the vnet'd variables is
about 300bytes and nfsd.ko will now load.

As such, my original reasons for having VNET_NFSD
are gone. However, bz@ indicates he would prefer
it to remain for a couple of reasons:

It isolates any possible security vulnerability introduced by the changes.
It makes it easy to find the changes.

However, he doesn't sound really "strongly for it", so
if you really think VNET_NFSD should go away now, I
suppose I can do that.
(It will be easy to get rid of later and I suppose we have
about 4 months to decide if VNET_NFSD should be in FreeBSD14.)

In D37741#858893, @rmacklem wrote:

I just updated the diff in D37519 so that the arrays
and large structures in the vnet are malloc'd
(the vnet variable becomes just a pointer to them).
With that change, the size of the vnet'd variables is
about 300bytes and nfsd.ko will now load.

As such, my original reasons for having VNET_NFSD
are gone. However, bz@ indicates he would prefer
it to remain for a couple of reasons:

It isolates any possible security vulnerability introduced by the changes.

It makes it easy to find the changes.

However, he doesn't sound really "strongly for it", so
if you really think VNET_NFSD should go away now, I
suppose I can do that.
(It will be easy to get rid of later and I suppose we have
about 4 months to decide if VNET_NFSD should be in FreeBSD14.)

I suspect that like VNET itself, VNET_NFSD will be useful enough that we'll want to enable it in GENERIC sooner or later. So, if VNET_NFSD does not carry significant additional overhead (300 bytes is peanuts compared to some of the other per-vnet structures), and there is some way to administratively disable nfs-in-vnet-jails by default (we can already do so on a per-jail basis it seems, maybe there should be a global sysctl to control it?), then my inclination would be to avoid a new config option.

sys/kern/vfs_mount.c
1364	It looks like any nmount(2) caller can pass an "export" option. Suppose we have a program running in a NFSD-enabled jail, and it tries to update an existing mount. For example, a jail might have a read-only nullfs mount passed into the jail (I think this is a fairly common setup). Can't the program abuse this new check to avoid calling vfs_suser()? If the program tries to update the read-only mount to a read-write mount, which ought to be prohibited, is anything stopping it aside from vfs_suser()?

This version of the patch prohibits "export" being used
with other options that modify mount semantics when
done in a vnet prison. I think this addresses the problem
markj@ noticed, where "export" could be used with other
mount options by root in the jail to change the semantics
of a mount done outside the jail.

I also got rid of VNET_NFSD. It can easily be put back in
if bz@ is strongly in favour of it.

rmacklem marked an inline comment as done.Jan 17 2023, 3:08 PM

rmacklem added inline comments.

sys/kern/vfs_mount.c
1364	Good catch! I had mistakenly thought that being "root" was sufficient, but being "root" inside the jail should not allow changing the options on a mount done outside the jail. I couldn't think of a really simple fix for this, so what I did was add code so that "export" cannot be used with other options that change the mount point's semantics, such as "ro". I made this change for nmount(2)/mount(2) being done from within a vnet jail with allow.nfsd only, so that any third party software that might combine "export" with other options will not break. I do not know if such software exists, but there was an alternative to mountd being used by some (called nfse, I think?).

kib added inline comments.Jan 17 2023, 11:28 PM

sys/kern/vfs_mount.c
1364	Perhaps a proper fix is to store the credentials of the user who did the export, and then checking if the current credentials are containing the export credentials WRT jail hierarchy.

Not sure how saving the credentials fixes this?

My understanding of Mark's comment was that this
change (which allows a file system mounted outside
of the jail, as in the jail's root fs, to be exported) could
also allow changing other mount options such as
"ro"->"rw" in the same nmount(2).

The file system is entirely within the jail's tree, since
prison_check_nfsd() confirms that the jail's root is a
file system mount point and the fspath argument to
nmount(2) can't go outside the jail, so I think the
nmount(2) can only export a file system that is within
the jail (the difference being that the jail's root fs must
be mounted before the jail starts, so it is not mounted
by the jail, which is the vfs_suser() check that fails without
this patch).

The patch now prevents any other mount option being
changed when the exports are done.

You check that the stored credentials are 'same or stronger' than the curthread' credentials. So if the mount was done by host root, for instance, then the jail root attempt to modify a mount option would fail.

Unfortunately that is exactly what we need to do in this case.

prison0's root mounts file system that is prisonN's root fs.
mountd running in prisonN exports file system mounted above.

When I first did this, I avoided the problem by:
A - Running mountd is prison0 and nfsd in prisonN.

This works and does not require any changes to nmount(2), but has two limitations that others did not like. 1 - NFSv4 only. 2 - Cannot export any file system mounted from within prisonN.

So I then switched to plan B:
B - Run mountd in prisonN along with nfsd.

This requires two changes to nmount(2). 1 - Don't call VFS_MOUNT() for the "export" case (or ignore the ENOENT reply you get if you do so, if you prefer. This does not seem to be an issue, since VFS_MOUNT() does nothing for the "export" case, at least for UFS and ZFS (the only ones I have checked). 2 - nmount(2) called within prisonN by prisonN's root must be able to update exports on the file systems visible within prisonN, even if the file system was mounted by prison0. (prisonN's root fs, for example)

It is this #2 which is problematic, but if only "exports" and no other
updates to the mount point is allowed, I think it is safe.

Now, I think something similar to your proposal might deal with
another outstanding issue:

If a sysadmin runs mountd in both prison0 and prisonN, they can "stomp" on the exports for the same file system and cause confusion. --> Saving a reference to the cred that did the exports would allow nmount(2) to "ignore" cases where the exports on the file system were done by a different prison (prison0 vs prisonN). --> I think this would avoid the "stomping" on each others exports and allow this to work. --> Having said this, I am not sure running mountd both in prison0 and prisonN should ever be done?

In D37741#865591, @rmacklem wrote:

Unfortunately that is exactly what we need to do in this case.

prison0's root mounts file system that is prisonN's root fs.

mountd running in prisonN exports file system mounted above.

When I first did this, I avoided the problem by:
A - Running mountd is prison0 and nfsd in prisonN.

This works and does not require any changes to nmount(2), but has two limitations that others did not like. 1 - NFSv4 only. 2 - Cannot export any file system mounted from within prisonN.

So I then switched to plan B:
B - Run mountd in prisonN along with nfsd.

This requires two changes to nmount(2). 1 - Don't call VFS_MOUNT() for the "export" case (or ignore the ENOENT reply you get if you do so, if you prefer. This does not seem to be an issue, since VFS_MOUNT() does nothing for the "export" case, at least for UFS and ZFS (the only ones I have checked). 2 - nmount(2) called within prisonN by prisonN's root must be able to update exports on the file systems visible within prisonN, even if the file system was mounted by prison0. (prisonN's root fs, for example)

It is this #2 which is problematic, but if only "exports" and no other
updates to the mount point is allowed, I think it is safe.

Now, I think something similar to your proposal might deal with
another outstanding issue:

If a sysadmin runs mountd in both prison0 and prisonN, they can "stomp" on the exports for the same file system and cause confusion. --> Saving a reference to the cred that did the exports would allow nmount(2) to "ignore" cases where the exports on the file system were done by a different prison (prison0 vs prisonN). --> I think this would avoid the "stomping" on each others exports and allow this to work. --> Having said this, I am not sure running mountd both in prison0 and prisonN should ever be done?

Hmm, I'm not sure why one wouldn't need to run mountd in both the jail host and in one or more jails? I would probably end up doing just that. If that's not possible I'd have to create a dummy jail, nullfs-mount my desired exports from the host into the jail, then run mountd/nfsd in that jail.

I think your solution of only allowing jailed programs to toggle MNT_EXPORTED is probably ok. Following your logic, I can't really see an alternative.

P.S. while reading vfs_export.c I discovered the "webnfs" mountd option. Does this get used by anyone to your knowledge?

The previous version of the patch missed a case
where mnt_flag could be updated by other flags
in the nmount(2) flag agument.

This version includes code that makes sure only
the MNT_EXPORTED flag gets set in mnt_flag
for the export in a jail case.

W.r.t. markj@'s questions...

Running mountd in multiple prisons is no problem, assuming the
prisons do not share the same file systems.

For the case of running mountd in both prison0 (the main system)
and in a prison can be a problem, because the mountd instances can
end up manipulating the exports on the same file system.

In practice, if the /etc/exports in the main system does not export any
file systems that are exported in prisons, it will normally work ok
so long as multiple mountd(8) instances do not update exports concurrently.
(When there is only one mountd, there is no problem.)
--> To fix this, mountd can use a syscall to indicate when it starts/ends

export updating and other mountd's will need to wait their turn.

For the case where misconfigured /etc/export files result in the same
file system being exported by multiple mountds, I think a variant of
Kostik's suggestion may at least detect/avoid updates by multiple
mountds, which avoids the confusion and conflicts can be logged
via syslogd.

If the exports on the file system includes an indication of which
prison set them (keeping a reference on the credentials used when doing the
export is probably the easiest), then credentials can be compared with the
credentials (the cr_prison part) doing an update and, if a different prison, the update would
fail. You could still have a misconfigured /etc/exports in the main system
with entries in it that will fail due to being exported within a jail, but at
least the conflict will be detected and mountd(8) would not break exports
set by another mountd(8) instance.

There might be an issue with holding a credential reference blocking a
jail from dying, but so long as mountd(8) was modiied so that it deletes
all exports it created when terminated (I can't recall if it does this now),
the the credentials should get dereferenced.

This is will definitely take a little while to come up with a patch, but I
think it can be done.

W.r.t. webnfs. I hope no one uses it and I haven't tested it in literally decades
but I cannot be sure no one uses it. (It was a special public file handle, which
allowed NFSv3 to avoid using the Mount RPC. Similar to what NFSv4 does,
except NFSv3 did not cross server file system mount points, so it applied to
a single file system on the server. The Sun thinking was this would allow a
public file system mountable across the internet, but I don't think many
ever used it.)

A little off topic, but I think I have figured out a
simple way to handle mountd running both in prison0
and other prisons.
--> define a mnt_flag MNT_EXJAIL.
--> set it when an export is done within a jail

check MNT_EXJAIL vfs_export() and reply an error when the caller is not in a prison and the flag is set or caller is in a prison and the flag is not set

This will at least avoid conflicting export updates and get
mountd to log the failures. With some man page updates,
I think this might be sufficient.

It doesn't handle the case where the same file system gets
exported from within multiple jails, but thi is a pretty big
configuration failure and will only result is NFS mount failures
for this case.

In D37741#865773, @rmacklem wrote:

A little off topic, but I think I have figured out a
simple way to handle mountd running both in prison0
and other prisons.
--> define a mnt_flag MNT_EXJAIL.
--> set it when an export is done within a jail

check MNT_EXJAIL vfs_export() and reply an error when the caller is not in a prison and the flag is set or caller is in a prison and the flag is not set

This will at least avoid conflicting export updates and get
mountd to log the failures. With some man page updates,
I think this might be sufficient.

It doesn't handle the case where the same file system gets
exported from within multiple jails, but thi is a pretty big
configuration failure and will only result is NFS mount failures
for this case.

This is essentially a subset of what I proposed. If the credentials are saved, you know exactly which jail mounted/exported the filesystem, instead of single bit which indicates that some jail did it.

Yes, it is a simple subset of what you suggested.
At this time, I do not think knowing exactly which
jail did the export is needed.

My concern w.r.t. the credentials is that these
referenced credentials will prevent a jail from
terminating due to a "jail -r". Right now mountd(8)
does not delete exports when it terminates. This
is harmless, because they get deleted when it is
restarted or when the file system is dismounted.
--> However, if there is a reference to a credential

for a prison, that prison will be stuck in dying
state (actually this currently happens for vnet
prisons and the problem is believed to be epair,
but presumably this will be fixes someday).

If/when identifying exactly which prison did the
exports is needed, I'd prefer assigning a unique
id to each prison that would still be a unique id
for that prison after it is terminated/removed.
(Something like a 64bit number that increments
for each noewly created jail.)
--> This could be stored in the exports without

needing a reference count on a credential.

As noted above, I think the single bit is sufficient
for now, but if you think I should do the 64bit version,
I can work on it.
--> This deals with mountd(8)/exports(5) configuration

issues and not markj@'s security vulnerability.

rmacklem mentioned this in D38144: Add MNT_EXJAIL to differentiate between exports in prisons vs prison0.Jan 24 2023, 11:05 PM

Re-coded the check for "export" being combined
with other options to be a long winded but simple
to read version. I do not think the semantics are
different than the previous version, but this one
is easy to read, since it just sets has_nonexport true
for any options not allowed to be used with "export"
in a prison.

This version does have slightly different semantics,
since it clears the MNT_EXPORTED flag in the syscall
flags argument for all prisons and not just ones enabled
for allow.nfsd.

rmacklem mentioned this in rGed03776ca7f4: nfsd: Enable the NFSD_VNET vnet front end macros.Feb 18 2023, 11:01 PM

The change looks ok to me, just a few small comments.

sys/kern/vfs_mount.c
865	I'd find the changes below a bit easier to read if they looked something like: has_nonexport = false; TAILQ_FOREACH_SAFE(...) { if (strcmp(opt->name, "update") != 0 && strcmp(opt->name, "fstype") != 0 && strcmp(opt->name, "fspath") != 0 && strcmp(opt->name, "from") != 0 && strcmp(opt->name, "errmsg") != 0) has_nonexport = true; ... } That way it's easier to verify at a glance the list of options that can be set together with "export".
1365	Can we assert this somehow?
1498	Just noting that this change and the one above belong to a different patch.

rmacklem marked an inline comment as done.Feb 21 2023, 1:17 AM

rmacklem added inline comments.

sys/kern/vfs_mount.c
865	Yep, that makes it more readable. I did add a "if (jailed() &&" at the front of it, so there wouldn't be the extra five strcmp() calls for each option for each non-jailed export. (There is a server out there that exports over 87,000 file systems.) I'll load an updated patch, since it no longer has all the following changes.
1365	Not easily. It would have to compare each field of the structure against the old one to see if there are other changes. A KASSERT() for the flags could be added fairly easily. I can do that if you think it is useful?
1498	Yep. I do have different versions of the files, but I do not keep the patches separate. I only try to separate them out for these reviews.

Put the checks for the options allowed with "export"
when within a jail in one spot, as suggested by markj2@.
I added a check for "jailed()" at the beginning of the "if"
to avoid the overhead of doing the strcmp()s for the non-jailed
case.

rmacklem mentioned this in D38743: nfsd: properly virtualize nfsstatsv1.Feb 23 2023, 2:12 PM

rmacklem mentioned this in D38808: Get rid of #ifdef VNET_NFSD.Feb 27 2023, 2:28 PM

rmacklem mentioned this in D38809: Fix rc scripts so that nfsd(8), nfsuserd(8), gssd(8) and rpc.tlsservd(8) can run in vnet prisons.Feb 27 2023, 2:43 PM

markj added inline comments.Feb 28 2023, 4:05 PM

sys/kern/vfs_mount.c.ident2
850 ↗	(On Diff #117710)	Presumably we can't clear MNT_EXPORTED unconditionally, due to backward compatibility constraints? It's less than ideal that the mount() and nmount() system calls behave slightly differently depending on whether they're being invoked in a jail. Actually, it's not entirely obvious to me why we don't want MNT_EXPORTED to be set this way. Also, can't the caller still specify other options using flags? That is, don't we want to filter out everything except MNT_UPDATE?

rmacklem marked an inline comment as done.Mar 1 2023, 1:48 AM

rmacklem added inline comments.

sys/kern/vfs_mount.c.ident2
850 ↗	(On Diff #117710)	Yes, another good catch. The next version will not use MNT_EXPORTED to indicate the specific case where mountd(8) should be allowed to do exports (and only exports) when running in a vnet prison, but the file system was mounted outside of the prison.

This version retains current mount behaviour,
except for the specific case of:

mount options are only those used by mountd for doing exports
in a prison that allows mountd(8) to run

When both of the above are true, a new argument
to vfs_domount() and vfs_domount_update() called
jail_export is true.

The vfs_suser() call is done, but if it returns an error
and "jail_export" is true, the error is cleared and another
new variable called vfs_suser_failed is set true.
This variable limits what flags can be updated. Ok, since
the syscall would have failed otherwise.
The call to VFS_MOUNT() is not done when "jail_exported"
is true, since it does nothing for the "export" case and will
fail.

I think this version retains all previous nmount(2) behaviour,
except for allowing this one specific additional case.

markj added inline comments.Mar 1 2023, 2:31 PM

sys/kern/vfs_mount.c.ident2
956 ↗	(On Diff #118110)	This looks a bit odd: we don't know whether we're in a jail at this point. Can't we assign `jail_export` below with: jail_export = (fsflags & MNT_EXPORTED) != 0 && !has_nonexport && jailed(td->td_ucred) && prison_check_nfsd(td->td_ucred); ? This would also catch the case where userspace specified MNT_EXPORTED using a flag rather than an option, though I'm not sure that anything does that.
1407 ↗	(On Diff #118110)	I suspect that this if-statement should be in the else block above.

rmacklem marked an inline comment as done.Mar 1 2023, 3:35 PM

rmacklem added inline comments.

sys/kern/vfs_mount.c.ident2
956 ↗	(On Diff #118110)	I don't think so. Since the code does not clear MNT_EXPORTED (which you were correct to point out it should not), it might be set in the fsflags argument. All this means is that the flag was set previously and this nmount(2) call may be to do something else, like change "ro" to "rw". My mistake in previous versions of the patch was to try and (mis)use MNT_EXPORTED to indicate that "export" has been set. You correctly pointed out that this would change behaviour for mount within a jail and that was not appropriate. This is why the code now uses "jail_export" to indicate the "special exception" of mountd(8) doing exports within a jail. The only case we want to make the "special exception" for is mountd(8) when it specified the "export" option. Part of the reason getting this correct has been tricky is that mountd(8) passes in the flags it gets from a statfs(2) call for the file system and MNT_EXPORTED might already be set. MNT_EXPORTED doesn't mean "do an export". It means "an export has been done".
1407 ↗	(On Diff #118110)	Yes, yet another good catch. I will change this.

rmacklem marked an inline comment as done.Mar 1 2023, 4:06 PM

rmacklem added inline comments.

sys/kern/vfs_mount.c.ident2
1407 ↗	(On Diff #118110)	I should note that, although I will move this code into the "else", it is only an optimization. Without doing so would not have broken anything, since the state of MNT_ASYNC has not changed and the worst case would be an extraneous mp->mnt_kern_flag &= ~MNTK_ASYNC when it is already cleared.
850 ↗	(On Diff #117710)	The case doing this would break is: 1 - file system mounted inside the jail 2 - file system exported by mountd(8) 3 - file system mount changed in some other way via "mount -u". The above would have cleared MNT_EXPORTED and "unexported the file system at #3. This is why only the case where the "export" option has been specified should be handled in a special way. Adding vfs_suser_failed was almost overkill, but it catches the unlikely case where a program other than mountd(8) does exports and other changes simultaneously. This should be allowed for the case where the file system was mounted inside the prison (and, therefore, vfs_suser() will return 0).

One more thing to note...
Although I think not clearing MNT_EXPORTED in
fsflags (as suggested by markj@), clearing it has
very little effect on semantics.

When set in fsflags, it enables
priv_check(PRIV_VFS_MOUNT_EXPORTED) and
that is the only thing that clearing it would disable.
(Prior to commit 195f1b1 that I did on Dec. 16, 2022
this priv_check() didn't get done when "export" was specified
and was only done on subsequent exports for the same file
system, plus anything else that happens to specify MNT_EXPORTED
in the nmount(2) flags argument.

This is because MNT_EXPORTED is not in MNT_UPDATEMASK.

rmacklem added inline comments.Mar 2 2023, 2:52 PM

sys/kern/vfs_mount.c.ident2
850 ↗	(On Diff #117710)	I now believe the above comment is not accurate, since MNT_EXPORTED is not in MNT_UPDATEMASK and would not be set in the nmount(2) flags argument for #3. I still think the current patch is ok and that not clearing it is fine.