Paths

Table of Contentst

sysctl: Revise meaning of CTLFLAG_PRISON
Needs ReviewPublic
Actions

Authored by igoro on Oct 29 2024, 5:43 PM.

Details

Reviewers

markj
kp
zlei
jamie
olce

Summary

The -J flag recently added to sysctl(8) helps to filter listing down to
the variables which are ment to be per jail. The filtering is based on
CTLFLAG_PRISON flag presence, which is mostly used along with
CTLFLAG_RW. Hence, read-only (CTLFLAG_RD) per-jail variables are not
listed.

As long as CTLFLAG_RD | CTLFLAG_PRISON combination is not meant to
change the behavior, we could use CTLFLAG_PRISON flag in read-only cases
as well to indicate that a variable is designed to be per jail.

The CTLFLAG_PRISON flag is added to the following read-only sysctlS:
- kern.osrelease
- kern.osreldate
- security.jail.list
- security.jail.children.max
- security.jail.children.cur
- security.jail.jailed
- security.jail.vnet

No functional change intended for these sysctlS.

MFC after:      3 weeks

Diff Detail

Repository

rG FreeBSD src repository

Lint

Lint Skipped

Unit

Tests Skipped

Build Status

Buildable 60733
Build 57617: arc lint + arc unit

Event Timeline

igoro created this revision.Oct 29 2024, 5:43 PM

Herald added subscribers: olce, imp. · View Herald TranscriptOct 29 2024, 5:43 PM

igoro requested review of this revision.Oct 29 2024, 5:43 PM

Harbormaster completed remote builds in B60251: Diff 145612.Oct 29 2024, 5:44 PM

This is the followup of https://reviews.freebsd.org/D47107.

I intentionally omitted the CTLFLAG_PRISON for these two sysctl during their introduction due to the docs say that such sysctl can be written to by processes in jail(2). As long as this couple is read-only I did not add the flag.

For now, after a quick code overview, it looks that this flag is not about writable, it's about "value per jail". I think that we could tune the docs and actually add this flag for a bunch of existing sysctl. A very quick counting on my side yields 18 of them like .jailed, .vnet, etc.

This very short patch is a discussion starter, I can extend it to cover more sysctl and the docs as well. What do you think?

Another point is how to deal with sysctl which are "per jail" and vnet-related as well. I guess, they could end up with both flags?

This looks good to me.

This revision is now accepted and ready to land.Oct 30 2024, 2:22 AM

sysctl: Add missing CTLFLAG_PRISON to security.jail.children.*

@igoro The subsystem should be jail, so jail: Add missing CTLFLAG_PRISON to security.jail.children.* . Also it is important to explain why.

Also it is better to add commit meta message

Fixes: ab0841bdbe84 jail: expose children.max and children.cur via sysctl

In D47332#1079398, @igoro wrote:

Another point is how to deal with sysctl which are "per jail" and vnet-related as well. I guess, they could end up with both flags?

I guess no. Actually a VNET instance can be shared between multiple jail prisons, e.g., legacy jails those share vnet0.

I like the idea of flagging sysctls that hold values that can be read per jail. If jail(2) says otherwise, we can change that to match.

In D47332#1079396, @igoro wrote:

For now, after a quick code overview, it looks that this flag is not about writable, it's about "value per jail". I think that we could tune the docs and actually add this flag for a bunch of existing sysctl. A very quick counting on my side yields 18 of them like .jailed, .vnet, etc.

I agree with your code review.

I like the idea. The only inconvenience I see is that this flagging has to be done by hand, and there is no real automatic way to prevent wrong flagging with CTLFLAG_PRISON (e.g., conditional on the use of curthread->td_cred->cr_prison), but these are very minor, and certainly not worse than the current situation.

This hasn't been done before simply because the CTLFLAG_PRISON was never actually checked if not trying to write to the corresponding sysctl. Now that D47107 has been committed, this change would make a useful difference IMO.

In D47332#1079396, @igoro wrote:

This is the followup of https://reviews.freebsd.org/D47107.

I intentionally omitted the CTLFLAG_PRISON for these two sysctl during their introduction due to the docs say that such sysctl can be written to by processes in jail(2). As long as this couple is read-only I did not add the flag.

For now, after a quick code overview, it looks that this flag is not about writable, it's about "value per jail". I think that we could tune the docs and actually add this flag for a bunch of existing sysctl. A very quick counting on my side yields 18 of them like .jailed, .vnet, etc.

This very short patch is a discussion starter, I can extend it to cover more sysctl and the docs as well. What do you think?

Approved, but I'd like a version of the above in the commit message too.

I had to double check the things. I've delved into the details and discovered that the existing code does not allow to use CTLFLAG_PRISON as a flag which means that a variable varies per jail. I've stumbled upon the *allow* variables like these ones:

SYSCTL_PROC(_security_jail, OID_AUTO, set_hostname_allowed,
     CTLTYPE_INT | CTLFLAG_RW | CTLFLAG_MPSAFE,
     NULL, PR_ALLOW_SET_HOSTNAME, sysctl_jail_default_allow, "I",
     "Processes in jail can set their hostnames (deprecated)");

SYSCTL_PROC(_security_jail, OID_AUTO, mlock_allowed,
     CTLTYPE_INT | CTLFLAG_RW | CTLFLAG_MPSAFE,
     NULL, PR_ALLOW_MLOCK, sysctl_jail_default_allow, "I",
     "Processes in jail can lock/unlock physical pages in memory");

If we add CTLFLAG_PRISON to such sysctlS it will add some mess to the existing interface and behavior, and can add headache for future code. Having such flag added to these variables makes a feeling they can be changed but instead of a correct message that "Operation not permitted" we get something like this:

root@somejail:/ # sysctl security.jail.set_hostname_allowed=1
security.jail.set_hostname_allowed: 0 -> 0

Currently, I see two branches here:

Plan A. If sysctl -J flag was meant to be used to list only variables which can be fiddled by prisoned roots then we keep it as is and probably re-phrase the sysctl man page a bit to make it clear.

Plan B. If the idea of sysctl -J flag is more about listing all sysctl variables which are allocated/meant per jail and may vary per jail, whether they are read-only or writable, then it seems to be better to add a brand new flag for this work. There is a strong feeling that we must keep the existing one named as is due to a lot of mentioning around in the code, docs, articles, etc. And the new one could be something like CTLFLAG_PERJAIL. And, probably, that's a good idea to keep this possibility to have a sysctl variable which is allocated per jail but is not allowed to be changed by prisoned roots.

What do you think?

P.S. Sorry, I've missed the original discussion of the -J sysctl flag, it seems this discussion should have happened there.

So, the problem you have with these sysctl knobs is that they are supposed to be writable from the host but not from jails, while their value is per jail. That comes from the fact that they (unfortunately, IMHO) in fact serve two different purposes:

They allow to set the default for their corresponding allow bit when creating top-level jails.
They allow to retrieve the current setting for any jail.

In passing, these knobs are deprecated (in favor of jail parameters for 2; for 1 this is unclear to me), so I would not try too hard to cater to them (e.g., they could be omitted, if need be).
Do you know if there are other knobs with this exact situation?

As food for thought, in order to be more flexible in the future and easier on coders, I'd consider introducing new flags such as CTLFLAG_PRISON_RD and CTLFLAG_PRISON_RW, following the precedents of CTLFLAG_RD and CTLFLAG_RW, and CTLFLAG_CAPRD and CTLFLAG_CAPRW (except for the last underscore), with obvious meanings, and in addition reserve CTLFLAG_PRISON to only indicate knobs having per-jail value. I'm not sure if there will be actual cases where CTLFLAG_PRISON_RW without CTLFLAG_PRISON would make sense, but even if none materializes, I find it easier to reason about these clearly delineated flags than with conflated ones. Even a combination like CTLFLAG_PRISON alone, without neither CTLFLAG_PRISONRD nor CTLFLAG_PRISONRW, which doesn't make sense today, might be leveraged in the future.

Finally, whatever the outcome, and though I read you on compatibility, I'd still consider renaming CTLFLAG_PRISON to something more meaningful. I only see 18 occurences of it in the current tree, so its usage is low enough that changing it shouldn't be that much of a burden and only on a small amount of people. And I think it is especially desirable that we are not backwards compatible if the meaning of CTLFLAG_PRISON is to be changed (as in the above proposal).

In D47332#1082956, @igoro wrote:
If we add CTLFLAG_PRISON to such sysctlS it will add some mess to the existing interface and behavior, and can add headache for future code. Having such flag added to these variables makes a feeling they can be changed but instead of a correct message that "Operation not permitted" we get something like this:
root@somejail:/ # sysctl security.jail.set_hostname_allowed=1
security.jail.set_hostname_allowed: 0 -> 0

It's worse than that - it actually changes prison0's view of that bit.

These particular sysctls are an ugly back-compat hack. In the old world, jail permissions were global; the host could set them and jails could read them. Now such permissions are per-jail, but for backward compatibility they do two different things: For prison0, they read or write the default value of the corresponding permission bit, but only for jails that are created via the old jail(2) interface. Inside a prison, they read that prison's corresponding permission bit. They really ought to be deprecated, along with jail(2) itself.

Aside from those, almost all sysctls that refer to the current prison are either CTLFLAG_RD or CTLFLAG_PRISON. A quick search of req->td->td_ucred->cr_prison yields only security.mac.do.rules. There may be others, since perhaps not all sysctl functions get cr_prison in a single statements like that. But those (and the back-compat ones) can be fixed to explicitly check for the prison in the write case, instead of implicitly as the comment in sys_jail_default_allow notes.

With such a fix, CTLFLAG_PRISON can be added to them (it can already be added to the CTLFLAG_RD ones) without needing a second or third CTLFLAG_PRISON_XXX flag.

In D47332#1083086, @jamie wrote:

Aside from those, almost all sysctls that refer to the current prison are either CTLFLAG_RD or CTLFLAG_PRISON. A quick search of req->td->td_ucred->cr_prison yields only security.mac.do.rules. There may be others, since perhaps not all sysctl functions get cr_prison in a single statements like that. But those (and the back-compat ones) can be fixed to explicitly check for the prison in the write case, instead of implicitly as the comment in sys_jail_default_allow notes.

I intend to have security.mac.do.rules as writable from both the host and jails, so this one will be no exception (CTLFLAG_RW|CTLFLAG_PRISON). In a series of not yet reviewed changes, I added CTLFLAG_PRISON to this knob. However, I ask that you don't do that now, as the current sysctl implementation is wrong for writes from jails (also fixed in the series to be published for review).

With such a fix, CTLFLAG_PRISON can be added to them (it can already be added to the CTLFLAG_RD ones) without needing a second or third CTLFLAG_PRISON_XXX flag.

The most common cases are allowing the same accesses from jails as from the host, or allowing just read access from jails and read-write from the host and they're indeed covered by the existing flags. However, using CTLFLAG_PRISON for read-only per-jail knobs changes that situation. I'd prefer to have separate flags as evoked above, but if the only offending sysctls are those we considered, I'm OK to defer that and have an explicit check in their implementation.

igoro retitled this revision from sysctl: Add missing CTLFLAG_PRISON to security.jail.children.* to sysctl: Revise meaning of CTLFLAG_PRISON.Nov 22 2024, 7:39 PM

Pivot the patch.

This revision now requires review to proceed.Nov 22 2024, 7:41 PM

Harbormaster completed remote builds in B60733: Diff 146847.Nov 22 2024, 7:41 PM

Thank you all for the discussion. I've refreshed the context for myself and I would like to share it:

We have got a new -J flag for sysctl(8). It filters the variables by CTLFLAG_PRISON.
It reminded me of a couple of existing sysctlS which are not listed with the -J flag.
That's because they are read-only (CTLFLAG_RD) without CTLFLAG_PRISON flag set. The reasoning follows.
The CTLFLAG_PRISON has been used as a way to allow jailed roots modify a sysctl. It works very simple way. If a sysctl sys call request means modification, i.e. a new buffer (the req->newptr) is provided, then the sys call will check if the user has PRIV_SYSCTL_WRITE privilege. It does not work for a jailed root, that's why the sys call will check for another privilege (PRIV_SYSCTL_WRITEJAIL) if the sysctl in question has CTLFLAG_PRISON flag set. That's the only use case for this flag for now.
And it's fine that read-only per-jail variables do not have this flag, it's not needed for them.
As long as I found this little inconsistency I proposed to update the meaning of this flag (sysctl.9) and add it to some read-only variables which are, kind of, expected to be found in the sysctl -aJ output.

I've revised the existing variables with CTLFLAG_PRISON flag set, most of them are writable as expected. Just for the reference:

compat.linux.osname                    CTLFLAG_RW
compat.linux.osrelease                 CTLFLAG_RW
compat.linux.oss_version               CTLFLAG_RW
hw.vmm.destroy                         CTLFLAG_RW
hw.vmm.create                          CTLFLAG_RW
kern.hostname                          CTLFLAG_RW
kern.domainname                        CTLFLAG_RW
kern.hostuuid                          CTLFLAG_RW
kern.securelevel                       CTLFLAG_RW
kern.hostid                            CTLFLAG_RW
kern.ipc.posix_shm_list                CTLFLAG_RD
net.inet.tcp.getcred                   CTLFLAG_RW
net.inet6.tcp6.getcred                 CTLFLAG_RW
net.inet.udp.getcred                   CTLFLAG_RW
security.bsd.suser_enabled             CTLFLAG_RWTUN
security.bsd.unprivileged_proc_debug   CTLFLAG_RW

It seems we agreed not to touch security.jail.*allowed* and similar deprecated variables which are based on sysctl_jail_default_allow() and sysctl_jail_default_level() sysctl handlers:

security.jail.set_hostname_allowed
security.jail.socket_unixiproute_only
security.jail.sysvipc_allowed
security.jail.allow_raw_sockets
security.jail.chflags_allowed
security.jail.mount_allowed
security.jail.mlock_allowed
security.jail.enforce_statfs
security.jail.devfs_ruleset

Also, @olce asked to leave the following one as is due to there is a separate work going on around that:

security.mac.do.rules

The following candidates are left for updating:

kern.osrelease
kern.osreldate
security.jail.list (CTLTYPE_STRUCT)
security.jail.children.max
security.jail.children.cur
security.jail.jailed
security.jail.vnet

I've reworked this patch to align with the above. As long as this change is meant to be a follow-up to the addition of -J flag to sysctl(8) (https://cgit.freebsd.org/src/commit/?id=5ec83c660acaf30c1d6b9417dbd8c80dfa9d56ac) it's proposed to be MFC'ed too.

The demo of the results is as follows:

> # BEFORE:
> sysctl -aJ
kern.securelevel: -1
kern.hostname: test
kern.hostid: 2943262026
kern.domainname:
kern.hostuuid: 0ac945e4-230b-4396-af07-290441a6e1f9
security.bsd.unprivileged_proc_debug: 1
security.bsd.suser_enabled: 1

> # AFTER:
> sysctl -aJ
kern.osrelease: 15.0-CURRENT
kern.securelevel: -1
kern.hostname: test
kern.hostid: 532967369
kern.domainname:
kern.osreldate: 1500027
kern.hostuuid: 317a6b09-7292-45d6-bac1-66ed0694d7f8
security.bsd.unprivileged_proc_debug: 1
security.bsd.suser_enabled: 1
security.jail.children.cur: 0
security.jail.children.max: 999999
security.jail.vnet: 0
security.jail.jailed: 0

Perhaps we could leave the more serious refactoring for later. What do you think?

igoro edited the summary of this revision. (Show Details)Nov 22 2024, 7:53 PM