The -J flag recently added to sysctl(8) helps to filter listing down to the variables which are ment to be per jail. The filtering is based on CTLFLAG_PRISON flag presence, which is mostly used along with CTLFLAG_RW. Hence, read-only (CTLFLAG_RD) per-jail variables are not listed. As long as CTLFLAG_RD | CTLFLAG_PRISON combination is not meant to change the behavior, we could use CTLFLAG_PRISON flag in read-only cases as well to indicate that a variable is designed to be per jail. The CTLFLAG_PRISON flag is added to the following read-only sysctlS: - kern.osrelease - kern.osreldate - security.jail.list - security.jail.children.max - security.jail.children.cur - security.jail.jailed - security.jail.vnet No functional change intended for these sysctlS. MFC after: 3 weeks
Details
Diff Detail
- Repository
- rG FreeBSD src repository
- Lint
Lint Skipped - Unit
Tests Skipped - Build Status
Buildable 60733 Build 57617: arc lint + arc unit
Event Timeline
This is the followup of https://reviews.freebsd.org/D47107.
I intentionally omitted the CTLFLAG_PRISON for these two sysctl during their introduction due to the docs say that such sysctl can be written to by processes in jail(2). As long as this couple is read-only I did not add the flag.
For now, after a quick code overview, it looks that this flag is not about writable, it's about "value per jail". I think that we could tune the docs and actually add this flag for a bunch of existing sysctl. A very quick counting on my side yields 18 of them like .jailed, .vnet, etc.
This very short patch is a discussion starter, I can extend it to cover more sysctl and the docs as well. What do you think?
Another point is how to deal with sysctl which are "per jail" and vnet-related as well. I guess, they could end up with both flags?
sysctl: Add missing CTLFLAG_PRISON to security.jail.children.*
@igoro The subsystem should be jail, so jail: Add missing CTLFLAG_PRISON to security.jail.children.* . Also it is important to explain why.
Also it is better to add commit meta message
Fixes: ab0841bdbe84 jail: expose children.max and children.cur via sysctl
I guess no. Actually a VNET instance can be shared between multiple jail prisons, e.g., legacy jails those share vnet0.
I like the idea of flagging sysctls that hold values that can be read per jail. If jail(2) says otherwise, we can change that to match.
I agree with your code review.
I like the idea. The only inconvenience I see is that this flagging has to be done by hand, and there is no real automatic way to prevent wrong flagging with CTLFLAG_PRISON (e.g., conditional on the use of curthread->td_cred->cr_prison), but these are very minor, and certainly not worse than the current situation.
This hasn't been done before simply because the CTLFLAG_PRISON was never actually checked if not trying to write to the corresponding sysctl. Now that D47107 has been committed, this change would make a useful difference IMO.
I had to double check the things. I've delved into the details and discovered that the existing code does not allow to use CTLFLAG_PRISON as a flag which means that a variable varies per jail. I've stumbled upon the *allow* variables like these ones:
SYSCTL_PROC(_security_jail, OID_AUTO, set_hostname_allowed, CTLTYPE_INT | CTLFLAG_RW | CTLFLAG_MPSAFE, NULL, PR_ALLOW_SET_HOSTNAME, sysctl_jail_default_allow, "I", "Processes in jail can set their hostnames (deprecated)"); SYSCTL_PROC(_security_jail, OID_AUTO, mlock_allowed, CTLTYPE_INT | CTLFLAG_RW | CTLFLAG_MPSAFE, NULL, PR_ALLOW_MLOCK, sysctl_jail_default_allow, "I", "Processes in jail can lock/unlock physical pages in memory");
If we add CTLFLAG_PRISON to such sysctlS it will add some mess to the existing interface and behavior, and can add headache for future code. Having such flag added to these variables makes a feeling they can be changed but instead of a correct message that "Operation not permitted" we get something like this:
root@somejail:/ # sysctl security.jail.set_hostname_allowed=1 security.jail.set_hostname_allowed: 0 -> 0
Currently, I see two branches here:
Plan A. If sysctl -J flag was meant to be used to list only variables which can be fiddled by prisoned roots then we keep it as is and probably re-phrase the sysctl man page a bit to make it clear.
Plan B. If the idea of sysctl -J flag is more about listing all sysctl variables which are allocated/meant per jail and may vary per jail, whether they are read-only or writable, then it seems to be better to add a brand new flag for this work. There is a strong feeling that we must keep the existing one named as is due to a lot of mentioning around in the code, docs, articles, etc. And the new one could be something like CTLFLAG_PERJAIL. And, probably, that's a good idea to keep this possibility to have a sysctl variable which is allocated per jail but is not allowed to be changed by prisoned roots.
What do you think?
P.S. Sorry, I've missed the original discussion of the -J sysctl flag, it seems this discussion should have happened there.
So, the problem you have with these sysctl knobs is that they are supposed to be writable from the host but not from jails, while their value is per jail. That comes from the fact that they (unfortunately, IMHO) in fact serve two different purposes:
- They allow to set the default for their corresponding allow bit when creating top-level jails.
- They allow to retrieve the current setting for any jail.
In passing, these knobs are deprecated (in favor of jail parameters for 2; for 1 this is unclear to me), so I would not try too hard to cater to them (e.g., they could be omitted, if need be).
Do you know if there are other knobs with this exact situation?
As food for thought, in order to be more flexible in the future and easier on coders, I'd consider introducing new flags such as CTLFLAG_PRISON_RD and CTLFLAG_PRISON_RW, following the precedents of CTLFLAG_RD and CTLFLAG_RW, and CTLFLAG_CAPRD and CTLFLAG_CAPRW (except for the last underscore), with obvious meanings, and in addition reserve CTLFLAG_PRISON to only indicate knobs having per-jail value. I'm not sure if there will be actual cases where CTLFLAG_PRISON_RW without CTLFLAG_PRISON would make sense, but even if none materializes, I find it easier to reason about these clearly delineated flags than with conflated ones. Even a combination like CTLFLAG_PRISON alone, without neither CTLFLAG_PRISONRD nor CTLFLAG_PRISONRW, which doesn't make sense today, might be leveraged in the future.
Finally, whatever the outcome, and though I read you on compatibility, I'd still consider renaming CTLFLAG_PRISON to something more meaningful. I only see 18 occurences of it in the current tree, so its usage is low enough that changing it shouldn't be that much of a burden and only on a small amount of people. And I think it is especially desirable that we are not backwards compatible if the meaning of CTLFLAG_PRISON is to be changed (as in the above proposal).
It's worse than that - it actually changes prison0's view of that bit.
These particular sysctls are an ugly back-compat hack. In the old world, jail permissions were global; the host could set them and jails could read them. Now such permissions are per-jail, but for backward compatibility they do two different things: For prison0, they read or write the default value of the corresponding permission bit, but only for jails that are created via the old jail(2) interface. Inside a prison, they read that prison's corresponding permission bit. They really ought to be deprecated, along with jail(2) itself.
Aside from those, almost all sysctls that refer to the current prison are either CTLFLAG_RD or CTLFLAG_PRISON. A quick search of req->td->td_ucred->cr_prison yields only security.mac.do.rules. There may be others, since perhaps not all sysctl functions get cr_prison in a single statements like that. But those (and the back-compat ones) can be fixed to explicitly check for the prison in the write case, instead of implicitly as the comment in sys_jail_default_allow notes.
With such a fix, CTLFLAG_PRISON can be added to them (it can already be added to the CTLFLAG_RD ones) without needing a second or third CTLFLAG_PRISON_XXX flag.
I intend to have security.mac.do.rules as writable from both the host and jails, so this one will be no exception (CTLFLAG_RW|CTLFLAG_PRISON). In a series of not yet reviewed changes, I added CTLFLAG_PRISON to this knob. However, I ask that you don't do that now, as the current sysctl implementation is wrong for writes from jails (also fixed in the series to be published for review).
With such a fix, CTLFLAG_PRISON can be added to them (it can already be added to the CTLFLAG_RD ones) without needing a second or third CTLFLAG_PRISON_XXX flag.
The most common cases are allowing the same accesses from jails as from the host, or allowing just read access from jails and read-write from the host and they're indeed covered by the existing flags. However, using CTLFLAG_PRISON for read-only per-jail knobs changes that situation. I'd prefer to have separate flags as evoked above, but if the only offending sysctls are those we considered, I'm OK to defer that and have an explicit check in their implementation.
Thank you all for the discussion. I've refreshed the context for myself and I would like to share it:
- We have got a new -J flag for sysctl(8). It filters the variables by CTLFLAG_PRISON.
- It reminded me of a couple of existing sysctlS which are not listed with the -J flag.
- That's because they are read-only (CTLFLAG_RD) without CTLFLAG_PRISON flag set. The reasoning follows.
- The CTLFLAG_PRISON has been used as a way to allow jailed roots modify a sysctl. It works very simple way. If a sysctl sys call request means modification, i.e. a new buffer (the req->newptr) is provided, then the sys call will check if the user has PRIV_SYSCTL_WRITE privilege. It does not work for a jailed root, that's why the sys call will check for another privilege (PRIV_SYSCTL_WRITEJAIL) if the sysctl in question has CTLFLAG_PRISON flag set. That's the only use case for this flag for now.
- And it's fine that read-only per-jail variables do not have this flag, it's not needed for them.
- As long as I found this little inconsistency I proposed to update the meaning of this flag (sysctl.9) and add it to some read-only variables which are, kind of, expected to be found in the sysctl -aJ output.
I've revised the existing variables with CTLFLAG_PRISON flag set, most of them are writable as expected. Just for the reference:
compat.linux.osname CTLFLAG_RW compat.linux.osrelease CTLFLAG_RW compat.linux.oss_version CTLFLAG_RW hw.vmm.destroy CTLFLAG_RW hw.vmm.create CTLFLAG_RW kern.hostname CTLFLAG_RW kern.domainname CTLFLAG_RW kern.hostuuid CTLFLAG_RW kern.securelevel CTLFLAG_RW kern.hostid CTLFLAG_RW kern.ipc.posix_shm_list CTLFLAG_RD net.inet.tcp.getcred CTLFLAG_RW net.inet6.tcp6.getcred CTLFLAG_RW net.inet.udp.getcred CTLFLAG_RW security.bsd.suser_enabled CTLFLAG_RWTUN security.bsd.unprivileged_proc_debug CTLFLAG_RW
It seems we agreed not to touch security.jail.*allowed* and similar deprecated variables which are based on sysctl_jail_default_allow() and sysctl_jail_default_level() sysctl handlers:
security.jail.set_hostname_allowed security.jail.socket_unixiproute_only security.jail.sysvipc_allowed security.jail.allow_raw_sockets security.jail.chflags_allowed security.jail.mount_allowed security.jail.mlock_allowed security.jail.enforce_statfs security.jail.devfs_ruleset
Also, @olce asked to leave the following one as is due to there is a separate work going on around that:
security.mac.do.rules
The following candidates are left for updating:
kern.osrelease kern.osreldate security.jail.list (CTLTYPE_STRUCT) security.jail.children.max security.jail.children.cur security.jail.jailed security.jail.vnet
I've reworked this patch to align with the above. As long as this change is meant to be a follow-up to the addition of -J flag to sysctl(8) (https://cgit.freebsd.org/src/commit/?id=5ec83c660acaf30c1d6b9417dbd8c80dfa9d56ac) it's proposed to be MFC'ed too.
The demo of the results is as follows:
> # BEFORE: > sysctl -aJ kern.securelevel: -1 kern.hostname: test kern.hostid: 2943262026 kern.domainname: kern.hostuuid: 0ac945e4-230b-4396-af07-290441a6e1f9 security.bsd.unprivileged_proc_debug: 1 security.bsd.suser_enabled: 1 > # AFTER: > sysctl -aJ kern.osrelease: 15.0-CURRENT kern.securelevel: -1 kern.hostname: test kern.hostid: 532967369 kern.domainname: kern.osreldate: 1500027 kern.hostuuid: 317a6b09-7292-45d6-bac1-66ed0694d7f8 security.bsd.unprivileged_proc_debug: 1 security.bsd.suser_enabled: 1 security.jail.children.cur: 0 security.jail.children.max: 999999 security.jail.vnet: 0 security.jail.jailed: 0
Perhaps we could leave the more serious refactoring for later. What do you think?