Each one is an arbitrary string associated with a jail. It can be set upon jail creation or added/modified later: > jail -cm ... meta="tag1=value1 tag2=value2" env="configuration" The values are not inherited from the parent jail. A parent jail can read both metadata parameters, while a child jail can read only env via security.jail.env sysctl. The maximum size of meta or env per jail is controlled by the global security.jail.meta_maxbufsize sysctl. Decreasing it does not alter the existing meta information. The set of allowed chars is controlled through the global security.jail.meta_allowedchars sysctl. Each metadata buffer can be handled as a set of key=value\n strings: > jail -cm ... meta="$(echo t1=v1; echo t2=v2)" env.1=one > jls meta.t2 env.1 meta.t1 Sponsored by: SkunkWerks GmbH
Details
Diff Detail
- Repository
- rG FreeBSD src repository
- Lint
Lint Skipped - Unit
Tests Skipped - Build Status
Buildable 61290 Build 58174: arc lint + arc unit
Event Timeline
Currently, there is no accessible way to attach metadata to jails. This is commonly used elsewhere, for example in Kubernetes, to allow non-unique properties to enrich load balancers, schedulers, and volume provisioners to make more informed decisions.
The initial motivation was to be able to have an external load balancer to update its pool of endpoints based on jail metadata - for example, as a web application is updated from v1 to v2, the load balancer would query the “www” tags, and be able to grow its pool until there are 3 new v2 jails, and then remove the older v1 versions.
While there are already other mechanisms to make metadata available inside the jail, these rely on shared memory, or files, and are not directly tied to the jail’s lifecycle.
The Object Specific Data approach is already used with jails, to provide the zfs-specific dataset names, and jailed on/off properties. This implementation follows the same approach allowing generic metadata, accessible from outside, inside, and via the existing jail and flua tools.
From the technical perspective, there are the following open topics:
- It seems that _security_jail sysctl node ref exposure could be extracted as a separate patch if confirmed.
- There is a question of whether the kernel should limit the set of allowed characters within the meta buffer, or if it's better to keep such policies outside of the kernel business.
- The current state of the patch allows reading meta by any user within a jail. Do we want to disable it by default and add something like allow.read_meta or allow.metadata parameter to control it per jail?
I can see the appeal of this to jail managers, but having just a single meta sysctl per jail implies that it has to be "owned" by a single writer. This requires everything wanting to make use of this new feature to be tightly coupled to the writer. A flat list of key=value pairs (instead of a single value) would allow multiple users of this feature per jail e.g. multiple helper commands preparing just one aspect of a jail e.g. dynamic devfs ruleset loading, network setup/teardown, storage provisioning.
If I understand the code correctly security.jail.meta_maxbufsize is the upper limit of the amount of additional kernel memory the jail can tie down with this feature. If a jail is allowed to create sub-jails should each of them be able to allocate metadata up to the global limit or should each of them get its own limit and the allocation be counted (recursively) against the parent limits?
Just tossing in a general +1, I haven't taken any time to review this at the moment; I've thought about a similar notion in the past (jail tags) to be able to do some role-based querying of jails.
Another +1.
Testing here using podman to inject metadata into containers, something similar to the original motivation.
Regarding the restrictions for reading the sysctl from inside the jail, I have no strong opinions, and fully trust your judgement (based on current names, most likely allow.read_meta, as I would interpret allow.metadata as being able to set metadata from inside the jail).
Right now I use the per-jail host.hostuuid to track jail identities and keep the associated state somewhere under /var/db/jails/${host_hostuuid}/ which is ugly.
This is the way to go IMHO. It would take more work to support it, but a lot less to use it.
If I understand the code correctly security.jail.meta_maxbufsize is the upper limit of the amount of additional kernel memory the jail can tie down with this feature. If a jail is allowed to create sub-jails should each of them be able to allocate metadata up to the global limit or should each of them get its own limit and the allocation be counted (recursively) against the parent limits?
I like the hierarchical limits.
I see a use case for both. Per-jail metadata just for the jail manager, and per-jail environment that the jail itself can see. If both exist, there's no need to set permissions.
There is a third usecase: Reporting *changes* to metadata from inside jail to the host (e.g. service health to a jail manager that ties into a load balancer). So there are (at least) three useful access permissions (host only, host write+jail read, host+jail write), but at some point it probably makes more sense to create a small tmpfs and nullfs mount it into the jail with the desired permissions.
Is this jail metadata visible inside the jail? For podman, read-only host-visible metadata can be added as annotations on the container but these are not visible inside the container's jail.
Yes, it is!
# podman run -it --rm ghcr.io/jlduran/freebsd:15.0 sh # sysctl security.jail.meta security.jail.meta: <do not exit>
Then inject the metadata:
jail -m jid=<jail-id> meta="www=v2"
Go back to the container, and check:
# sysctl security.jail.meta security.jail.meta: www=v2
There seems to be strong support for separating tags. I am
indifferent to either approach (I can do what I need either way)
but there is a strong tradeoff:
The current approach is simple but clean - (one writer updates all
tags in a single "transaction"), and can be made compare-and-swap if
needed with minimal effort in future. Reading all tags again is
idempotent. This is sufficient for slow-moving / changing targets,
when most tags are applied at jail creation.
Using multiple tags mean its not possible to update all tags in a
single transaction, but it can be operated on more cleanly in a sysctl
style view. I don't know if this approach works well for string vs
integer data - are sysctls typed at all?
security.jail.meta.private.foo=bar
security.jail.meta.private.bar=baz
security.jail.meta.private.baz=1
Reading through https://kubernetes.io/docs/concepts/overview/working-with-objects/common-labels/
app.kubernetes.io/name
- The name of the application mysql string
app.kubernetes.io/instance
- A unique name identifying the instance of an application mysql-abcxyz string
How could we support this arrangement? While good OCI alignment was
not part of the brief, it would be desirable to support it.
How would the multi-tag view work if exposed via sysctl?
How do we incorporate "app.kubernetes.io/instance" as a single key, in a sysctl
style view?
What would happen when security.jail.meta.private.baz=1 changes type, from 1 to "bim" ?
What is the state of data if you read security.jail.meta.io.kubernetes.app,
returning 2 keys and values, if its not a single read transaction?
security.jail.meta.io.kubernetes.app.name
security.jail.meta.io.kubernetes.app.instance
They could have inconsistent values.
I'm not opposed to either implementation, or another one, I just think that
we introduce a lot of hidden complexity *for the kernel* rather than keep
internal simple and leave the mess to userland tools.
I would love to see future OCI/kubernetes needs accommodated. Comments welcomed
on how we could support this!
It seems there is an agreement to split it onto two buffers per jail: both are readable by parent jail, while only one is readable by a jail itself. The updated patch reflects this concept. For now metaext and metaint naming is used as external/internal concept. The naming is open for discussion.
Current state of the patch still does not insist on the structure of data in a buffer, it's up to users. As a demonstration of currently meant UI/UX, which obviously is not very straightforward, users could leverage the usual Unix-way tools for "tagging":
> jail -c name=tst persist metaext="$(echo role=www; echo version=20240824)" > jls -jtst metaext role=www version=20240824 > jls -jtst metaext | grep ^role= | cut -d= -f2 www > jls -jtst metaext | grep ^version= | cut -d= -f2 20240824 > jail -m name=tst metaext="$(jls -jtst metaext | sed 's/^version=.*$/version=20241120/')" > jls -jtst metaext role=www version=20241120
This is where we have the open space for the next ideas which would ease "tagging": sysctl-based, a separate program like jtags, and so on.
There is a third useful combination: (privileged) access host access only e.g. as a place to store API token or similar things without ever writing them to a file.
The problem having a fixed set of sysctls with with their designated access permissions is that all accesses have to go through a single writer and use the same data format because a single sysctl is just one string. This means anyone else is required closely coupled to that writer unless there is some standardised patch and query interface (e.g. JSON patch and JSON query) and I can't imagine anyone wanting to deal with that complexity. A flat key=value store avoids this by allowing multiple writers to lay exclusive claim to their part of the keyspace, but it doesn't absolve the need for access control.
Yeah, it seems that on the Jail Call of 26-Nov we came to a conclusion that for now we would keep it very simple like two buffers per jail managed from the user-land side, while keeping a wide spectrum of opportunities to extend it in the future having a more specific production need in mind. Thus, we can postpone thinking about extra complexity on the kernel side.
The buffer char set limiting is going to be added soon to this patch. As we discussed we could start with the better defaults for the sake of security and allow a very limited set of chars, while there is an opportunity to provide some configuration knobs in the future to widen the set if needed.
If I had managed to make that call, I have to say that at least the conclusion wouldn't have been unanimous. The single blob is indeed simpler to implement, but not simpler to use. Still, you mention extension, and that's a possibility. I'm thinking something like:
meta="foo=bar\0baz=bletch"
meta.foo="bar"
meta.baz="bletch"
Speaking of which, I like "meta" for per-jail private metadata, and "env" for environment readable inside the jail.
Yes, it's just a summarization of the meeting, the discussion continues.
The jails have a nice architecture, you know it :-), which allows adding a new parameter without even changing the user-land: jail(8), jls(8), etc. Of course, currently it's not meant to support such dynamic things like "meta.<anything>". I like the idea, especially when it's supported by others, to teach user-land to treat those meta buffers special way and introduce an extra interface like "meta.<key>=<value>". This is exactly one of the ways I envisioned for extending, and the string manipulations are kept out of the kernel. I would like to share my current understanding of how such UI/UX would look like to have a broader discussion:
# Users can still do whatever they want with a buffer: > jail -c meta="anything..." # Users can opt-in to use the structure proposed by new versions of jail(8) # and jls(8), what means to have a buffer sliced by a <delimiter> onto # "key=value" strings. # All the examples below assume that the <delimiter> is '\n' char. > jail -c persist meta.key1="value1" meta.key2="value2" > jls meta key1=value1 key2=value2 > jls meta.key2 value2 # Modify and append: > jail -m meta.key2="newvalue" meta.key3="v3" > jls meta key1=value1 key2=newvalue key3=v3 # We could support the existing syntax style for key=value pair removal: > jail -m nometa.key1 nometa.key3 > jls key2=newvalue
My current gut feeling is that \n newline char is the best delimiter to have the widest interoperability. And the kernel side, especially vfsopt subroutines, is like already based on a single NULL-terminated string. So that users can still work with a buffer as a human-readable text file and leverage key-based support of jail(8) and jls(8):
# Users may have metadata coming from an external source or a file: > cat a.conf k1=one k2=two > jail -m meta="$(cat a.conf)" > jls meta.k2 two
All the above seem to outline a possible next project/patch to extend the initial idea of "just a blob of text". What do you think?
Speaking of which, I like "meta" for per-jail private metadata, and "env" for environment readable inside the jail.
Probably, I'm missing something here, but have you meant it like "meta" param for parent jails and "env" param for the jail itself readable via sysctl security.jail.env? Like jail -c meta="tagging..." env="configuring...".
The latest update aggregates the recent discussions:
- Rename metaext/metaint to meta/env. meta is expected to be used for "tagging" and hidden from the jail. env is intended for "configuring" and readable by the jail through security.jail.env sysctl.
- The allowed chars for each buffer are very limited by default, it covers Base64, k=v\n format, and some extra bytes. It can be changed via security.jail.meta_allowedchars sysctl. For convenience (as it seems to me for now), setting it to an empty string allows everything.
- The tests and man page are upgraded respectively.
Why is this a kernel issue? Aside from NUL, because it preserves the C-string nature, allowed characters would seem only to be a concern on the user side.
Yes, that's a good point. Indeed, it feels like a user-land concern by default. If we agree to limit it by default for extra security, then having it in a single place covers all use cases and entry points: jail(8), flua, libjail, direct syscall, and so on -- it seems preferable to addressing it separately for each existing and future syscall consumer. What do you think? Does this approach offer additional benefits for us?
@jamie thanks for the comments. The thinking for restriction of characters is this:
- the most common expected use case is ascii character set, used in jails tools & shell scripts
- if the stored data can be binary, we introduce a risk of shell escapes in a place where root privileges are often used
- and every single program & tool needs to implement its own poorly secured, insufficiently tested, parsing,
- it is simpler & cleaner to restrict character set in 1 place, consistently
- that doesn't have to be in the kernel, it can be in userland, e.g. in a c library that can be used by jls(8) jail(8) jail(3lua) and libxo variants thereof.
There are currently a wide variety of strings in jail parameters, and for that matter many more elsewhere in the kernel. So far, we have gotten by with counting on administrators putting reasonable values in them.
while testing jls output with new tags, I noted that the manpage says " Each jail is represented by one row which contains space-separated values of the listed parameter". Which is not quite the case as many parameters could have an embedded new line, that's not trimmed out.
So for meta/env we could either:
- amend the manpage to match the existing reality (and then meta/env would just follow the existing inconsistent behaviour)
- swap all newline for spaces during jls line-based output (not in libxo mode)
I added a PR for this issue, https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=283414
sys/kern/kern_jailmeta.c | ||
---|---|---|
426 | Who is the "someone else" who was quicker? This section is run with an exclusive lock held - you've even started the function by asserting that. You may unlock the prison, but you never unlock allprison_lock (which of course you shouldn't). |
sys/kern/kern_jailmeta.c | ||
---|---|---|
426 | Thank you very much for your time. Indeed, it may seem non-obvious, the comment could be improved. Currently, the reasoning is as follows. Jail creation/update is under the giant allprison_lock, while the OSD framework only expects protection of its struct osd. In our case this is pr_osd which is protected by pr_mtx. Usually, we tend to avoid giant locks when there is an optimization demand, and there has been increased discussion about cases with a large number of jails. For example, new logic could be introduced for mass manipulation or mass reset of metadata, which is optimal to be done with per-jail locking (pr_mtx) without allprison_lock. It might be premature to address this now, but it's easier to implement it these days while the context is still fresh in my mind -- avoiding the potential need to analyze hard-to-diagnose defects caused by unexpectedly overridden metadata due to interference from new logic. And it's not too much complexity, a relatively simple repeat. What do you think? |
sys/kern/kern_jailmeta.c | ||
---|---|---|
426 | Makes sense, if there are plans to go that direction. Yes, the comment could be improved, notably to mention the possible future use case. But I still don't see that use case. Surely, for something done with mass manipulation of data, the "mass" in question would be all jails or some subset, which would still require allprison_lock. I see perhaps a non-mass use case, where you would change the environment in some fashion other than jail_set, though I don't see offhand what that other method would be, or why you want it. I know that reading can happen outside the jail system (the sysctl to read the jail's own env), but not writing. |
sys/kern/kern_jailmeta.c | ||
---|---|---|
426 | Thank you, the comment has been improved. |