Details

lib/libsys/getrlimit.2
90–94	The syntax here seems bogus. I've suggested a reformulation based on what I understood, might not be what you had in mind. (And, yes, accounted to the user ID of the process is a small lie, but should become true at some point.)
125–127	Unrelated to the rest, but thanks for adding it, I had spotted doc was missing while working on CAP-01 (D46126), but forgot to add it.

This revision is now accepted and ready to land.Sep 10 2024, 12:06 PM

In D46619#1062591, @mjg wrote:

should not this instead limit how much memory is used to back pipe buffers?

But then it would depend on the values of SMALL_PIPE_SIZE and PIPE_SIZE. This makes things less easy and fragile for administrators.

I suspect we also want an addition to lib/libutil/login_class.c, and maybe a default in usr.bin/login/login.conf

In D46619#1062645, @kevans wrote:

I suspect we also want an addition to lib/libutil/login_class.c

Isn't the current addition of a line in the resources[] array in login_class.c enough?

and maybe a default in usr.bin/login/login.conf

Most probably. Also, the current kernel's default is RLIM_INFINITY, which kind of defeats the spirit of this change.

In D46619#1062675, @olce wrote:

In D46619#1062645, @kevans wrote:

I suspect we also want an addition to lib/libutil/login_class.c

Isn't the current addition of a line in the resources[] array in login_class.c enough?

Whoops, my eyeballs skipped over that. Yes, that's good.

In D46619#1062602, @olce wrote:

In D46619#1062591, @mjg wrote:

should not this instead limit how much memory is used to back pipe buffers?

But then it would depend on the values of SMALL_PIPE_SIZE and PIPE_SIZE. This makes things less easy and fragile for administrators.

Much more important is that exposing pipe map usage as limit would make the internal implementation detail a part of the ABI. I think that rough estimation of how much pipes can be allowed for maligned user is enough to mitigate the problem. And it is invariant against possible changes to the pipe implementation.

In D46619#1062675, @olce wrote:

Most probably. Also, the current kernel's default is RLIM_INFINITY, which kind of defeats the spirit of this change.

No, it is not. If admin wants to limit this, it can. But really that was the first ever complain, so why impose artificial limits on everybody? Same as for kqueue/pshared/pts.

This would not expose any detail apart from the fact that pipes have to allocate space to hold data, which is an integral part of the mechanism.

Note this is not necessarily a pipe map limit specifically. If one was to reimplement the code to do away with the map altogether and instead allocate bufs in page-sized (or whatever other size) chunks the limit would apply the same way.

I just checked Linux and they have a global page counter with a some amount reserved for root.

So in my proposal the limit could be in -- say -- kilobytes and for now demanding multiplies of page size to be accepted.

Switch to limit pipe buffers bytes

This revision now requires review to proceed.Sep 11 2024, 9:21 AM

In D46619#1062784, @kib wrote:

In D46619#1062602, @olce wrote:

But then it would depend on the values of SMALL_PIPE_SIZE and PIPE_SIZE. This makes things less easy and fragile for administrators.

Much more important is that exposing pipe map usage as limit would make the internal implementation detail a part of the ABI.

Not wanting to tie configuration to implementation details that are likely to be changed is exactly the assumption behind my comment above. I thought this was clear enough.

And the *pipe map* implementation specifically is irrelevant here, as we could replace it with others that still rely on SMALL_PIPE_SIZE and PIPE_SIZE, and that would still be an implementation leak.

In D46619#1062786, @kib wrote:

In D46619#1062675, @olce wrote:

Most probably. Also, the current kernel's default is RLIM_INFINITY, which kind of defeats the spirit of this change.

No, it is not. If admin wants to limit this, it can. But really that was the first ever complain, so why impose artificial limits on everybody? Same as for kqueue/pshared/pts.

Then the change here definitely does not help in fixing the problem that prompted D46472 at all. Mitigating it, yes, provided the administrator is aware of the problem and sets a limit.

Are you really saying that the security or reliability problems only have to be taken seriously if reported multiple times? D46472 is a very simple but highly effective local denial-of-service attack.

So either we must have a limit here, or we must change the current pipe mechanisms so as to continue working even if the pipe map is exhausted by performing regular allocations (until OOM strikes).

The kqueue, pshared or pts examples are similar, but different in one aspect: If they are not limited, creating more of them is still limited by memory allocation. We can, and should, set default limits for all of these that are much higher than the current usage for a variety of processes, with the goal to both have safeguards (even if the system can continue working, if it does so at a very slow pace, then in practice it is just dead) and not to have to revise these limits every so often. At worse, a too high limit with respect to the actual system resources involved is effectively like RLIMIT_INFINITY. It's much better to have high, imperfect limits, than to have none.

In D46619#1062788, @mjg wrote:

This would not expose any detail apart from the fact that pipes have to allocate space to hold data, which is an integral part of the mechanism.

Even that is an implementation detail.

So in my proposal the limit could be in -- say -- kilobytes and for now demanding multiplies of page size to be accepted.

That there are constraints in the values that can be set would be even harder on the administrator, and I don't see any reason to impose that (if anything, we could do the truncation internally).

I think the only reason that may prompt considering an amount of space as a limit instead is if we give the user control over the pipe size (for semantics or performance reasons, however bogus they are). This is not the case today, and more importantly, there are alternatives to solve this case: a tunable generally limiting user-specified pipe sizes, or even another per-user limit for that precise quantity (I don't think we'll ever need to go that far).

So I don't see any compelling reason to have a limit in amount of space.

In D46619#1062938, @kib wrote:

Switch to limit pipe buffers bytes

For the reasons already exposed, this seems a bad idea. So, please revert to the previous version or thoroughly explain the benefit of doing this.

This revision now requires changes to proceed.Sep 11 2024, 10:27 AM

I agreed to mjg' proposal after I agreed with the statement that limit of size can stay even if the underlying implementation changes. It is more targeted to the cause of the problem.
With the pipe count limit, the administrator has to know more implementation details, e.g. it has to multiply the limit count by PIPE_SIZE to get what the approximate limit in buffer size is.

I do not want to change defaults (i.e. set the limit in limits.conf) exactly because this 'problem' is not a problem in real life.

In D46619#1063012, @kib wrote:

I agreed to mjg' proposal after I agreed with the statement that limit of size can stay even if the underlying implementation changes. It is more targeted to the cause of the problem.

"can stay even" doesn't mean it will stay even. On the contrary, it seems likely that the current implementation will change, and that, years passing, we may want to change the constants to adapt to common hardware.

I do not want to change defaults (i.e. set the limit in limits.conf) exactly because this 'problem' is not a problem in real life.

The initial problem (from D46472) is one of resource exhaustion from user space to the point an administrator cannot recover the machine. How could this not be a real life problem, and a security/reliability at that? Or do you disagree with my (and @sobomax's, IIUC) characterization?

I think there is a confusion for the reason of the change here.

The introduction of RLIMIT_PIPE is not a real fix for the initial problem alone, because:

By default, no limit is set, which means the mechanism is opt-in. For the kind of problem we are trying to solve (if you agree with the characterization above), this is unacceptable.
Even if a limit is set, there is no guarantee that it matches the current value of kern.ipc.maxpipekva, or whatever else will come with future tweaks or implementation changes.
Even if the default limit matches that (and we manage to guarantee that in the long term), the administrator is free to change it however he wants, and in particular set a higher limit not preventing resource exhaustion and taking down the machine.

At best, RLIMIT_PIPE is a mitigation.

The most straigtforward real fix, given the current implementation (I haven't dug in it too much), seems to be to make sure pipe buffers can continue to be allocated even when the pipe map is exhausted, with these allocations subject to OOM, thereby avoiding unrecoverable situations (as long as OOM works).

I agree with introducing RLIMIT_PIPE, out of pragmatism (the mitigation aspect) but also as I see more value in it in the long term. As for other resource limits, it allows an administrator to somehow contain the damage that runaway processes could do to the system in terms of performance. But doing so is profoundly different than having to solve a security issue. Trying to limit performance degradation is great, but more of a nice to have, whereas solving the security/reliability issue is compulsory.

With the view of preventing performance degradation only, I agree not setting a default for RLIMIT_PIPE is probably OK, although I still think setting some reasonable default (1000, or even 10,000, or a fraction of the number of open files, that can be discussed) is better, but can be done as a later step.

With the pipe count limit, the administrator has to know more implementation details, e.g. it has to multiply the limit count by PIPE_SIZE to get what the approximate limit in buffer size is.

I think you're having it backwards.

In most case, an administrator will not care at all about the actual size of pipe buffers or the actual implementation. He will be probably much more interested in setting a limit of the number of pipes, exactly as today we can set a limit to the number of open files. It's unlikely most programs ever use more than a hundred of pipes. Even with a much more conservative number as examplified above, the default pipe map size is not reached. So, in practice, setting RLIMIT_PIPE to a reasonable value reflecting current usage by applications will act as a mitigation, without the need to compute or understand anything more. If the administrator applies more tuning, that may not be true, but then he has to know about the implementation anyway. Additionally, tooling is readily available to see the number of pipes used by processes.

On the contrary, having to set a byte limit will force the administrator to figure out what PIPE_SIZE and know a bit about the implementation to be able to know how many pipes a process can open in which specific system circumstances. And he will have to revise all this knowledge when constants or the implementation are changed, and change his machines' tuning accordingly.

So I still think having to specify a space amount in RLIMIT_PIPE is worse than a simple number of pipes.

I still think we want the login.conf addition, even if it's pipe=unlimited -- we seem to do this for all of them (perhaps as an example of the possible limits that can be configured and their login.conf names?)

In D46619#1063049, @olce wrote:

In D46619#1063012, @kib wrote:

I agreed to mjg' proposal after I agreed with the statement that limit of size can stay even if the underlying implementation changes. It is more targeted to the cause of the problem.

"can stay even" doesn't mean it will stay even. On the contrary, it seems likely that the current implementation will change, and that, years passing, we may want to change the constants to adapt to common hardware.

I do not want to change defaults (i.e. set the limit in limits.conf) exactly because this 'problem' is not a problem in real life.

The initial problem (from D46472) is one of resource exhaustion from user space to the point an administrator cannot recover the machine. How could this not be a real life problem, and a security/reliability at that? Or do you disagree with my (and @sobomax's, IIUC) characterization?

I think there is a confusion for the reason of the change here.

The introduction of RLIMIT_PIPE is not a real fix for the initial problem alone, because:

By default, no limit is set, which means the mechanism is opt-in. For the kind of problem we are trying to solve (if you agree with the characterization above), this is unacceptable.

Even if a limit is set, there is no guarantee that it matches the current value of kern.ipc.maxpipekva, or whatever else will come with future tweaks or implementation changes.

Even if the default limit matches that (and we manage to guarantee that in the long term), the administrator is free to change it however he wants, and in particular set a higher limit not preventing resource exhaustion and taking down the machine.

At best, RLIMIT_PIPE is a mitigation.

The most straigtforward real fix, given the current implementation (I haven't dug in it too much), seems to be to make sure pipe buffers can continue to be allocated even when the pipe map is exhausted, with these allocations subject to OOM, thereby avoiding unrecoverable situations (as long as OOM works).

I agree with introducing RLIMIT_PIPE, out of pragmatism (the mitigation aspect) but also as I see more value in it in the long term. As for other resource limits, it allows an administrator to somehow contain the damage that runaway processes could do to the system in terms of performance. But doing so is profoundly different than having to solve a security issue. Trying to limit performance degradation is great, but more of a nice to have, whereas solving the security/reliability issue is compulsory.

With the view of preventing performance degradation only, I agree not setting a default for RLIMIT_PIPE is probably OK, although I still think setting some reasonable default (1000, or even 10,000, or a fraction of the number of open files, that can be discussed) is better, but can be done as a later step.

With the pipe count limit, the administrator has to know more implementation details, e.g. it has to multiply the limit count by PIPE_SIZE to get what the approximate limit in buffer size is.

I think you're having it backwards.

In most case, an administrator will not care at all about the actual size of pipe buffers or the actual implementation. He will be probably much more interested in setting a limit of the number of pipes, exactly as today we can set a limit to the number of open files. It's unlikely most programs ever use more than a hundred of pipes. Even with a much more conservative number as examplified above, the default pipe map size is not reached. So, in practice, setting RLIMIT_PIPE to a reasonable value reflecting current usage by applications will act as a mitigation, without the need to compute or understand anything more. If the administrator applies more tuning, that may not be true, but then he has to know about the implementation anyway. Additionally, tooling is readily available to see the number of pipes used by processes.

On the contrary, having to set a byte limit will force the administrator to figure out what PIPE_SIZE and know a bit about the implementation to be able to know how many pipes a process can open in which specific system circumstances. And he will have to revise all this knowledge when constants or the implementation are changed, and change his machines' tuning accordingly.

So I still think having to specify a space amount in RLIMIT_PIPE is worse than a simple number of pipes.

Initially I also thought that using pipe numbers as limit is simpler to understand by a person who suddenly cares about this kind of local DoS. But really, the need to know PIPE_SIZE etc exposes the implementation internals in much worse way than a note of pipe buffers, I agree with mjg there. Which is the reason I changed my position: any implementation needs pipe buffers in KVA.

And yes, it is mitigation. So far there is no fix for the problem of finite amount of computing resources.

I will go with the current patch.

Rename RLIMIT_PIPE to RLIMIT_PIPEBUF.
Handle procstat(1)

keeping some % of reserve for root would be good,, equivalent to _falloc_noinstall

In D46619#1063775, @mjg wrote:

keeping some % of reserve for root would be good,, equivalent to _falloc_noinstall

This cannot be done with limits. It is very different thing, requiring changes to vm_map_find().

This revision was not accepted when it landed; it landed in state Needs Review.Sep 15 2024, 6:31 AM

Closed by commit rG3a2a5d606082: getrlimit(2): document RLIMIT_UMTXP (authored by kib). · Explain Why

This revision was automatically updated to reflect the committed changes.

kib added a commit: rG3a2a5d606082: getrlimit(2): document RLIMIT_UMTXP.

kib added a commit: rGe3b777534bae: procstat(1): make rlimit_param[] size assertion automatic.

kib reopened this revision.Sep 15 2024, 6:34 AM

reserve configured percentage of buffers zone to superuser

kib added a parent revision: D46678: vm_map: add vm_map_find_locked(9).Sep 15 2024, 7:05 AM

kib mentioned this in D46678: vm_map: add vm_map_find_locked(9).Sep 15 2024, 5:51 PM

alc added a subscriber: alc.Sep 15 2024, 6:03 PM

alc added inline comments.

lib/libsys/getrlimit.2
88
93

Misc bugfixes. The patch seems to work in expected way.

Initially I also thought that using pipe numbers as limit is simpler to understand by a person who suddenly cares about this kind of local DoS. But really, the need to know PIPE_SIZE etc exposes the implementation internals in much worse way than a note of pipe buffers, I agree with mjg there.

I explained in my previous post why an admin wanting to mitigate the DoS doesn't currently need to know PIPE_SIZE or the implementation. He just has to set the number of pipes to a reasonable value for current processes (a few hundreds to thousands, depending on what kind of stuff he runs), and this acts as a mitigation in most practical cases (i.e., unless the machine doesn't have much memory, less than 1GB to give a rough idea).

Additionally, the DoS mitigation case is relevant only pending a true fix in the pipe implementation (see below).

Which is the reason I changed my position: any implementation needs pipe buffers in KVA.

And that's a bad reason, because:

Of what I've just wrote about the DoS mitigation.
The fact that the proper limit in a longer-term perspective (limiting applications to conserve performance of the overall system) is the number of pipes, which doesn't depend on or assumes anything about the implementation.

And yes, it is mitigation. So far there is no fix for the problem of finite amount of computing resources.

That might sound funny, but saying there is no such "fix" is missing the point. And in reality you know it, since as per @mjg's suggestion, you've just added pipebuf_reserve in order to at least allow root to do something on an otherwise deadlocked machine, which is a much better behavior that is additionally enabled by default.

There are probably even better things to do. One would be to make sure that the system always continue to run, albeit in degraded mode, by having the pipe implementation fallback on swapable memory (and doing more copies), with the added benefit that runaway processes would be subject to OOM. Another would be to code some OOM-like mechanism for the pipe map. I can take care of that at a later time.

These would render the use of RLIMIT_PIPEBUF obsolete as a mitigation for the DoS. For the performance degradation limitation part, RLIMIT_PIPE seems more adapted since it can be based on actual process use, which is measurable, and can be used without having to know anything about the implementation.

I will go with the current patch.

That's most probably a system design mistake (except pipebuf_reserve, which is good).

The number of pipes should be the commanding control, what the administrators have to set for their processes, and what we developers should take into account as one of the factors when deciding the size of the pipe map or PIPE_SIZE, not the other way around. The administrators should tune to the needs of their applications, and if they then encounter a performance problem, they'll either tune down (or use more machines) or ask the developers to do something as much as possible (improvements to the implementation, more tuning knobs, etc.). And we should have a deadlock prevention mechanism regardless of tuning. pipebuf_reserve is a minimal one, as it requires the administrator to intervene immediately at the point of deadlock instead of the system continuing with degraded performance.

What you are saying instead is, only slightly exaggerating: We don't provide defaults (no safety net), and good luck if all the pipe map is exhausted (be sure to be right next to the console (enabled by pipebuf_reserve; before that, that was: just restart your machine, and to hell with all your processes)). In case you really care about your machine's work (why would you?), in all magnanimity, we give you a tunable in bytes that you'd better figure out how to set correctly (this depends on your applications' actual needs in your environment, which are not in pipe bytes but in number of pipes; if you set it too high, it is as if no limit was specified (jump back to start); please learn about PIPE_SIZE and our current implementation's auto-tuning mechanisms (pipe map KVA size; resizing of pipe buffers); be prepared to learn it again and change your tuning at the next change in behavior).

Well, if you don't see any problem here, this discussion is hopeless.

If you choose to go ahead anyway, at least basic documentation on how to tune the limit would be better.

This revision now requires changes to proceed.Sep 16 2024, 12:33 PM

This revision was not accepted when it landed; it landed in state Needs Revision.Sep 20 2024, 6:47 AM

Closed by commit rG54a8d1fbbf65: getrlimit(2): document RLIMIT_PIPEBUF (authored by kib). · Explain Why

This revision was automatically updated to reflect the committed changes.

kib added a commit: rG54a8d1fbbf65: getrlimit(2): document RLIMIT_PIPEBUF.

kib added a commit: rG3458bbd39778: kernel: add RLIMIT_PIPEBUF.

kib added a commit: rGa4c04958f526: libutil: support RLIMIT_PIPEBUF.

kib added a commit: rG5d92f20c7d31: bin/sh: support RLIMIT_PIPEBUF.

kib added a commit: rGf54f41403d14: usr.bin/limits: support RLIMIT_PIPEBUF.

kib added a commit: rGb029e29e0d8b: login.conf: add a placeholder for the pipebuf limit.