Page MenuHomeFreeBSD

Make pipe(2) & Co bullet-proof again (PR 272332)
Needs ReviewPublic

Authored by sobomax on Aug 29 2024, 7:25 AM.
Referenced Files
F108748287: D46472.id142615.diff
Mon, Jan 27, 5:00 PM
F108747686: D46472.id142582.diff
Mon, Jan 27, 4:57 PM
Unknown Object (File)
Thu, Jan 23, 6:52 PM
Unknown Object (File)
Thu, Jan 23, 12:00 AM
Unknown Object (File)
Wed, Jan 22, 11:56 PM
Unknown Object (File)
Sun, Jan 19, 2:45 PM
Unknown Object (File)
Sun, Jan 12, 7:12 AM
Unknown Object (File)
Fri, Jan 10, 2:19 PM
Subscribers

Details

Reviewers
kib
kevans
mjg
Summary

There is a quite nasty condition that can happen particularly if some program misbehaves and eats all the kva space, leaving affected system live-locked and admin with no breadcrumbs to follow. In general, whichever pid is eating it out should be killed with extreme prejudice, since right now it's basically a system-wide deadlock implementation, as it blocks any fork+exec from completing. :(

As documented: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=272332

Upon further investigation it seems that the root of the issue is some recent changes that would shift error to allocate pipe buffer into the pipe(2), from the write() handler (c6d3d601c). Arguably, pipe()&friends being in the critical path of typical UNIX life, it needs to be absolutely robust. If some presumably user-level code cannot deal with failures of the write() later on, well, that's the problems of that code and should be handled there.

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

sobomax edited the summary of this revision. (Show Details)
sys/kern/sys_pipe.c
569

I do not object, but you could be even more nice to user by including p_comm and possibly jail id. See, for instance, killproc() logging.

Also I am not sure about usefulness of the tuning(7) reference, kernel should not lecture user.

sobomax edited the summary of this revision. (Show Details)

Added option to kill Pipe KVA (ab)users.

@kib thanks for the input. Beefed up telemetry, as suggested, removed the reference to tuning. Added option to kill the repeated offender (or an innocent bystander ;-)

sys/kern/sys_pipe.c
535

There is no precedent in such behavior, where kernel abruptly terminates user processes only because some shared resource was exhausted.

Note that OOM is different: there kernel cannot make progress, which is why it has to try to kill something to leave other things running. Similarly RLIMIT_FSIZE is not a counter-example to my claim, the limit is specifically configured with known behavior.

Kernel should do nothing more than returning an error to the caller there. Message is fine.

sys/kern/sys_pipe.c
535

There is no precedent in such behavior, where kernel abruptly terminates user processes only because some shared resource was exhausted.

Note that OOM is different: there kernel cannot make progress, which is why it has to try to kill something to leave other things running. Similarly RLIMIT_FSIZE is not a counter-example to my claim, the limit is specifically configured with known behavior.

Kernel should do nothing more than returning an error to the caller there. Message is fine.

This distinction is a bit of in an eye of beholder, I think. Yours seems to be a bit blind to this (somewhat understandably). In my eye (and this is what brought me here), as a systems operator, both are in the very same category. Which is the category of things that would render my system inoperable and irrecoverable without major intervention (i.e. hard reset). And in fact THIS issue is more dangerous since for OOM we have a certain limits, however crude, that can be enforce to limit exposure at least for the unprivileged user. For this one we have none. Like literally nothing to prevent unprivileged user to take the whole system down. With just one syscall and an unconditional jump instruction.

IMHO the sysctl is useful if for nothing else but to annoy authors of the recent preallocate feature (pun intended) to put an extra effort into making sure this (probably) useful optimization does not blow up people systems, which were running perfectly fine for 10+ years before this change has been made.

As I already noted in the bug report, this is also a security concern. Making life of someone wanting to DOS bunch of FreeBSD boxes using some zero-day so much easier.

P.S. I think this whole new "behaviour" is the (unintended?) consequence of the c6d3d601c, which shifted the failure to allocate buffer from the write() call into the popen() call. Which added the whole new failure mode into fork()->exec() mechanism that can be easily triggered by anybody. So no new process can be created even if the new process and/or its parent would never want to write or read to the pipe in a normal conditions.

Interestingly enough, we seems to be fixing the very similar issue back 10 years ago. And I think the shift of the pipe_paircreate() from returning void to int was a mistake that needs to be undone.

commit 183870cf7574b96444d4fc0e5d5203d303ed53a0
Author: Mateusz Guzik <mjg@FreeBSD.org>
Date:   Fri May 2 00:52:13 2014 +0000

    Ignore the error from pipespace_new when creating a pipe.

    It can fail if pipe map is exhausted (as a result of too many pipes created),
    but it is not fatal and could be provoked by unprivileged users. The only
    consequence is worse performance with given pipe.

    Reported by:    ivoras
    Suggested by:   kib
    MFC after:      1 week

diff --git a/sys/kern/sys_pipe.c b/sys/kern/sys_pipe.c
index 6ba52e37c5f2..f3c3c0e01d61 100644
--- a/sys/kern/sys_pipe.c
+++ b/sys/kern/sys_pipe.c
@@ -221,8 +221,8 @@ SYSCTL_INT(_kern_ipc, OID_AUTO, piperesizeallowed, CTLFLAG_RW,
 static void pipeinit(void *dummy __unused);
 static void pipeclose(struct pipe *cpipe);
 static void pipe_free_kmem(struct pipe *cpipe);
-static int pipe_create(struct pipe *pipe, int backing);
-static int pipe_paircreate(struct thread *td, struct pipepair **p_pp);
+static void pipe_create(struct pipe *pipe, int backing);
+static void pipe_paircreate(struct thread *td, struct pipepair **p_pp);
 static __inline int pipelock(struct pipe *cpipe, int catch);
 static __inline void pipeunlock(struct pipe *cpipe);
 #ifndef PIPE_NODIRECT
@@ -331,12 +331,11 @@ pipe_zone_fini(void *mem, int size)
        mtx_destroy(&pp->pp_mtx);
 }

-static int
+static void
 pipe_paircreate(struct thread *td, struct pipepair **p_pp)
 {
        struct pipepair *pp;
        struct pipe *rpipe, *wpipe;
-       int error;

        *p_pp = pp = uma_zalloc(pipe_zone, M_WAITOK);
 #ifdef MAC
@@ -355,30 +354,21 @@ pipe_paircreate(struct thread *td, struct pipepair **p_pp)
        knlist_init_mtx(&wpipe->pipe_sel.si_note, PIPE_MTX(wpipe));

        /* Only the forward direction pipe is backed by default */
-       if ((error = pipe_create(rpipe, 1)) != 0 ||
-           (error = pipe_create(wpipe, 0)) != 0) {
-               pipeclose(rpipe);
-               pipeclose(wpipe);
-               return (error);
-       }
+       pipe_create(rpipe, 1);
+       pipe_create(wpipe, 0);

        rpipe->pipe_state |= PIPE_DIRECTOK;
        wpipe->pipe_state |= PIPE_DIRECTOK;
-       return (0);
 }

Whatever given process needs to have a at least one pair created should be pre-allocated at the process initialization time IMHO. It's like pair of socks. Yes you can do without it, but not really. At least one pair, probably two or three.

sobomax marked an inline comment as done.
sys/kern/sys_pipe.c
535

OOM and this situation are completely different: in OOM case system (kernel) cannot make progress, and user cannot free resources by e.g. killing processes. There, user can.

Your patch causes rejection because it introduces random uncontrolled data loss in the system.

sys/kern/sys_pipe.c
535

Well, now that kill code is out let's focus on the root of the issue.

Your earlier change in the c6d3d601c introduced possibility of random uncontrolled lockup of the system. I think that was a mistake and needs to be undone. Why should a process be refused a pipe allocation if it might not even want to write to that pipe now or ever? The portion that made popen() return an error was a mistake and needs to be undone. Take kill(1) utility as a perfect example. In a normal mode of operation it does not read from stdin, it does not write to stdout nor stderr. But in order for kill(1) to start, the shell needs to create a pipe and do usual fork/exec dance.

sobomax retitled this revision from Add little debug to help tracking pipebombs to Make popen() bullet-proof again (PR 272332).Sep 6 2024, 11:04 PM
sobomax edited the summary of this revision. (Show Details)
sobomax edited the summary of this revision. (Show Details)

Undo damage done in c6d3d601c96f, which shifted failure to allocate large buffer to pipe_create() from pipe_write().

The need for such allocation by itself is questionable since absolute majority of users write less than dozen bytes and out of that majority most write exactly 0.

But that shift created a problem by itself, so that when the shared resource is exhausted no new pipes can be created thus preventing any meaningful activity in the system.

sobomax retitled this revision from Make popen() bullet-proof again (PR 272332) to Make pipe(2) & Co bullet-proof again (PR 272332).Sep 9 2024, 2:55 PM

This undoes the fix done in c6d3d601c96f5836df76847, which hurts real users.

Whatever band-aids you would try to put on top of the issue, it cannot solve the fundamental inability of computers to allocate infinite amount of resources. Perhaps, the only clean way to isolate one user from mal-acting neighbor is to implement per-uid pipe limit (RLIMIT_PIPEN), which is not that useless IMO.

In D46472#1062498, @kib wrote:

This undoes the fix done in c6d3d601c96f5836df76847, which hurts real users.

Whatever band-aids you would try to put on top of the issue, it cannot solve the fundamental inability of computers to allocate infinite amount of resources. Perhaps, the only clean way to isolate one user from mal-acting neighbor is to implement per-uid pipe limit (RLIMIT_PIPEN), which is not that useless IMO.

D46619

Undo damage done in c6d3d601c96f, which shifted failure to allocate large buffer to pipe_create() from pipe_write().

The need for such allocation by itself is questionable since absolute majority of users write less than dozen bytes and out of that majority most write exactly 0.

I tend to agree in principle, but it seems that D23993 was done in reaction to problems in application that get an error at write time and are not prepared to handle it. According to POSIX, such applications actually should (it says, e.g., that write() can return ENOBUFS), but hey, we don't live in an ideal world.

But that shift created a problem by itself, so that when the shared resource is exhausted no new pipes can be created thus preventing any meaningful activity in the system.

This will be mostly prevented by @kib's D46619, but I don't think it's enough because the (automatic or manual) tuning of kern.ipc.maxpipekva may not match the configured maximum limit for RLIMIT_PIPE (which may be arbitrary; and currently is just infinity).

To make these fixes complete, there should be a fallback to allocate memory, e.g., via malloc(), when the pipe map is exhausted, as I believe @mjg proposed via IRC.