Page MenuHomeFreeBSD

libc: add fdwalk
AbandonedPublic

Authored by mjg on Aug 10 2019, 2:53 AM.
Tags
None
Referenced Files
Unknown Object (File)
Sun, Jan 19, 9:02 AM
Unknown Object (File)
Sun, Jan 19, 5:55 AM
Unknown Object (File)
Sat, Jan 18, 11:59 PM
Unknown Object (File)
Sat, Jan 18, 4:44 AM
Unknown Object (File)
Sat, Jan 18, 4:21 AM
Unknown Object (File)
Fri, Jan 17, 10:42 PM
Unknown Object (File)
Mon, Dec 23, 11:16 PM
Unknown Object (File)
Dec 19 2024, 7:23 AM

Details

Reviewers
kib
jhb
jhibbits
Summary

fdwalk(3) originated in SunOS/Solaris, and is commonly used in Linux
software, including GNOME-based software, like glib and vte. It walks
the current file descriptor list and runs an arbitrary function on each
file descriptor.

Test Plan

Would be nice to have one

Diff Detail

Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

Ultimately we want some other sysctl to get the list of filedescs, because kern.proc.filedesc performs a huge amount of the work that is discarded.

lib/libc/stdlib/fdwalk.3
58

New line for new sentence. malloc(3) etc should be typesetted with .Xr.

61

Not only new fds not handled (really they could be handled depending on the value), but also closed and thus invalid fds might be passed to callback.

The man page lacks explanation of the interface for callback. In particular, what the return value means and how callback should handle errno to make the return state of fdwalk(3) parseable.

lib/libc/stdlib/fdwalk.c
51

if (error == -1)

53

This might be too strict in mt environment. We typically allocate more space than reported, to allow the attempt to succeed even if other thread opened a new file descriptor between query and fetch. More, sometimes we do not give up on the fetch failure but retry several times.

Address @kib's feedback. Man page will probably require several iterations
to get right. It's still a little muddy.

I noted in the source that we should add a new sysctl to give us just the
descriptors. I added a retry, with a 5-try limit (completely arbitrary).
I'm not sure, and would like some input, on what to do if the retry count is
reached. We could walk as many as we have room for, or we could exit with
error. Right now it exits with error.

It's not hard to add such a sysctl. If you want I can hack it up for you. I think it should just export the fd bitmap, then iteration over it is trivial.

In D21206#461101, @mjg wrote:

It's not hard to add such a sysctl. If you want I can hack it up for you. I think it should just export the fd bitmap, then iteration over it is trivial.

Thanks. I've got one half done now. It's probably less trivial than your thought. It's a subset of the filedesc handler, iterating over and building a sbuf of fd's. If you think just returning the fdmap is better, go for it.

Add a new sysctl, kern.proc.fdmap.<proc>, to provide the bitmap. Walk the bitmap instead of the file descriptor list.

lib/libc/stdlib/fdwalk.c
51

You can use 0 to signify the current process and avoid getpid altogether.

83

I don't think the retry logic is useful. Instead you can pass a "typically big enough" buffer. Should it be not big enough, sysctl can return the real size along with ENOMEM in one call. Then you retry once with the new size.

This avoids the initial call to get the size in the common case and retrying like in the above is very unlikely to be of concern for real-world programs - the bitmap itself would have to grow between calls.

97

Is this testing tmp twice? I think it would be nicer to put this stuff inside the loop and 'continue' to skip.

sys/kern/kern_descrip.c
3883

I don't think we should bother with arbitrary processes and even if we do, the common case of inspecting the current one can skip some of the work including locking the process and holding the table.

3893

This is legal but not recommended - should it block due to a page fault we can go off while holding the lock and blocking others for an indeterminate amount of time.

Preferably the code would do a copy with disallowing page faults and fallback to something else if necessary. I don't know if there is a handy way to do it with sysctl. It is possible to wire the target buffer but I don't think it's worth it.

So I think this would be good enough for now with making a local copy before doing SYSCTL_OUT. You can have a small on stack buffer to cover the common case and malloc in the worst case.

sys/kern/kern_descrip.c
3893

Also note if only the current process is of interest, the above concern goes out of the window. bitmaps are only freed when the entire struct is getting freed.

Then you can simply lock the struct, read off the pointer and the size, unlock the struct and SYSCTL_OUT.

sys/kern/kern_descrip.c
3893

roughly this (untested):

static int
sysctl_kern_proc_fdmap(SYSCTL_HANDLER_ARGS)
{
        struct filedesc *fdp;
        NDSLOTTYPE *map;
        size_t size;

        if (*(int *)arg1 != 0)
                return (EINVAL);

        fdp = curproc->p_fd;
        FILEDESC_SLOCK(fdp);
        /*
         * We can safely access the map after dropping the lock because it can
         * only get freed when the struct filedesc is getting freed.
         * See fdgrowtable.
         */
        size = NDSLOTS(fdp->fd_nfiles) * NDSLOTSIZE;
        map = fdp->fd_map;
        FILEDESC_SUNLOCK(fdp);

        return (SYSCTL_OUT(req, map, size));
}

static SYSCTL_NODE(_kern_proc, KERN_PROC_FDMAP, fdmap,
    CTLFLAG_RD|CTLFLAG_CAPRD|CTLFLAG_MPSAFE, sysctl_kern_proc_fdmap,
    "File descriptor map");
sys/kern/kern_descrip.c
3893

apologies for spam, scratch that - only fd tables survive, old maps do get freed so a copying it will be necessary. I stand by not doing copyout with the lock held.

Don't call getpid(), we can pass 0 now.

lib/libc/stdlib/fdwalk.3
45

and _calls_ the callback

64

This is not quite true. Also, on internal errors from malloc(3) or sysctl(3), you return -1 and errno contains some code.

Later you state 'errors are passed through' but I am not sure that this can be understood without looking at the code.

lib/libc/stdlib/fdwalk.c
47

No initialization in declaration.

88

I do not think there is a need to cast through uinptr_t.

sys/kern/kern_descrip.c
3892

Extra {}.

3901

I remember mjg already pointed out that malloc under FILEDESC_SLOCK is undesirable.

3905

Don't you need to free map ?

3909

space around '|' binary op.

jhibbits marked 6 inline comments as done.

Address @kib's feedback. The man page will probably need more iterating, but I
think the code is correct. It may not be optimal, though.

lib/libc/stdlib/fdwalk.c
68

this formula looks like a leftover from previous attempt. sizes are guaranteed to be power of 2 times slot size. Also note my previous remark about calling this with a typically big enough buffer first and only retrying if needed. Each time sysctl can return the expected size so there is no need for manual calculation. Note the map is very rarely resized.

sys/kern/kern_descrip.c
3887

This + fdhold + fddrop business is avoidable for curproc. It should be special cased.

3907

You can read off the size without taking the lock, do malloc and take the lock. Then verify the size fits (it will almost always will), if it does not free and retry. The table will very rarely change the size and it will always grow. I would use a small on stack buffer by default, sized just like the initial map to avoid malloc in the common case.

I'm also increasingly leaning towards extending sysctl api to add SYSCTL_OUT_NOFAULT or so to fast path and avoid all this crap in the common case. General support is already there (based on pre-wired mappings), the code just needs cosmetic changes + api extension. It should come in handy in other code. I'll probably put a review later. With this in place you can avoid the double copy. This can be taken care of at a later time if fdwalk is expected to land for 12.1

3917

This needs CTLFLAG_CAPRD so that it works in capability mode.

So I had a look at glib. What's the benefit of adding this function to libc? All consumers are going to link to glib anyway. Perhaps we should get the kernel part in and then submit a patch to glib to use it.

I had an extra look and there is an extremely worrisome comment: apparently the routine is expected to by async-signal safe. This means both malloc and realloc are off the table. This can be worked around with having a statically sized buffer and requesting the bitmap in chunks which fit the buffer, i.e. pass the buffer and an offset (size of the buffer * iteration counter). See https://gitlab.gnome.org/GNOME/glib/blob/master/glib/gspawn.c#L1182

I think I sufficiently hacked this for inclusion. First the kernel part, then fdwalk itself.

Full kernel diff is here: https://people.freebsd.org/~mjg/fdmap.diff

I thought sysctl api would always return the new len, unfortunately that's only the case when the oldptr is NULL. Perhaps this can be fixed at a later date.

So:

  • only support curproc for simplicity
  • if oldptr is NULL just return the size without populating anything. the read is no more racy than it would be with the lock held.
  • if the we are the only user of the table don't lock or try to cache anything, just copy it out as it is
  • for the rest make a copy but try to fit it in a local buffer if possible
static int
sysctl_kern_proc_fdmap(SYSCTL_HANDLER_ARGS)
{
        struct filedesc *fdp;
        NDSLOTTYPE lmap[NDSLOTS(NDFILE) * 16];
        NDSLOTTYPE *map;
        size_t bsize, csize;
        int error; 
        
        CTASSERT(sizeof(lmap) <= 128);
        
        if (*(int *)arg1 != 0)
                return (EINVAL); 
        
        fdp = curproc->p_fd;
                
        if (req->oldptr == NULL)
                return (SYSCTL_OUT(req, NULL, NFD2MAPSIZE(fdp->fd_nfiles)));
        
        if (curproc->p_numthreads == 1 && fdp->fd_refcnt == 1)
                /* No need to lock anything as only we can modify the map. */
                return (SYSCTL_OUT(req, fdp->fd_map, NFD2MAPSIZE(fdp->fd_nfiles)));
        
        /* Potential other threads modifying the map. */
        map = lmap;
        bsize = sizeof(lmap);
        for (;;) {
                FILEDESC_SLOCK(fdp);
                csize = NFD2MAPSIZE(fdp->fd_nfiles);
                if (bsize >= csize)
                        break;
                FILEDESC_SUNLOCK(fdp);
                if (map != lmap)
                        free(map, M_TEMP);
                bsize = csize;
                map = malloc(bsize, M_TEMP, M_WAITOK);
        }               
        memcpy(map, fdp->fd_map, csize);
        FILEDESC_SUNLOCK(fdp);
        error = SYSCTL_OUT(req, map, csize);
        if (map != lmap)
                free(map, M_TEMP);
        return (error);
}       
                
static SYSCTL_NODE(_kern_proc, KERN_PROC_FDMAP, fdmap,
    CTLFLAG_RD|CTLFLAG_CAPRD|CTLFLAG_MPSAFE, sysctl_kern_proc_fdmap,
    "File descriptor map");

There was api incompatibility in fdwalk - it is supposed to always return 0 on its own errors.

If bothering with retrying in the first place there is no point trying to limit the number of loops. We are guaranteed that everything will fit in at some point. Currently the kernel never shrinks the bitmap, but I future-proofed the code to account for it. With this in mind:

int
fdwalk(int (*cb)(void *, int), void *cbd)
{
        int mib[4];
        size_t oldlen, newlen;
        int error, i, j, len;
        NDSLOTTYPE *buf, tmp;

        mib[0] = CTL_KERN;
        mib[1] = KERN_PROC;
        mib[2] = KERN_PROC_FDMAP;
        mib[3] = 0;

        oldlen = 0;
        for (;;) {
                error = sysctl(mib, nitems(mib), NULL, &newlen, NULL, 0);
                if (error == -1)
                        return (0);
                if (oldlen < newlen) {
                        oldlen = newlen;
                        buf = alloca(newlen);
                }
                newlen = oldlen;
                error = sysctl(mib, nitems(mib), buf, &newlen, NULL, 0);
                if (error == 0)
                        break;
                if (errno != ENOMEM)
                        return (0);
        }

        /*
         * Go through the full file list.  The fdmap is an integral multiple of
         * sizeof(NDSLOTTYPE).
         */
        len = howmany(newlen, sizeof(NDSLOTTYPE));

        for (i = 0; i < len; i++) {
                /*
                 * Iterate over each bit in the slot, short-circuting when there
                 * are no more file descriptors in use in this slot.
                 */
                for (j = 0, tmp = buf[i];
                        j < NBBY * sizeof(NDSLOTTYPE) && tmp != 0;
                        j++, tmp >>= 1) {
                        if (tmp & 1) {
                                error = cb(cbd, FD(i, j));
                                if (error != 0)
                                        return (error);
                        }
                }
        }
        return (0);
}

Note I still don't know if inclusion in libc is a good idea, I don't have a strong opinion one way or the other. But if putting this in libc we better make sure glib is using it.

In D21206#462415, @mjg wrote:

I had an extra look and there is an extremely worrisome comment: apparently the routine is expected to by async-signal safe. This means both malloc and realloc are off the table. This can be worked around with having a statically sized buffer and requesting the bitmap in chunks which fit the buffer, i.e. pass the buffer and an offset (size of the buffer * iteration counter). See https://gitlab.gnome.org/GNOME/glib/blob/master/glib/gspawn.c#L1182

I fdwalk(3) needs to be async-signal safe, it is enough to switch from malloc(3) to either mmap(2)/munmap(2) or alloca(3).

I used alloca. It may be wasteful on the stack for a very poorly behaving process but it should be fine. I don't think paying for mmap is justified.

@mjg glib does use fdwalk() if available, and emulates it where needed. This work I started specifically because I saw this same code flow in multiple ports.

sys/kern/kern_descrip.c
3905

D'oh, yes.

3909

This follows style of the rest of the file (copy&paste of another node). All other SYSCTL_NODE()s have this style violation. I will, however, fix this one.

rozhuk.im-gmail.com added inline comments.
lib/libc/stdlib/fdwalk.c
68

Why not sysctl() return ENOMEM if buf size is lowr than needed?

I create bug report for same behavior: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=240573

@mjg glib does use fdwalk() if available, and emulates it where needed. This work I started specifically because I saw this same code flow in multiple ports.

It is true that GLib uses fdwalk when it exists, but it may be changed in the future because of safety concern: https://gitlab.gnome.org/GNOME/glib/merge_requests/574#note_403912. If we guarantee our fdwalk is safe for the use case of GLib, I think the man page should explicitly mention that it is safe to use fdwalk between fork and exec. If it is async-signal safe, fdwalk should be included in the list of safe functions in sigaction(2) man page.

A concern was also raised on the permission to debug a process: https://gitlab.gnome.org/GNOME/glib/issues/1638#note_400640. Is fdwalk going to work when debugging is disabled? If the answer is no, it is possible that GLib still have to switch to its slow fallback code when fdwalk provided by the system fails.

  1. If fdwalk lands in the tree, it will be an async-signal safe implementation.
  2. It will always work for the current process (i.e. when it requests the data for itself), in fact my suggested kernel interface only allows this mode.

Thanks for linking the discussion upstream. Is there any particular timeline where it would be beneficial to get this sorted out?

mjg added a reviewer: jhibbits.

Take over this revision after a talk with @jhibbits.

GNOME releases are made in March and September. It is usually better to make an important change at the beginning of a release cycle, so it can be tested by more people for longer.

GLib upstream has CI configured to run tests on FreeBSD regularly, but it isn't going to help testing this change because it only runs on releases. It currently runs 11.3 and 12.0 jails on 12.0 kernel.

mjg retitled this revision from libc: Add fdwalk(3), commonly used in Linux software to libc: add fdwalk.
  • add an async-signal-safe implementation
  • fix return value to always be 0

I'll have to do some testing with glib to make sure it properly picks it up etc., but it does work for my toy program and matches all expectations.

lib/libc/stdlib/fdwalk.c
57

I believe that the formal contract for alloca(3) is that the stack space is only freed on return from the function, not on the termination of the block scope. In other words, this loop allows for unbound stack allocation if raced.

lib/libc/stdlib/fdwalk.c
57

It can allocate multiple buffers, but each one is twice as big as the previous and the biggest you can allocate is loosely bounded by your fd limit. For more than two buffers to be allocated the code would effectively have to explicitly dup a fd outside of the previous range and keep doing it while winning the race for more buffers to be allocated.

In other words this can use more memory only if the application is racing with itself on purpose and even then there is an upper bound to it.

kib added inline comments.
lib/libc/stdlib/fdwalk.c
57

Add a comment note that this code relies on the kernel returning specific values for newlen.

This revision is now accepted and ready to land.Sep 17 2019, 2:18 PM

I wrote some tests based on my reading of historical documentation, and posted those for review in D26809. I note that the current release of glib20 in the ports tree does indeed have fdwalk() #if 0'd out, so we'd need to upstream a patch to make it use it for FreeBSD.

If you can confirm that glib's fdwalk also passes your tests I'll take a fresh look at this patch and commit.

In D21206#597622, @mjg wrote:

If you can confirm that glib's fdwalk also passes your tests I'll take a fresh look at this patch and commit.

As discussed, glib's implementation is internal and is suboptimal in that it doesn't preserve the properties that fdwalk has historically had (i.e. from Solaris). The FreeBSD fallback is not even close because we don't mount procfd by default, and the Linux variant is only a little more accurate -- it will open /proc/self/fd and iterate over those, but it invokes the callback as it goes which violates the basic design of the original
solaris fdwalk: make a list of all open fds, then iterate over it and invoke the callback.

I did go ahead and run my tests[0] against OpenIndiana, and the semantics do match there.

[0] https://people.freebsd.org/~kevans/fdwalk_test_portable.c

So one thing I was worried about here is a corner case where someone ends up with a BIGNUM descriptor, with a long list of empty fds in front of it. The way forward which would be slower than the proposed patch, but would avoid any corner cases would export an array of used fds in chunks. Somewhat similar to what's happening with readdir over /proc/pid/fd, except avoiding the majority of that overhead. I may end up hacking it up this month.

Besides, perhaps fdwalk would be a better fit for libutil?

Alternatively, looking at https://gitlab.gnome.org/GNOME/glib/-/blob/main/glib/gspawn.c#L1470 , there is F_NEXTFD in solaris and it should be trivial to implement.

Final edit: glib only uses fdwalk for closefrom (if not implemented) or close-on-exec setting in close_range (if not implemented). The former was available for years now and I implemented the latter few weeks back (f3f3e3c44d3b1776653bbf19eab17ce006a815d8) so this would not be used to begin with at least by glib.

I talked to solaris people, both F_PREVFD and F_NEXTFD are considered internal and don't have documented semantics. As such, they are off, unless someone wants a FreeBSD-specific variant, named differently.

I also did code search on github. fdwalk *is* used in some places, for example gdb, thus it makes to sense to implement at some point.

Just bumped into this while poking around glib20. What's preventing getting this in?

The code works but has a nasty corner case -- what if the process has only 2 fds opened, say 0 and 8364742 (or some other high number). Then you copy out out rather huge bitmap and force it to scan it.

I think least bad way out is a sysctl which accepts a range of fds the process is interested in and a buffer + size to fill it. Then you get back an array of fds which fit and can continue.

For example you would call it with range 0..INT_MAX, but only have space for a number of fds which wont fit everything you got there. Denoting highest fd you got as HIGHEST_FD, you would then redo the call with HIGHEST_FD+1..INT_MAX and so on until you get 0, meaning there are no more fds.

This should be trivial to hack up, does not not induce a syscall for every fd and does not have nasty corner cases. I can review, but can't be arsed to write.