Page MenuHomeFreeBSD

aio: improve lock contention on the aio_job_mtx
AcceptedPublic

Authored by gallatin on Nov 11 2024, 7:14 PM.
Tags
None
Referenced Files
Unknown Object (File)
Tue, Jan 14, 6:04 AM
Unknown Object (File)
Mon, Jan 6, 1:38 PM
Unknown Object (File)
Sun, Jan 5, 10:16 PM
Unknown Object (File)
Sun, Jan 5, 6:34 PM
Unknown Object (File)
Sat, Jan 4, 11:23 PM
Unknown Object (File)
Wed, Dec 25, 5:54 PM
Unknown Object (File)
Dec 19 2024, 9:51 AM
Unknown Object (File)
Dec 18 2024, 12:52 AM

Details

Reviewers
kib
jhb
markj
imp
Summary

In some cases (nginx web server), a large number of processes using aio at the same time can put a lot of pressure on the aio job mutex. Work around this in 2 ways:

  • Take pressure off the aio_job mutex by managing num_aio_procs, refid and seqno using atomics. The existing code is a bit sloppy in managing num_aio_procs (count checked in parent, before forking child, then incremented in child), and I make no attempt to make it better. Due to this we can have at most 1 more proc per aio softc (which is 1 by default), which seems to be faithful to the current behavior
  • Allow creation of multiple aio software contexts. Eg, rather than a single aio job list, etc, shard that by process id to a larger number of aio contexts.

Note that since we hash aio to aio contexts via pid, multithreaded servers making heavy use of aio will still see the same contention. The decision to hash via pid is due to the process (not thread) struct keeping track of aio state, so hashing by pid was fairly straightforward and required no other modifications. And it "disappears" when the code uses a single aio context (which is the current default).

I've set the default number of context to 1, so this patch is mostly a noop. I was debating setting the default to "0", which autotunes based on number of cores, and provides the best scalability I was able to find for my workload. Interested in feedback on this, as the space tradeoff is fairly minimal, since the aio softc is tiny.

In a trivial benchmark using tools/regression/aio/aiop on a 32-core / 64-thread AMD server shows a roughly 6x speedup. The exact benchmark was 8 copies of "aiop $file 4096 1000000 255 ro "

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

sys/kern/vfs_aio.c
316

Extra blank line.

318

Can this var be static?

326

I think that this pid->aio_sc idea is incompatible with rfork().

433

Extra blank line

1681–1682

The herald comment promised that the job* vars are wrapped on overflow. I think it matters at least on 32bit arches.

1953

This should be KASSERT, but usually we do not even bother asserting non-NULL, allowing the CPU hw to catch it.

gallatin marked 4 inline comments as done.

Address Kib's feedbackj

sys/kern/vfs_aio.c
326

I think I see what you mean. You're saying p_aioinfo could be shared among parent / child, thus breaking pid hashing.
I think it would be safe to stash the softc pointer in p_aioinfo, and use it rather than the pid after p_aioinfo is setup.
I assume it would be safe to access p->p_aioinfo without holding the proc lock, because aio_proc_rundown() currently removes and NULLs p_aioinfo without holding the proc lock.
I'm going to try something like this:

`

diff --git a/FreeBSD/sys/kern/vfs_aio.c b/FreeBSD/sys/kern/vfs_aio.c.  


index bfcc82970601f..e22b07284e735 100644
--- a/FreeBSD/sys/kern/vfs_aio.c
+++ b/FreeBSD/sys/kern/vfs_aio.c
@@ -282,6 +282,7 @@ struct kaioinfo {
        TAILQ_HEAD(,kaiocb) kaio_syncready;  /* (a) second q for aio_fsync */
        struct  task kaio_task;         /* (*) task to kick aio processes */
        struct  task kaio_sync_task;    /* (*) task to schedule fsync jobs */
+       struct  aio_softc *aio_sc;      /* (*) aio softc pointer*/
 };
 
 #define AIO_LOCK(ki)           mtx_lock(&(ki)->kaio_mtx)
@@ -321,6 +322,9 @@ aio_proc_to_softc(struct proc *p)
 {
        unsigned int idx;
 
+       if (p->p_aioinfo != NULL) {
+               return (p->p_aioinfo->aio_sc);
+       }
        idx = p->p_pid % num_aio_sc;
        return (&aio_scs[idx]);
 }
@@ -485,6 +489,7 @@ aio_init_aioinfo(struct proc *p)
        TAILQ_INIT(&ki->kaio_syncready);
        TASK_INIT(&ki->kaio_task, 0, aio_kick_helper, p);
        TASK_INIT(&ki->kaio_sync_task, 0, aio_schedule_fsync, ki);
+       ki->aio_sc = aio_proc_to_softc(p);
        PROC_LOCK(p);
        if (p->p_aioinfo == NULL) {
                p->p_aioinfo = ki;

`
1681–1682

I'm not sure that I understand what you mean. Doesn't fetchadd wrap the same way a simple addition would?

1953

Yes, left over from initial debugging. Sorry.

sys/kern/vfs_aio.c
326

p_aioinfo is zeroed on fork. Perhaps it should be moved into the inherited section of struct proc then.

I mean that rfork, and not fork, could cause problems, where address space is shared between parent and child.

1681–1682

I mean, that the comment suggests that the wraparound should skip 0 and wraps to 1. It is easy to implement with non-atomic arithmetic under the lock, but requires CAS loop for lock-less.

If the zero jobrefid and jobseqno are fine, then my note is irrelevant.

sys/kern/vfs_aio.c
1681–1682

Looking at the old code, its just a simple increment, with no special handling of 0 that I can see. Unless I'm missing something, I don't think the old code did this either.

sys/kern/vfs_aio.c
326

Looking at the code, I'm not sure how this matters. For new aio requests, a null p_aioinfo results in a new one being allocated. For checking status / completion of existing requests (aio_return, aio_error) , it appears that they already don't work in a child, due the null p_aioinfo.

So it seems like hashing by pid is well suited to aio, since the p_aioinfo status is per-process, and children don't inherit their parent's aio requests..

Can you please explain what I'm missing?

sys/kern/vfs_aio.c
326

I mean e.g. rfork(RGPROC|RFMEM). My belief is that such child thread must be able to do aio_return(2) and aio_cancel(2) on requests from any other thread sharing this address space.

Perhaps it is broken now.

1681–1682

Yes.

I said, 'if this requires fixing, using atomics makes the fix harder'.

sys/kern/vfs_aio.c
326

I just tested aio with fork and rfork, and in today's system (umodified -current from ~1 month ago), rfork(RFPROC|RFMEM) prevents aio from working. See attached transcript with a toy test program and output.

I honestly don't see how it could be any other way, given that p_aioinfo seems to be zeroed irrespective of the flags passed to do_fork(){F102547320}

kib added inline comments.
sys/kern/vfs_aio.c
326

The test program has restricted access.

As I noted in previous message, I sort of expected it to be broken. I plan to look at fixing this after your commit landed.

This revision is now accepted and ready to land.Nov 14 2024, 12:49 AM

I wonder if another way to think about the sharding might be to not assign a process to a shard, but hash on the (pid, fd) or some such to pick the shard to queue an individual job to. That might help even within the same process while still preserving the relative order of requests for a single file descriptor. The latter somewhat matters in that you can queue multiple aio_read() requests for a socket which complete in FIFO order, and for LIO_SYNC jobs that need to drain the jobs in front of them. This would also mean that management of the job pool would still kind of be global in that you need to ensure there's the minimum number of workers for each shard/context.

sys/kern/vfs_aio.c
501

How does this work with multiple contexts? When the first AIO job is queued, I believe it will create N aio_procs all tied to the context of the first user process. When subsequent user processes call this routine, num_aio_procs will be large enough, so no kprocs will be created for the new context. I feel like num_aio_procs actually needs to be per-context (and then you don't need to deal with atomics for it since all the modifications to it will be under the per-context lock anyway).

That said, MAX_AIO_PROCS and TARGET_AIO_PROCS are also rather silly in modern systems. Their initial value should be computed based on mp_ncpus not static constants of 32 and 4, respectively.

My guess is each context wants a target of 1 kproc and max of 4 kproc (or something like that). Combined with your mp_ncpu-scaled context count this scales the number of kprocs to the number of contexts. I would also include the context number in the kproc name, and probably make the aiod_unr per context as well so you end with kproc names of aiodX_Y where X is the context ID and Y is the specific kproc in that context.

1098

Needs a blank line before the comment, but we also normally declare types up at the top of the file as a group.

1216

It might be more intuitive if the argument to this function was the struct aio_softc *.

1681

The whole jid thing can just die. The aiocb_private.kernelinfo field is write-only. Nothing reads it anymore. We used to read it back in and try to find a matching aiocb, but now we walk the linked list looking for a matching userspace pointer instead. Its only use was removed in commit 1ce9182407f671538287837f0bbd8cc05bf41873.

I would maybe start with just removing jobrefid and all the store_kernelinfo stuff entirely as an initial commit.

The job sequence number field will not handle wrap correctly, but since it's 64-bit that's probably good enough and using atomics for it is ok.

To kib's point about dealing with rfork(), maybe instead of hashing on the (pid, fd) tuple you hash on the resolved struct file * pointer that is saved in job->fd.

Super helpful review, John. I just opened a new review (https://reviews.freebsd.org/D47583) for the simplest suggested change. Will work on your other suggestions.

sys/kern/vfs_aio.c
1681

Thanks. I just opened https://reviews.freebsd.org/D47583 for this.