Page MenuHomeFreeBSD

Optimize HPTS so that little work is done until we have a hpts thread that is over the connection threshold
ClosedPublic

Authored by rrs on Mar 19 2024, 10:18 AM.
Tags
None
Referenced Files
F108441767: D44420.id136253.diff
Fri, Jan 24, 7:26 PM
F108441659: D44420.id136305.diff
Fri, Jan 24, 7:25 PM
F108441627: D44420.id136038.diff
Fri, Jan 24, 7:25 PM
F108441625: D44420.id136037.diff
Fri, Jan 24, 7:25 PM
F108441624: D44420.id135992.diff
Fri, Jan 24, 7:25 PM
F108441621: D44420.id135932.diff
Fri, Jan 24, 7:25 PM
Unknown Object (File)
Tue, Jan 14, 1:04 PM
Unknown Object (File)
Sun, Jan 12, 1:43 PM

Details

Summary

HPTS inserts a softclock for system call return that optimizes performance. However when
no HPTS threads need the help (i.e. when they have less than 100 or so connections) then
there should be little work done i.e. check the counter and return instead of running through
all the threads getting locks etc.

Test Plan

Validate that we no longer do the direct_call unless we get over the set number (default = 100) of
connections on an individual HPTS

Diff Detail

Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

rrs requested review of this revision.Mar 19 2024, 10:18 AM

Do you think you could also add a 'bool __read_mostly hpts_userrret_hook = true;' controlled by a sysctl to avoid calling into htps at all from userret for folks that want to disable this entirely?

With the counter(9) the increment is very cheap, but the fetch is not cheap. The patch adds fetch on every userret() while doing inc/dec very occasionally. IMHO, counter(9) is a wrong instrument to do that. A straightforward atomic(9) for write and unlocked read on read would be better.

I can easily change to an atomic.. I did not realize the fetch was expensive :)

Let me do that

and yeah Drew we could add a block with sysctl

Updated to use an atomic instead of the u64_counter as Gleb suggests.

sys/netinet/tcp_hpts.c
343

I think 0 is correct here. Since the pointer is not NULL, the parameter after pointer is basically ignored.

sys/netinet/tcp_hpts.c
219

Would it make sense to use for this one bit in p_avail from line 196, since you use this a boolean?

1697

Would it make sense to pul the above code after line 516, where p_on_queue_cnt is incremented?

1701

Would it make sense to put the above lines after line 585, where p_on_queue_cnt is decremented?

This tests well. On my test system hw.model: Intel(R) Xeon(R) CPU E5-2697A v4 @ 2.60GHz, I see a 50% reduction in syscall latency for the lmbench lat_syscall test (0.89us -> 0.44us) when the system is idle.

After putting load on the machine, I see net.inet.tcp.hpts.stats.direct_call: 131332474 (and increasing)

So I think this works as advertised.

sys/netinet/tcp_hpts.c
219

Yeah I had forgotten that I had a bit array already

343

Hmm is not the next parameter the "default value"?
And if so its set:

conn_cnt_thresh = DEFAULT_CONNECTION_THRESHOLD;

at line 245, so it does not default to 0, but 100.

1697

I wanted the place where we were exiting, and since we already are making comparisons
to this value in the next lines following this it made sense.

At the insert-internal point you are not through processing all the inp's that are "to be processed". I thought it best
that only after all are processed and the hpts is going back to "sleep" should we evaluate these conditions to see
if support is needed for sys-call assistance.

Adopt Michael's suggestion to re-use one of the available bits for the flag.

sys/netinet/tcp_hpts.c
343

Yes, the default comes from line 245.

man SYSCTL_INT says:

struct sysctl_oid *
SYSCTL_ADD_INT(struct sysctl_ctx_list *ctx,
    struct sysctl_oid_list *parent, int number, const char *name,
    int ctlflags, int *ptr, int val, const char *descr);

...

ptr       Pointer to sysctl variable or string data.  For sysctl values
          the pointer can be SYSCTL_NULL_XXX_PTR which means the OID is
          read-only and the returned value should be taken from the val
          argument.

val       If the ptr argument is SYSCTL_NULL_XXX_PTR, gives the constant
          value returned by this OID.  Else this argument is not used.
sys/netinet/tcp_hpts.c
1557

Can you move this test into the #define of tcp_hpts_softclock in systm.h ? That would avoid a function call.

sys/netinet/tcp_hpts.c
273

Can you declare this in subr_trap.c, so its on the same cacheline as the tcp_hpts_softclock function pointer?

sys/netinet/tcp_hpts.c
273

Sure that makes sense. It will require not being static but thats probably ok :)

1557

Sure

Adopt Drew's optimization suggestion and reverted back the sysctl Michael pointed out since the arg is unused (and here I thought
all these years it was the default).

I do believe the extern is not needed here since systm.h is included.

Why can't we just clear and set the function pointer itself? That would reduce impact on userret() down to just one instruction.

Why can't we just clear and set the function pointer itself? That would reduce impact on userret() down to just one instruction.

Gleb

I have thought about this a bit and there are some issues with it that I think in the end point to another way..

If we add

tcp_hpts_softclock = function

<and>

tcp_hpts_softclock = NULL

to the hpts changes around line 1687 it is true we could take away the check on hpts_that_need_softclock > 0

However I think that leads to a race condition since a thread might NULL the pointer
just after you do the

if (tcp_hpts_softclock != NULL)

and right before you do

tcp_hpts_softclock()

Which would lead to an occasional NULL de-ref :(

Now one could avoid that by doing something like

lc = tcp_hpts_softclock;
if (lc != NULL)

lc();

But then you are adding a var and a set of that variable. Which I think adds back the
one instruction you are trying to save.

Now an alternative I think is that you could reverse the test i.e.

if ((hpts_that_need_softclock > 0) && (tcp_hpts_sofrclock != NULL))

would yield, in the hpts_that_need_softclock =0 case, only the one comparison instruction
ran if the compiler is doing anything sensible :)

And if you want to get right down to it I think you could drop the tcp_hpts_softclock != NULL check.

The reason being

if you have 100 or more connections active on the hpts wheel of one thread then I suspect you will never
find that tcp_hpts_softclock == NULL. As long as no one else messes with hpts_that_need_softclock but hpts that is.

So the question becomes just a reversal of the condition (which saves an instruction in the 0 case) <or>
getting rid of the NULL check (which should I think work)

Thoughts??

I think the variable approach should actually end in less assembly than the existing one. Let me check that!
Now:

0xffffffff80cede5e <+254>:     mov    0xaff763(%rip),%rax        # 0xffffffff817ed5c8 <tcp_hpts_softclock>
0xffffffff80cede65 <+261>:     cmp    $0x0,%rax
0xffffffff80cede69 <+265>:     je     0xffffffff80cede78 <userret+280>
0xffffffff80cede6f <+271>:     mov    0xaff752(%rip),%rax        # 0xffffffff817ed5c8 <tcp_hpts_softclock>
0xffffffff80cede76 <+278>:     call   *%rax
0xffffffff80cede78 <+280>:     jmp    0xffffffff80cede7d <userret+285>

With variable:

101: 48 8b 05 00 00 00 00          movq    (%rip), %rax            # 0x108 <userret+0x108>
108: 48 89 45 c0                   movq    %rax, -0x40(%rbp)
10c: 48 83 7d c0 00                cmpq    $0x0, -0x40(%rbp)
111: 0f 84 03 00 00 00             je      0x11a <userret+0x11a>
117: ff 55 c0                      callq   *-0x40(%rbp)
11a: e9 00 00 00 00                jmp     0x11f <userret+0x11f>

Well, for some reason clang decided to put it on stack instead of executing straight from %rax, which I expected. I expected it to just drop the second load! So this ended in the same number of instructions, and (not sure) pretty much same number of cycles. I don't know if clang is right or wrong compiling it that way. Anyway I think the variable is the way to go.
Patch I used:

--- a/sys/sys/systm.h
+++ b/sys/sys/systm.h
@@ -401,8 +401,9 @@ extern int  cpu_disable_c3_sleep;
 
 extern void    (*tcp_hpts_softclock)(void);
 #define        tcp_hpts_softclock()    do {                                    \
-               if (tcp_hpts_softclock != NULL)                         \
-                       tcp_hpts_softclock();                           \
+               void (*hptsp)(void) = tcp_hpts_softclock;               \
+               if (hptsp != NULL)                                      \
+                       hptsp();                                        \
 } while (0)

Tried to add register keyword - clang compiled but with the same code. Well, I know that this keyword is ignored in the last 20 years, but I tried :)

Why not just use

if (hpts_that_need_softclock > 0)

tcp_hpts_softclock()\

?

Another issue we will need to address is multiple hpts threads trying to set the tcp_hpts_softclock pointer. One can envision
a case where you have one that is just lowering it to zero, and so goes into the "set" to NULL block. And another thread that
is going from 0 -> 1 and so goes into the set to function block.

The order of that could leave the pointer in an incorrect state, so we would need some sort of overall mutex for all the hpts threads
to lock and play with the softclock pointer...

Right now its a simple atomic...

I suspect the simplest is just the check > 0 and call it.. the only danger is some other code being introduced that does something to
hpts_that_need_softclock without assuring it has set tcp_hpts_softclock (i.e. non-hpts code since hpts does this right now always).

In D44420#1014587, @rrs wrote:

Why not just use

if (hpts_that_need_softclock > 0)

tcp_hpts_softclock()\

?

But that would create the same race you mentioned above, wouldn't it? IMHO, using function pointer variable is the way to go and it potentially has a chance to get reduced to fewer instructions if compiler gets smarter.

I don't think we really care about few spurious calls into htps when a system is on a brink between using and not using the hpts userret hack.

In D44420#1014587, @rrs wrote:

Why not just use

if (hpts_that_need_softclock > 0)

tcp_hpts_softclock()\

?

But that would create the same race you mentioned above, wouldn't it? IMHO, using function pointer variable is the way to go and it potentially has a chance to get reduced to fewer instructions if compiler gets smarter.

I don't see how you have a race. If the module load is setting the pointer *not* the setting of the 1/0 then there is no race. Basically
you are assured that the function pointer is set by the module load and if hpts has set hpts_that_need_softclock you *know* that
the function pointer is valid, there is no need to check.

You might want to add a KASSERT in the path where you do the atomic set of hpts_that_need_softclock, but other than that
I don't see that there is a race, what am I missing?

My bad, didn't understood proposed change. Of course if pointer stays stable since module load, there is no race.

My bad, didn't understood proposed change. Of course if pointer stays stable since module load, there is no race.

Cool I was thinking maybe I had missed something.

Let me make the change and update the diff.

R

Update to Gleb and I's agreed upon reduction at the if in system.h

I'm fine as long as Drew is fine with the change. Thanks!

gallatin added inline comments.
sys/sys/systm.h
403 ↗(On Diff #136226)

There is an attribute (__read_frequently) that I think you should add to both of these variables. Otherwise, it looks good to me.

sys/sys/systm.h
403 ↗(On Diff #136226)

Ok let me add that and update.

Add Drew's read_frequently compiler directive.

This revision is now accepted and ready to land.Mar 27 2024, 1:18 PM

Is it worth MFC to stable/14 before releng/14.1 gets branched?

Is it worth MFC to stable/14 before releng/14.1 gets branched?

Done, since I think it is worth doing it. Thanks for the reminder!