HPTS inserts a softclock for system call return that optimizes performance. However when
no HPTS threads need the help (i.e. when they have less than 100 or so connections) then
there should be little work done i.e. check the counter and return instead of running through
all the threads getting locks etc.
Details
Validate that we no longer do the direct_call unless we get over the set number (default = 100) of
connections on an individual HPTS
Diff Detail
- Lint
Lint Skipped - Unit
Tests Skipped
Event Timeline
Do you think you could also add a 'bool __read_mostly hpts_userrret_hook = true;' controlled by a sysctl to avoid calling into htps at all from userret for folks that want to disable this entirely?
With the counter(9) the increment is very cheap, but the fetch is not cheap. The patch adds fetch on every userret() while doing inc/dec very occasionally. IMHO, counter(9) is a wrong instrument to do that. A straightforward atomic(9) for write and unlocked read on read would be better.
I can easily change to an atomic.. I did not realize the fetch was expensive :)
Let me do that
sys/netinet/tcp_hpts.c | ||
---|---|---|
340 | I think 0 is correct here. Since the pointer is not NULL, the parameter after pointer is basically ignored. |
sys/netinet/tcp_hpts.c | ||
---|---|---|
219 | Would it make sense to use for this one bit in p_avail from line 196, since you use this a boolean? | |
1694 | Would it make sense to pul the above code after line 516, where p_on_queue_cnt is incremented? | |
1698 | Would it make sense to put the above lines after line 585, where p_on_queue_cnt is decremented? |
This tests well. On my test system hw.model: Intel(R) Xeon(R) CPU E5-2697A v4 @ 2.60GHz, I see a 50% reduction in syscall latency for the lmbench lat_syscall test (0.89us -> 0.44us) when the system is idle.
After putting load on the machine, I see net.inet.tcp.hpts.stats.direct_call: 131332474 (and increasing)
So I think this works as advertised.
sys/netinet/tcp_hpts.c | ||
---|---|---|
219 | Yeah I had forgotten that I had a bit array already | |
340 | Hmm is not the next parameter the "default value"? conn_cnt_thresh = DEFAULT_CONNECTION_THRESHOLD; at line 245, so it does not default to 0, but 100. | |
1694 | I wanted the place where we were exiting, and since we already are making comparisons At the insert-internal point you are not through processing all the inp's that are "to be processed". I thought it best |
sys/netinet/tcp_hpts.c | ||
---|---|---|
340 | Yes, the default comes from line 245. man SYSCTL_INT says: struct sysctl_oid * SYSCTL_ADD_INT(struct sysctl_ctx_list *ctx, struct sysctl_oid_list *parent, int number, const char *name, int ctlflags, int *ptr, int val, const char *descr); ... ptr Pointer to sysctl variable or string data. For sysctl values the pointer can be SYSCTL_NULL_XXX_PTR which means the OID is read-only and the returned value should be taken from the val argument. val If the ptr argument is SYSCTL_NULL_XXX_PTR, gives the constant value returned by this OID. Else this argument is not used. |
sys/netinet/tcp_hpts.c | ||
---|---|---|
1554 | Can you move this test into the #define of tcp_hpts_softclock in systm.h ? That would avoid a function call. |
sys/netinet/tcp_hpts.c | ||
---|---|---|
273 | Can you declare this in subr_trap.c, so its on the same cacheline as the tcp_hpts_softclock function pointer? |
Adopt Drew's optimization suggestion and reverted back the sysctl Michael pointed out since the arg is unused (and here I thought
all these years it was the default).
Why can't we just clear and set the function pointer itself? That would reduce impact on userret() down to just one instruction.
Gleb
I have thought about this a bit and there are some issues with it that I think in the end point to another way..
If we add
tcp_hpts_softclock = function
<and>
tcp_hpts_softclock = NULL
to the hpts changes around line 1687 it is true we could take away the check on hpts_that_need_softclock > 0
However I think that leads to a race condition since a thread might NULL the pointer
just after you do the
if (tcp_hpts_softclock != NULL)
and right before you do
tcp_hpts_softclock()
Which would lead to an occasional NULL de-ref :(
Now one could avoid that by doing something like
lc = tcp_hpts_softclock;
if (lc != NULL)
lc();
But then you are adding a var and a set of that variable. Which I think adds back the
one instruction you are trying to save.
Now an alternative I think is that you could reverse the test i.e.
if ((hpts_that_need_softclock > 0) && (tcp_hpts_sofrclock != NULL))
would yield, in the hpts_that_need_softclock =0 case, only the one comparison instruction
ran if the compiler is doing anything sensible :)
And if you want to get right down to it I think you could drop the tcp_hpts_softclock != NULL check.
The reason being
if you have 100 or more connections active on the hpts wheel of one thread then I suspect you will never
find that tcp_hpts_softclock == NULL. As long as no one else messes with hpts_that_need_softclock but hpts that is.
So the question becomes just a reversal of the condition (which saves an instruction in the 0 case) <or>
getting rid of the NULL check (which should I think work)
Thoughts??
I think the variable approach should actually end in less assembly than the existing one. Let me check that!
Now:
0xffffffff80cede5e <+254>: mov 0xaff763(%rip),%rax # 0xffffffff817ed5c8 <tcp_hpts_softclock> 0xffffffff80cede65 <+261>: cmp $0x0,%rax 0xffffffff80cede69 <+265>: je 0xffffffff80cede78 <userret+280> 0xffffffff80cede6f <+271>: mov 0xaff752(%rip),%rax # 0xffffffff817ed5c8 <tcp_hpts_softclock> 0xffffffff80cede76 <+278>: call *%rax 0xffffffff80cede78 <+280>: jmp 0xffffffff80cede7d <userret+285>
With variable:
101: 48 8b 05 00 00 00 00 movq (%rip), %rax # 0x108 <userret+0x108> 108: 48 89 45 c0 movq %rax, -0x40(%rbp) 10c: 48 83 7d c0 00 cmpq $0x0, -0x40(%rbp) 111: 0f 84 03 00 00 00 je 0x11a <userret+0x11a> 117: ff 55 c0 callq *-0x40(%rbp) 11a: e9 00 00 00 00 jmp 0x11f <userret+0x11f>
Well, for some reason clang decided to put it on stack instead of executing straight from %rax, which I expected. I expected it to just drop the second load! So this ended in the same number of instructions, and (not sure) pretty much same number of cycles. I don't know if clang is right or wrong compiling it that way. Anyway I think the variable is the way to go.
Patch I used:
--- a/sys/sys/systm.h +++ b/sys/sys/systm.h @@ -401,8 +401,9 @@ extern int cpu_disable_c3_sleep; extern void (*tcp_hpts_softclock)(void); #define tcp_hpts_softclock() do { \ - if (tcp_hpts_softclock != NULL) \ - tcp_hpts_softclock(); \ + void (*hptsp)(void) = tcp_hpts_softclock; \ + if (hptsp != NULL) \ + hptsp(); \ } while (0)
Tried to add register keyword - clang compiled but with the same code. Well, I know that this keyword is ignored in the last 20 years, but I tried :)
Another issue we will need to address is multiple hpts threads trying to set the tcp_hpts_softclock pointer. One can envision
a case where you have one that is just lowering it to zero, and so goes into the "set" to NULL block. And another thread that
is going from 0 -> 1 and so goes into the set to function block.
The order of that could leave the pointer in an incorrect state, so we would need some sort of overall mutex for all the hpts threads
to lock and play with the softclock pointer...
Right now its a simple atomic...
I suspect the simplest is just the check > 0 and call it.. the only danger is some other code being introduced that does something to
hpts_that_need_softclock without assuring it has set tcp_hpts_softclock (i.e. non-hpts code since hpts does this right now always).
But that would create the same race you mentioned above, wouldn't it? IMHO, using function pointer variable is the way to go and it potentially has a chance to get reduced to fewer instructions if compiler gets smarter.
I don't think we really care about few spurious calls into htps when a system is on a brink between using and not using the hpts userret hack.
I don't see how you have a race. If the module load is setting the pointer *not* the setting of the 1/0 then there is no race. Basically
you are assured that the function pointer is set by the module load and if hpts has set hpts_that_need_softclock you *know* that
the function pointer is valid, there is no need to check.
You might want to add a KASSERT in the path where you do the atomic set of hpts_that_need_softclock, but other than that
I don't see that there is a race, what am I missing?
My bad, didn't understood proposed change. Of course if pointer stays stable since module load, there is no race.
Cool I was thinking maybe I had missed something.
Let me make the change and update the diff.
R
sys/sys/systm.h | ||
---|---|---|
403 ↗ | (On Diff #136226) | There is an attribute (__read_frequently) that I think you should add to both of these variables. Otherwise, it looks good to me. |
sys/sys/systm.h | ||
---|---|---|
403 ↗ | (On Diff #136226) | Ok let me add that and update. |