
if_epair: implement fanout
ClosedPublic

Authored by kp on Jan 3 2022, 8:45 PM.

Details

Summary

Allow multiple cores to be used to process if_epair traffic. We do this
(if RSS is enabled) based on the RSS hash of the incoming packet. This
allows us to distribute the load over multiple cores, rather than
sending everything to the same one.

We also switch from swi_sched() to taskqueues, which also contributes to
better throughput.
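
The fanout idea in the summary can be sketched in userland C as below. This is a minimal illustration, not the code from the patch: `pick_queue` and its parameters are hypothetical names, and the flow hash is assumed to be already computed (by RSS or otherwise). The point is only that packets of one flow always map to the same queue, while different flows spread over the available queues.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not from if_epair.c): choose a queue index for a
 * packet from its flow hash. With a single queue (for example when RSS
 * is not enabled) everything lands on queue 0.
 */
static unsigned
pick_queue(uint32_t flow_hash, unsigned num_queues)
{
	if (num_queues <= 1)
		return (0);
	/* Same hash, same queue: packets of one flow stay in order. */
	return (flow_hash % num_queues);
}
```

Because the mapping depends only on the hash, a single flow never benefits from more than one core; the gain comes from having many flows.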

Benchmark results:
With net.isr.maxthreads=-1

Setup A: (cc0 - bridge0 - epair0a) (epair0b - bridge1 - cc1)

Before 627 Kpps
After (no RSS) 1.198 Mpps
After (RSS) 3.148 Mpps

Setup B: (cc0 - bridge0 - epair0a) (epair0b - vnet jail - epair1a) (epair1b - bridge1 - cc1)

Before 7.705 Kpps
After (no RSS) 1.017 Mpps
After (RSS) 2.083 Mpps

MFC after: 3 weeks
Sponsored by: Orange Business Services

Diff Detail

Repository
rG FreeBSD src repository

Event Timeline

kp requested review of this revision. Jan 3 2022, 8:45 PM

Where do your Setup B, After (RSS) numbers come from? Why are they worse than no RSS? That smells like something is bouncing around rather than sticking.

sys/net/if_epair.c
116

Do you still need this now?

sys/net/if_epair.c
551

This doesn't scale if you will have 500 epairs on a system.

You probably really want to have a global set of tasks, CPU bound, and balance all epairs over those?

(a scenario your test cases do not consider at all)

573

Likewise...

In D33731#762671, @bz wrote:

Where do your Setup B, After (RSS) numbers come from? Why are they worse than no RSS? That smells like something is bouncing around rather than sticking.

That's the observed result. It's not clear to me why it's significantly worse than the simple setup, but your suspicion seems plausible.

sys/net/if_epair.c
116

No, that can go.

551

Right, so we'd create num_queues task queue threads, presumably from MOD_LOAD, and then assign tasks to them from individual epairs. We pass the epair_queue, so all relevant information should be present. I'll see if I can work up a patch to do that, and also how well that works.

sys/net/if_epair.c
551

For simplicity I'd just create MIN(maxncpu, num_queues) threads, but for even more simplicity I'd just do either 1 or maxncpu and then have an epair queue per CPU. That means 500 queues will eventually fight for 1 task, but a CPU can only do so much, and assuming RSS is working well the overall load should balance out over all CPUs.

The former will get you back to where we have been and might be a sane default for people who run one vnet with low traffic, or very few, on a low-end system. On other systems, for crypto and all kinds of other things, we do create a full set of threads per CPU, which with 256(+) threads/cores/CPUs will also eventually be interesting to see scale, but that'll be someone else's problem to generally solve, I assume.
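
A rough userland sketch of the sizing rule discussed here, under stated assumptions: `shared_worker_count` and `worker_for_queue` are hypothetical names, not functions from the patch, and the worker set is assumed to be created once (for example at module load) and shared by every epair.

```c
/*
 * Hypothetical: size of the single, global set of worker threads.
 * Mirrors the MIN(maxncpu, num_queues) suggestion, with a floor of 1.
 */
static unsigned
shared_worker_count(unsigned ncpus, unsigned num_queues)
{
	unsigned n = (ncpus < num_queues) ? ncpus : num_queues;

	return (n == 0 ? 1 : n);
}

/*
 * Hypothetical: which shared worker services queue `qidx` of any epair.
 * Every epair uses the same mapping, so 500 epairs still share only
 * `nworkers` threads (assumes nworkers >= 1).
 */
static unsigned
worker_for_queue(unsigned qidx, unsigned nworkers)
{
	return (qidx % nworkers);
}
```

The thread count stays bounded by the CPU count no matter how many epairs exist, which addresses the 500-epair concern at the cost of queues from different epairs contending for the same worker.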

Single set of taskqueues

In D33731#763409, @kp wrote:

Single set of taskqueues

Something like this?

This still has at least some issues, because enabling RSS doesn't improve (and may actually reduce) throughput. I'm not quite clear on why that would be though.

mjg added inline comments.
sys/net/if_epair.c
176–177

should be fcmpset and the atomic load hoisted out of the loop

808

You should walk all CPUs in order to obtain correct memory locality vs numa. I'm not aware of any sleep-friendly way to execute a callback on all CPUs, but you can just bind yourself like quiesce_cpus.
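
The fcmpset suggestion can be illustrated with C11 atomics, whose `atomic_compare_exchange_weak` has the same useful property: on failure it writes the current value back into the expected variable, so the load only has to happen once, before the loop. This is a userland sketch, not the kernel code under review; `swap_ring_index` and the two-slot index are assumptions made for illustration.

```c
#include <stdatomic.h>

/*
 * Hypothetical stand-in for flipping a two-slot ring index (0 <-> 1).
 * Returns the index that was current before the flip.
 */
static unsigned
swap_ring_index(atomic_uint *idx)
{
	/* Load once, outside the loop. */
	unsigned cur = atomic_load(idx);
	unsigned next;

	do {
		next = (cur == 0) ? 1 : 0;
		/*
		 * On failure, `cur` is refreshed with the value actually
		 * found in *idx, so no explicit reload is needed.
		 */
	} while (!atomic_compare_exchange_weak(idx, &cur, next));

	return (cur);
}
```

With a plain compare-and-set that does not report the observed value, each retry needs its own load, which is the extra work the comment asks to remove.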

Thanks for the advice! I'm going to experiment with that.

In the meantime I've posted a slightly simplified version of this (basically this patch without RSS) because it somewhat improves the basic case, and vastly improves the pathological case. See D33853.

I'll eventually rebase this patch on top of that, but need to work on these suggestions first.

.. if you're distributing workload with RSS and you set the number of RSS netisr contexts, that's where you can run your parallelism. The whole point here is to whack all the net processing in netisr contexts rather than keep adding taskqueues and inventing new/fun ways to map cpus and figure out how many.

we still don't suitably autoconfigure netisr contexts based on how many cpus we have ... :P maybe we should finally fix that.

.. if you're distributing workload with RSS and you set the number of RSS netisr contexts, that's where you can run your parallelism. The whole point here is to whack all the net processing in netisr contexts rather than keep adding taskqueues and inventing new/fun ways to map cpus and figure out how many.

we still don't suitably autoconfigure netisr contexts based on how many cpus we have ... :P maybe we should finally fix that.

epair in its initial incarnation was able to use the netisr contexts and bind epairs to CPUs (still a problem on multi-socket). But the problem was that netisr wasn't keeping up. If you created an epair per CPU the system was long toast. Now the dimensions (the bindings) have changed, given Kristof has implemented a fanout over all CPUs so that even a single epair could use all of them, and the balancing is different; but the end effect even with
{{{
net.isr.bindthreads=1
net.isr.maxthreads=256 # net.isr will always reduce it to mp_ncpus
}}}
will be the same I bet.
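
The clamping mentioned in the tunable comment can be sketched as follows. This is a userland paraphrase under assumptions, not the netisr source: `effective_netisr_threads` is a hypothetical name, and treating -1 as "one thread per CPU" is inferred from how the benchmark in the summary uses net.isr.maxthreads=-1.

```c
/*
 * Hypothetical paraphrase of how a requested netisr thread count is
 * reduced to something usable on a machine with `ncpus` CPUs.
 */
static int
effective_netisr_threads(int requested, int ncpus)
{
	if (requested == -1)		/* assumed: "use every CPU" */
		return (ncpus);
	if (requested < 1)		/* assumed: fall back to one thread */
		return (1);
	/* Never more threads than CPUs, as the comment above notes. */
	return (requested > ncpus ? ncpus : requested);
}
```

So on an 8-CPU machine, asking for 256 threads and asking for -1 would both end up with 8 under these assumptions.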

kp edited the summary of this revision. Jan 17 2022, 9:47 PM

.. if you're distributing workload with RSS and you set the number of RSS netisr contexts, that's where you can run your parallelism. The whole point here is to whack all the net processing in netisr contexts rather than keep adding taskqueues and inventing new/fun ways to map cpus and figure out how many.

we still don't suitably autoconfigure netisr contexts based on how many cpus we have ... :P maybe we should finally fix that.

That was a useful pointer, and with net.isr.maxthreads=-1 we get a bit more benefit from enabling RSS and attempting to spread the load.
I'm still not excited about these numbers, but they are significantly better than they used to be (spectacularly so for setup B), and both scenarios benefit from RSS now.

I've updated the commit message with my latest test results.

This revision was not accepted when it landed; it landed in state Needs Review. Feb 15 2022, 8:04 AM
Closed by commit rG24f0bfbad57b: if_epair: implement fanout (authored by kp).
This revision was automatically updated to reflect the committed changes.