ixgbe: Bring back accounting for tx in AIM
Needs Review · Public

Authored by kbowling on May 7 2021, 12:04 AM.

Details

Summary

Testing AIM algorithms

Test Plan

Tested on two X552s


Event Timeline

sys/dev/ixgbe/if_ix.c:2394

Looking at ixgbe_isc_txd_encap(), it seems txr->packets never gets adjusted.

sys/dev/ixgbe/if_ix.c:2394

That is true. Thanks for bringing in Tx too. We need to make the change in ixgbe_isc_txd_encap() to include txr->packets = txr->total_packets;
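
(Editorial sketch for context: a heavily simplified illustration of the per-ring TX accounting being discussed. The struct and helper below are stand-ins with assumed field names, not the driver's real struct tx_ring or ixgbe_isc_txd_encap().)

```c
#include <stdint.h>

/* Simplified stand-in for the driver's TX ring state. */
struct tx_ring_sketch {
	uint64_t total_packets;	/* lifetime packets handed to the NIC */
	uint64_t packets;	/* per-interval count consumed by AIM */
	uint64_t bytes;		/* per-interval byte count */
};

/* Called once per packet from the encap path so AIM can see TX load. */
static void
txd_encap_account(struct tx_ring_sketch *txr, uint32_t pktlen)
{
	txr->bytes += pktlen;
	txr->total_packets++;
	txr->packets++;
}

/* The AIM pass would read txr->packets/txr->bytes for the interval and
 * then zero them, mirroring the existing RX-side accounting. */
```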

Ok I'll add that accounting and report my findings

I have good news and bad news to report.

The good news is, AIM functions closer to intended on the sender heavy workload with the txr accounting in place.

The bad news: it occasionally lops off around 2 Gbps on a single-stream TCP TSO sender, with occasional packet loss in my test environment. On the receiver, AIM reduces single-stream UDP performance by about 1 Gbps and increases loss by about 20%. That seems like a bigger issue than the current situation, and I'd rather just change static int ixgbe_max_interrupt_rate = (4000000 / IXGBE_LOW_LATENCY); to use IXGBE_AVE_LATENCY as a break fix, instead of enabling AIM, while we continue to figure out this EITR interaction for the Intel drivers.
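
(For reference, the arithmetic behind that suggestion, assuming the IXGBE_*_LATENCY divisors the driver has historically carried in ixgbe.h; this just restates the existing constant in interrupts per second, it is not code from this diff.)

```c
/* Assumed divisors, matching the historical definitions in ixgbe.h. */
#define IXGBE_LOW_LATENCY	128
#define IXGBE_AVE_LATENCY	400
#define IXGBE_BULK_LATENCY	1200

/* Current default: 4000000 / 128  = 31250 interrupts/s per queue. */
static int ixgbe_max_interrupt_rate_low  = (4000000 / IXGBE_LOW_LATENCY);
/* Proposed break fix: 4000000 / 400 = 10000 interrupts/s per queue. */
static int ixgbe_max_interrupt_rate_ave  = (4000000 / IXGBE_AVE_LATENCY);
/* For comparison: 4000000 / 1200 ~= 3333 interrupts/s per queue. */
static int ixgbe_max_interrupt_rate_bulk = (4000000 / IXGBE_BULK_LATENCY);
```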

From my perspective there are two worthwhile paths to investigate: in one, we improve the AIM algorithm; in the other, we figure out what is going on in iflib and make it work the way it's supposed to -- we have enough information on the sender that we really shouldn't need to dynamically tune EITR, as far as I can tell. I'm less sure about the receiver, but I think that in the cases where FreeBSD is used, a correct static EITR value would be OK if we get the iflib re-arms correct. What do you think?

There are some optimizations in the iflib driver to decrease TX descriptor writeback (txq_max_rs_deferred; I think @gallatin mentioned this earlier). I wonder if this is just a matter of the old AIM algorithm being too aggressive and needing to be tamped down a bit for this batching.
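
(txq_max_rs_deferred is internal to iflib; the sketch below only illustrates the general RS-bit batching idea it is named after, with hypothetical names and a made-up threshold, not iflib's actual logic.)

```c
#include <stdint.h>

#define TX_RS_THRESH	32	/* hypothetical writeback batching threshold */
#define TXD_CMD_RS	0x08	/* "Report Status" bit, illustrative value */

struct tx_desc_sketch {
	uint32_t cmd;
};

struct tx_wb_state_sketch {
	uint32_t rs_pending;	/* packets since we last requested writeback */
};

/* Only ask the NIC to write back descriptor-done status every Nth packet.
 * This cuts PCIe writeback traffic but makes TX completions burstier --
 * one reason an AIM algorithm tuned for per-packet completions may need
 * to be tamped down. */
static void
tx_maybe_request_writeback(struct tx_wb_state_sketch *txr,
    struct tx_desc_sketch *txd)
{
	if (++txr->rs_pending >= TX_RS_THRESH) {
		txd->cmd |= TXD_CMD_RS;
		txr->rs_pending = 0;
	}
}
```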

sys/dev/ixgbe/if_ix.c:2373

I would refrain from using que->msix as the queue array index; this may not work when SR-IOV is enabled.
Ideally we would want to use "txr.me", but with the Rx and Tx queue separation, I think we may have to introduce a new "index" variable to explicitly capture the corresponding TxQ index for a given RxQ.
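
(A sketch of the indexing concern, with illustrative types: que->msix is the MSI-X vector number, which only happens to equal the queue index when vectors are assigned 1:1, while something like txr.me, or an explicitly stored TxQ index, names the ring itself. These are not the driver's real structures.)

```c
struct ix_tx_ring_sketch {
	int	me;		/* the ring's own index */
};

struct ix_rx_queue_sketch {
	int	msix;		/* MSI-X vector for this queue (may diverge
				 * from the queue index, e.g. with SR-IOV) */
	int	txq_index;	/* hypothetical: explicit paired TxQ index */
};

/* Look up the TX ring paired with an RX queue without assuming that the
 * MSI-X vector number doubles as the array index. */
static struct ix_tx_ring_sketch *
ix_paired_txr_sketch(struct ix_tx_ring_sketch *tx_rings,
    struct ix_rx_queue_sketch *que)
{
	return (&tx_rings[que->txq_index]);
}
```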

@kbowling ,

I have a similar observation (bad news) wrt UDP, but TCP looks just fine to me. My runs are all on a NetApp platform.
Please note: my client is not HoL.

client% sudo iperf3 -c 192.168.228.0 -i 5 -u -b 2G
Connecting to host 192.168.228.0, port 5201
[  4] local 192.168.227.254 port 24476 connected to 192.168.228.0 port 5201
[ ID] Interval           Transfer     Bitrate         Total Datagrams
[  4]   0.00-5.00   sec  1.16 GBytes  2.00 Gbits/sec  139506  
[  4]   5.00-10.00  sec  1.16 GBytes  2.00 Gbits/sec  139508  
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Jitter    Lost/Total Datagrams
[  4]   0.00-10.00  sec  2.33 GBytes  2.00 Gbits/sec  0.000 ms  0/279014 (0%)  sender
[  4]   0.00-10.09  sec  1.61 GBytes  1.37 Gbits/sec  0.017 ms  85679/279012 (31%)  receiver

iperf Done.

Wrt TCP, I do not see your observation. My lab NIC is an embedded 10G (X552).

client% sudo iperf3 -c 192.168.228.0 -i 5 -b 2G
Connecting to host 192.168.228.0, port 5201
[  4] local 192.168.227.254 port 38791 connected to 192.168.228.0 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  4]   0.00-5.00   sec  1.16 GBytes  2.00 Gbits/sec    1   4.33 MBytes       
[  4]   5.00-10.00  sec  1.16 GBytes  2.00 Gbits/sec    0   7.32 MBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  4]   0.00-10.00  sec  2.33 GBytes  2.00 Gbits/sec    1             sender
[  4]   0.00-10.00  sec  2.33 GBytes  2.00 Gbits/sec                  receiver

iperf Done.

client% sudo iperf3 -c 192.168.228.0 -i 5 -b 7G
Connecting to host 192.168.228.0, port 5201
[  4] local 192.168.227.254 port 38773 connected to 192.168.228.0 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  4]   0.00-5.00   sec  4.07 GBytes  7.00 Gbits/sec    0   3.74 MBytes
[  4]   5.00-10.00  sec  4.07 GBytes  7.00 Gbits/sec    0   3.74 MBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  4]   0.00-10.00  sec  8.15 GBytes  7.00 Gbits/sec    0             sender
[  4]   0.00-10.09  sec  8.15 GBytes  6.94 Gbits/sec                  receiver

iperf Done.

Also, I would prefer to have a quick call to discuss the ideas and thoughts we have. We would need an expert from Intel to help us understand AIM.
On a side note, in NetApp performance experiments on NetApp platforms comparing BSD11 (legacy driver) vs. BSD12 (iflib-based driver), we noticed almost a 3.5x-4x latency spike in one of the write tests for the iflib-based drivers.

@stallamr_netapp.com thanks; there is a variable here in that I am running in two VMs, among other things. I'm also diving into this code for the first time in 3 years, so this is new to me; I'm just trying to understand the problem in the drivers and hopefully fix it or find someone who can. @gnn is getting me access to the project's network lab, and I'll use that to see if I can take a look at the problem on other types of hardware.

I don't have any authority over Intel, but I agree it would be helpful if we could get them back on a regular call to discuss important networking development. Would you like me to send out a Google Calendar invite for an iflib meeting?

sys/dev/ixgbe/if_ix.c:2373

Ok I will think a bit harder on this, thanks for the feedback.

@stallamr_netapp.com are you still able to work on this? Netgate has been gracious enough to sponsor this work and help me get it over the finish line; I just landed default-on and much improved AIM for e1000 and igc.

For now I've just updated the patch to a version that works as intended with the legacy FreeBSD ixgbe(4) AIM algorithm, allows easy inspection (sysctl dev.ix | grep _rate), and supports runtime enable/disable (sysctl dev.ix.0.enable_aim=0) to make testing easier.

There's a much improved ixgbe algorithm in Linux that we can potentially obtain from Intel. But I am also concerned that it seems to have a glaring bug, treating 10 Gbit and 100 Mbit with the same divisor while the datasheets indicate the opposite, so even using it as a baseline may take a bit of laborious validation work to get 100% correct. The reason this might be worthwhile is that empirically the legacy AIM algorithm doesn't seem very good: for instance, on idle queues it over-dampens the holdoff timer far too much, whereas the code I imported for e1000/igc does what is desirable in significantly decreasing latency for idle queues; this is likely related to the issue with small-packet and UDP performance.
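
(To make the comparison concrete, a heavily simplified sketch of what a bytes-per-packet style AIM pass does: classify the interval's traffic, pick a target interrupt rate, and dampen toward it. The thresholds and rates below are placeholders, not the values in this diff or in the Linux driver.)

```c
#include <stdint.h>

/* Illustrative AIM update; returns the next target interrupt rate. */
static uint32_t
aim_next_rate_sketch(uint64_t bytes, uint64_t packets, uint32_t cur_rate)
{
	uint64_t avg_wire_size;
	uint32_t target;

	/* Idle queue: an algorithm can relax the holdoff timer here rather
	 * than dampening it further (the behavior criticized above). */
	if (packets == 0)
		return (cur_rate);

	avg_wire_size = bytes / packets;

	/* Small packets want low latency (more interrupts); bulk traffic
	 * tolerates and benefits from fewer interrupts. */
	if (avg_wire_size < 300)
		target = 20000;		/* interrupts/s, placeholder */
	else if (avg_wire_size < 1200)
		target = 12000;
	else
		target = 8000;

	/* Exponential smoothing so the rate does not oscillate from one
	 * interval to the next. */
	return ((cur_rate * 3 + target) / 4);
}
```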

kbowling marked 2 inline comments as done.
kbowling edited the summary of this revision. (Show Details)
kbowling edited the test plan for this revision. (Show Details)
kbowling added reviewers: cc, gallatin, imp.
kbowling added subscribers: cc, imp.

OK, this is a bit messy code- and comment-wise, but I have the new algorithm working in what I believe to be the correct way, with some bug fixes versus the original, and I would like some data to see how to proceed before tidying everything up.

@stallamr_netapp.com I would be interested to know if this patch works well for your workload.

@cc it looks like emulab has ix(4) on d820s nodes, would you be willing to take a look at these 3 options similar to the e1000 test?

  • Default in HEAD/STABLE: sysctl dev.ix.<N>.enable_aim=0
  • New algorithm (on by default with this patch) sysctl dev.ix.<N>.enable_aim=1
  • Old algorithm (FreeBSD <10) sysctl dev.ix.<N>.enable_aim=2

@imp @gallatin if you are able to test your workload, setting this to 1 and 2 would be new behavior versus where you are currently:

  • New algorithm (on by default with this patch) sysctl dev.ix.<N>.enable_aim=1
  • Old algorithm (FreeBSD <10) sysctl dev.ix.<N>.enable_aim=2
  • Default in HEAD/STABLE: sysctl dev.ix.<N>.enable_aim=0
sys/dev/ixgbe/if_ix.c:2336

@erj I think this is correct, although the datasheets don't specify 10M, 2.5G and 5G rates.

Minimum inter-interrupt interval specified in 2.048 µs units at 1 GbE and 10 GbE
link. At 100 Mb/s link, the interval is specified in 20.48 µs units.

I should probably submit a fix for the other driver if this works OK for us, as they have the 100 Mb, 1 G, 2.5 G and 5 G rates wrong.
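
(A sketch of the unit handling this implies, assuming the datasheet wording quoted above: the EITR interval is counted in 2.048 µs units at 1 GbE/10 GbE and 20.48 µs units at 100 Mb/s, with the 2.5G/5G cases assumed here to follow the 1G/10G behavior. Illustrative code, not the diff's.)

```c
#include <stdint.h>

/* Convert a target interrupt rate (interrupts/s) into an EITR interval
 * count, using a link-speed-dependent unit size. */
static uint32_t
eitr_interval_units_sketch(uint32_t intrs_per_sec, uint64_t link_mbps)
{
	uint64_t interval_ns, unit_ns;

	if (intrs_per_sec == 0)
		return (0);	/* caller would disable moderation separately */

	interval_ns = 1000000000ULL / intrs_per_sec;
	/* 20.48 us units at 100 Mb/s (and, presumably, 10 Mb/s);
	 * 2.048 us units at 1 GbE and above. */
	unit_ns = (link_mbps <= 100) ? 20480 : 2048;

	return ((uint32_t)(interval_ns / unit_ns));
}
```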

sys/dev/ixgbe/if_ix.c:2184

Why not move these defines into ixgbe.h?

sys/dev/ixgbe/if_ix.c:2336

I would assume you're doing the right thing here, too. I don't think there are any different rates for 2.5G and 5G.

Which other driver are you talking about, the Linux driver?

kbowling added inline comments.
sys/dev/ixgbe/if_ix.c:2184

Yes, certainly; this whole function is just a quick bodge to feel out the algorithm, and I will clean it up if it works well for the various parties.

sys/dev/ixgbe/if_ix.c:2336

Yes and thanks for taking a look.

@imp @gallatin if you are able to test your workload, setting this to 1 and 2 would be new behavior versus where you are currently:

I can pull this into our tree and make an image for @dhw to run on the A/B cluster. However, we're not using this hardware very much any more, and there is only 1 pair of machines using it in the A/B cluster. Lmk if you're still interested, and I'll try to build the image tomorrow so that David can test it at his leisure.

@imp @gallatin if you are able to test your workload, setting this to 1 and 2 would be new behavior versus where you are currently:

I can pull this into our tree and make an image for @dhw to run on the A/B cluster. However, we're not using this hardware very much any more, and there is only 1 pair of machines using it in the A/B cluster. Lmk if you're still interested, and I'll try to build the image tomorrow so that David can test it at his leisure.

Sure, it sounds like that is only enough for one experiment, so I would focus on the default algorithm the patch will boot with: sysctl dev.ix.<N>.enable_aim=1

OK, this is a bit messy code- and comment-wise, but I have the new algorithm working in what I believe to be the correct way, with some bug fixes versus the original, and I would like some data to see how to proceed before tidying everything up.
@cc it looks like emulab has ix(4) on d820s nodes, would you be willing to take a look at these 3 options similar to the e1000 test?

  • Default in HEAD/STABLE: sysctl dev.ix.<N>.enable_aim=0
  • New algorithm (on by default with this patch) sysctl dev.ix.<N>.enable_aim=1
  • Old algorithm (FreeBSD <10) sysctl dev.ix.<N>.enable_aim=2

OK. Thanks for letting me know about this patch. I will test it on d820s nodes in Emulab.
One question: why do you want to test it for FreeBSD releases < 10? Can I test it only on FreeBSD 15 (CURRENT)?

If you have any additional test ideas, please also let me know.

In D30155#1074152, @cc wrote:

OK, this is a bit messy code- and comment-wise, but I have the new algorithm working in what I believe to be the correct way, with some bug fixes versus the original, and I would like some data to see how to proceed before tidying everything up.
@cc it looks like emulab has ix(4) on d820s nodes, would you be willing to take a look at these 3 options similar to the e1000 test?

  • Default in HEAD/STABLE: sysctl dev.ix.<N>.enable_aim=0
  • New algorithm (on by default with this patch) sysctl dev.ix.<N>.enable_aim=1
  • Old algorithm (FreeBSD <10) sysctl dev.ix.<N>.enable_aim=2

OK. Thanks for letting me know about this patch. I will test it on d820s nodes in Emulab.
One question: why do you want to test it for FreeBSD releases < 10? Can I test it only on FreeBSD 15 (CURRENT)?

You only have to test one build with the patch and can toggle the behavior with the provided sysctls. enable_aim=2 is the older algorithm that used to be in FreeBSD.

If you have any additional test ideas, please also let me know.

@imp @gallatin if you are able to test your workload, setting this to 1 and 2 would be new behavior versus where you are currently:

I can pull this into our tree and make an image for @dhw to run on the A/B cluster. However, we're not using this hardware very much any more, and there is only 1 pair of machines using it in the A/B cluster. Lmk if you're still interested, and I'll try to build the image tomorrow so that David can test it at his leisure.

Sure, it sounds like that is only enough for one experiment, so I would focus on the default algorithm the patch will boot with: sysctl dev.ix.<N>.enable_aim=1

It's running now. Eyeballing command-line utilities, the CPU is about 5% higher (27% -> 32%) and we have 2x the irq rate (110k vs. 55k irq/sec).
When applying this, I wanted to give it a fair shake, and disabled this tunable: hw.ix.max_interrupt_rate=4000. Perhaps that was a mistake? Is there a runtime way to tweak the algorithm so it doesn't interrupt so fast under this level of load?

FWIW, aim:

c023.sjc003.dev# nstat 10
InMpps OMpps  InGbs  OGbs err TCP Est %CPU syscalls csw     irq GBfree
0.07   0.58   0.07   6.96  0   2923   32.43  23141 160546 108890  0.74

And control:

c024.sjc003.dev# nstat 10
InMpps OMpps  InGbs  OGbs err TCP Est %CPU syscalls csw     irq GBfree
0.09   0.68   0.14   8.18  0   3675   26.13  27043 116752  54672  0.74

I'll look at the a/b results tomorrow to get more comprehensive numbers

Drew

@imp @gallatin if you are able to test your workload, setting this to 1 and 2 would be new behavior versus where you are currently:

I can pull this into our tree and make an image for @dhw to run on the A/B cluster. However, we're not using this hardware very much any more, and there is only 1 pair of machines using it in the A/B cluster. Lmk if you're still interested, and I'll try to build the image tomorrow so that David can test it at his leisure.

Sure, it sounds like that is only enough for one experiment, so I would focus on the default algorithm the patch will boot with: sysctl dev.ix.<N>.enable_aim=1

It's running now. Eyeballing command-line utilities, the CPU is about 5% higher (27% -> 32%) and we have 2x the irq rate (110k vs. 55k irq/sec).
When applying this, I wanted to give it a fair shake, and disabled this tunable: hw.ix.max_interrupt_rate=4000. Perhaps that was a mistake? Is there a runtime way to tweak the algorithm so it doesn't interrupt so fast under this level of load?

Ahh yeah, it is expected to stabilize around 8000-9000 interrupts/s per queue at bulk data rates with this algorithm. Without your custom limit it would be hitting over 31.25k per queue, so it is an improvement over the current defaults but not as efficient as your custom tuning for this workload.

If you are able to do well with that much delay, I think AIM is not the right thing for you; instead you would want to set hw.ix.enable_aim=0 in addition to your current tunable to keep your behavior, if this review ends up landing for the default use cases.

FWIW, aim:

c023.sjc003.dev# nstat 10
InMpps OMpps  InGbs  OGbs err TCP Est %CPU syscalls csw     irq GBfree
0.07   0.58   0.07   6.96  0   2923   32.43  23141 160546 108890  0.74

And control:

c024.sjc003.dev# nstat 10
InMpps OMpps  InGbs  OGbs err TCP Est %CPU syscalls csw     irq GBfree
0.09   0.68   0.14   8.18  0   3675   26.13  27043 116752  54672  0.74

I'll look at the a/b results tomorrow to get more comprehensive numbers

I do appreciate your test, as it indicates the change is neither outright broken nor terrible. Unless it would for some reason be interesting to you to see how the 31.25k default compares to this proposed new default, I think you have enough to conclude it doesn't help you.

Drew

Yeah, my ideal irq rate per queue is < 1000. We mostly use Chelsio and Mellanox NICs that can do super aggressive irq coalescing without freaking out TCP, thanks to RX timestamps. Super aggressive coalescing like this lets us build packet trains in excess of 1000 packets to feed to LRO via RSS-assisted LRO, and we actually have useful LRO on internet workloads with tens of thousands of TCP connections per queue. That reminds me that I should port RSS-assisted LRO to iflib (e.g., lro_queue_mbuf()).

The a/b results were not surprising (boring, as David likes to say): just slightly higher CPU on the canary (due to the increased irq rate), but no clear streaming quality changes.
All in all, it seems to work and does no real harm, but we'll not use it due to the increased CPU.

The a/b results were not surprising (boring, as David likes to say): just slightly higher CPU on the canary (due to the increased irq rate), but no clear streaming quality changes.
All in all, it seems to work and does no real harm, but we'll not use it due to the increased CPU.

Thanks Drew, this is still good news to me in that it is explainable and expected. I will wait to see whether the other parties can confirm or deny a benefit before proceeding in any direction.

From my test results in testD30155, I didn't see any significant improvement:

  • no significant difference in ping latency
  • no significant iperf3 performance improvement, due to poor baseline throughput in FreeBSD 15-CURRENT (3.x Gbps) vs. a stock Linux 5.15 kernel (9.x Gbps)

In D30155#1075233, @cc wrote:

From my test results in testD30155, I didn't see any significant improvement:

  • no significant difference in ping latency
  • no significant iperf3 performance improvement, due to poor baseline throughput in FreeBSD 15-CURRENT (3.x Gbps) vs. a stock Linux 5.15 kernel (9.x Gbps)

Thanks for the results @cc. Something seems very strange with the throughput there: the main system I am testing is a Xeon D that is much less than 1/4 as powerful and can run at line rate in both directions with no issues, and I also have an older 2x Xeon E5-2695 v2 (two NUMA domains) without throughput limitations. I will see if I can find my Emulab credentials and take a look there; it seems like these might be 4-way NUMA machines, but I would not expect that to cause throughput issues of this magnitude, especially at the 10 Gbit data rate.

I am generally happy with the results so far and will clean the diff up for Intel to glance at before proceeding further. @stallamr_netapp.com I am hoping we hear something from NetApp, since you had the original issue, and hopefully this fixes it upstream.

In D30155#1075233, @cc wrote:

From my test results in testD30155, I didn't see any significant improvement:

  • no significant difference in ping latency
  • no significant iperf3 performance improvement, due to poor baseline throughput in FreeBSD 15-CURRENT (3.x Gbps) vs. a stock Linux 5.15 kernel (9.x Gbps)

Thanks for the results @cc. Something seems very strange with the throughput there: the main system I am testing is a Xeon D that is much less than 1/4 as powerful and can run at line rate in both directions with no issues, and I also have an older 2x Xeon E5-2695 v2 (two NUMA domains) without throughput limitations. I will see if I can find my Emulab credentials and take a look there; it seems like these might be 4-way NUMA machines, but I would not expect that to cause throughput issues of this magnitude, especially at the 10 Gbit data rate.

Let me know by email if you need any help on the Emulab topic. Thanks.