iflib-enabled ixgbe driver update heavily influenced by sbruno and mmacy's work on D5213.
Details
- Reviewers: sbruno, scottl, kmacy, jeffrey.e.pieper_intel.com, shurd, skozlov
- Group Reviewers: Intel Networking
- Commits: rS327031: ixgbe(4): Convert driver to use iflib
Compile-tested with minimal touch-testing for the PF driver (none for VF). Respectfully request waiting for Jeff's team to perform a validation pass before committing.
Diff Detail
- Lint: Lint Passed
- Unit: No Test Coverage
- Build Status: Buildable 11819, Build 12162: arc lint + arc unit
Event Timeline
I might be missing something, but that was what I modified before our Thursday morning meeting. It should be using the isc_n[rt]xqsets_max members now.
Regarding the redefinition of CSUM_TCP/etc., I see what Matt was doing in terms of having these defines mean something more than just a renaming of some other macro. And I like the intention of this; it mimics what the CSUM_TSO macro does. However, I removed those undef/defines from ix_txrx.c in favor of explicitly using, for example, "(CSUM_IP_TCP | CSUM_IP6_TCP)". Those lines still fit in 80 columns. :)
I've tested 82599ES in single and LAGG configurations. This all seems good to me.
I'm moving on to test with X550-T2. I have no easy way to test VFs, so that'll need to be exercised.
Did you check RSS? Using the default 32 rx queues, rx_bytes and rx_packets only increment on queues 0-15 with 100 threads of TCP traffic.
About a 30% performance regression (from 2.8 Mpps to about 1.9 Mpps) when forwarding smallest-size packets:
x r322489, packets-per-second
+ r322489 with D11727, packets-per-second
[ministat distribution plot]
    N           Min           Max        Median           Avg        Stddev
x   5     2828732.5     2886374.5       2856519     2855953.4     20817.191
+   5       1883367       2165891       1890597     1944843.5     123630.95
Difference at 95.0% confidence
        -911110 +/- 129292
        -31.9021% +/- 4.49352%
        (Student's t, pooled s = 88650.9)
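As a sanity check on the figures above, the following is a minimal sketch that recomputes the difference, the pooled standard deviation, and the absolute 95% interval from the summary rows alone. It assumes the usual pooled-variance Student's t comparison; the function name `compare` and the constant `T_95_DF8` are illustrative and are not taken from ministat itself. The critical value 2.306 is the standard two-sided 95% value for 8 degrees of freedom.

```python
import math

# Two-sided 95% Student's t critical value for 8 degrees of freedom
# (5 + 5 samples - 2), taken from a standard t table.
T_95_DF8 = 2.306


def compare(n_ref, avg_ref, sd_ref, n_new, avg_new, sd_new, t=T_95_DF8):
    """Pooled-variance comparison of two sample sets from summary statistics."""
    pooled_var = ((n_ref - 1) * sd_ref ** 2 + (n_new - 1) * sd_new ** 2) / (
        n_ref + n_new - 2
    )
    pooled_s = math.sqrt(pooled_var)
    diff = avg_new - avg_ref
    # Half-width of the confidence interval on the absolute difference.
    half_width = t * pooled_s * math.sqrt(1 / n_ref + 1 / n_new)
    return {
        "diff": diff,
        "half_width": half_width,
        "pct": 100 * diff / avg_ref,
        "pooled_s": pooled_s,
    }


# Summary rows from the r322489 vs. r322489 + D11727 comparison.
result = compare(5, 2855953.4, 20817.191, 5, 1944843.5, 123630.95)
print(result)
```

The sketch deliberately does not reproduce the relative (percentage) interval, which depends on how the tool propagates the error into the ratio.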
- Re-commit patches on top of commit from review D11712
- Move the stats updates to the admin update task.
- Removed setup_optics since it isn't useful code.
- Moved check for link from the timer context to the admin task.
- Changed iflib member for setting number of Tx/Rx queues in PF driver.
- Incorporated more of sbruno's suggestions for D11727 review.
- Removed SR-IOV code from the VF driver.
- Fixed dmac sysctl reporting negative value due to incompatible type.
- Adjusted other sysctl functions to use the correct handler function.
- Moved ixgbe_if_init() declaration into ixgbe_sriov.h since both if_ix.c and if_sriov.c include that header file and it's no longer declared for the VF driver that can't link to it anyway.
- Allow for VLAN filtering in VFs by re-adding capability flag.
- Add function prototype for ixgbe_tx_ctx_setup() to fix compilation error.
Move the ixgbe_if_init() function declaration outside of the #ifdef PCI_IOV block, since PCI_IOV might not be defined when compiling the driver.
- Rebased on top of D12496
- Prevent attach from completing if legacy interrupts are configured on devices that do not support this mode.
I found pretty bad behavior with LRO enabled on this change.
In a two-system configuration, with both systems using 82599ES 10-Gigabit SFI/SFP+ adapters back to back, runs with a lower number of connections (using iperf3) go much slower. Disabling LRO (ifconfig ix0 -lro) on the receiver works around the problem for now:
root@syssw07:~ # iperf3 -c 192.168.100.101 -P4 -i60
Connecting to host 192.168.100.101, port 5201
[  5] local 192.168.100.100 port 21018 connected to 192.168.100.101 port 5201
[  7] local 192.168.100.100 port 34156 connected to 192.168.100.101 port 5201
[  9] local 192.168.100.100 port 42857 connected to 192.168.100.101 port 5201
[ 11] local 192.168.100.100 port 22922 connected to 192.168.100.101 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-10.00  sec   313 MBytes   262 Mbits/sec  182   8.75 KBytes
[  7]   0.00-10.00  sec   346 MBytes   290 Mbits/sec  383   8.75 KBytes
[  9]   0.00-10.00  sec   211 MBytes   177 Mbits/sec  252   8.75 KBytes
[ 11]   0.00-10.00  sec   318 MBytes   266 Mbits/sec  172   8.75 KBytes
[SUM]   0.00-10.00  sec  1.16 GBytes   995 Mbits/sec  989
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   313 MBytes   262 Mbits/sec  182   sender
[  5]   0.00-10.00  sec   312 MBytes   262 Mbits/sec        receiver
[  7]   0.00-10.00  sec   346 MBytes   290 Mbits/sec  383   sender
[  7]   0.00-10.00  sec   345 MBytes   290 Mbits/sec        receiver
[  9]   0.00-10.00  sec   211 MBytes   177 Mbits/sec  252   sender
[  9]   0.00-10.00  sec   211 MBytes   177 Mbits/sec        receiver
[ 11]   0.00-10.00  sec   318 MBytes   266 Mbits/sec  172   sender
[ 11]   0.00-10.00  sec   317 MBytes   266 Mbits/sec        receiver
[SUM]   0.00-10.00  sec  1.16 GBytes   995 Mbits/sec  989   sender
[SUM]   0.00-10.00  sec  1.16 GBytes   994 Mbits/sec        receiver
iperf Done
Versus an 8-connection test:
root@syssw07:~ # iperf3 -c 192.168.100.101 -P8 -i60
Connecting to host 192.168.100.101, port 5201
[  5] local 192.168.100.100 port 65287 connected to 192.168.100.101 port 5201
[  7] local 192.168.100.100 port 51272 connected to 192.168.100.101 port 5201
[  9] local 192.168.100.100 port 56056 connected to 192.168.100.101 port 5201
[ 11] local 192.168.100.100 port 10030 connected to 192.168.100.101 port 5201
[ 13] local 192.168.100.100 port 45967 connected to 192.168.100.101 port 5201
[ 15] local 192.168.100.100 port 56170 connected to 192.168.100.101 port 5201
[ 17] local 192.168.100.100 port 10231 connected to 192.168.100.101 port 5201
[ 19] local 192.168.100.100 port 33270 connected to 192.168.100.101 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-10.00  sec  1.43 GBytes  1.23 Gbits/sec    0   1.78 MBytes
[  7]   0.00-10.00  sec  1.45 GBytes  1.24 Gbits/sec    0   1.77 MBytes
[  9]   0.00-10.00  sec  1.41 GBytes  1.21 Gbits/sec   11   1.09 MBytes
[ 11]   0.00-10.00  sec  2.83 GBytes  2.43 Gbits/sec   22   1.34 MBytes
[ 13]   0.00-10.00  sec   984 MBytes   826 Mbits/sec    0   1.78 MBytes
[ 15]   0.00-10.00  sec   979 MBytes   821 Mbits/sec    0   1.78 MBytes
[ 17]   0.00-10.00  sec  1.43 GBytes  1.23 Gbits/sec    0   1.77 MBytes
[ 19]   0.00-10.00  sec   975 MBytes   818 Mbits/sec    0   1.78 MBytes
[SUM]   0.00-10.00  sec  11.4 GBytes  9.81 Gbits/sec   33
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  1.43 GBytes  1.23 Gbits/sec    0   sender
[  5]   0.00-10.01  sec  1.43 GBytes  1.23 Gbits/sec        receiver
[  7]   0.00-10.00  sec  1.45 GBytes  1.24 Gbits/sec    0   sender
[  7]   0.00-10.01  sec  1.45 GBytes  1.24 Gbits/sec        receiver
[  9]   0.00-10.00  sec  1.41 GBytes  1.21 Gbits/sec   11   sender
[  9]   0.00-10.01  sec  1.41 GBytes  1.21 Gbits/sec        receiver
[ 11]   0.00-10.00  sec  2.83 GBytes  2.43 Gbits/sec   22   sender
[ 11]   0.00-10.01  sec  2.83 GBytes  2.43 Gbits/sec        receiver
[ 13]   0.00-10.00  sec   984 MBytes   826 Mbits/sec    0   sender
[ 13]   0.00-10.01  sec   983 MBytes   824 Mbits/sec        receiver
[ 15]   0.00-10.00  sec   979 MBytes   821 Mbits/sec    0   sender
[ 15]   0.00-10.01  sec   978 MBytes   820 Mbits/sec        receiver
[ 17]   0.00-10.00  sec  1.43 GBytes  1.23 Gbits/sec    0   sender
[ 17]   0.00-10.01  sec  1.43 GBytes  1.22 Gbits/sec        receiver
[ 19]   0.00-10.00  sec   975 MBytes   818 Mbits/sec    0   sender
[ 19]   0.00-10.01  sec   974 MBytes   816 Mbits/sec        receiver
[SUM]   0.00-10.00  sec  11.4 GBytes  9.81 Gbits/sec   33   sender
[SUM]   0.00-10.01  sec  11.4 GBytes  9.80 Gbits/sec        receiver
iperf Done.
- Fix support for ifconfig's vlanhwtag flag.
Disabling this flag will now prevent the driver from stripping vlan tags from packets.
Ok, my report is due to the test systems disabling AIM in sysctl.conf.
At this point, AIM should be purged from the iflib version of ixgbe(4).
sys/dev/ixgbe/if_ix.c, line 3873: Is this function even used now? I see no references to it in the code, and the FreeBSD build bails out.
I tried to rerun my previous benchmarks on a recent head (r325618), but this review needs to be updated because it no longer applies (a small part was committed to head).
Compilation also fails on this recent head:
--- if_ixv.o ---
/src/sys/dev/ixgbe/if_ixv.c:1916:1: warning: unused function 'ixv_set_sysctl_value' [-Wunused-function]
ixv_set_sysctl_value(struct adapter *adapter, const char *name,
(...)
--- if_ix.o ---
/src/sys/dev/ixgbe/if_ix.c:1850:28: error: no member named 'num_queues' in 'struct adapter'
        for (i = 0; i < adapter->num_queues; i++) {
                        ~~~~~~~  ^
/src/sys/dev/ixgbe/if_ix.c:1851:20: error: no member named 'rx_rings' in 'struct adapter'
                rxr = &adapter->rx_rings[i];
                       ~~~~~~~  ^
- Remove set_sysctl_value(), AIM and operations on IXGBE_RXCSUM in VF
Remove ixgbe_set_sysctl_value() and AIM code from ixgbe (done by Sean Bruno).
Remove read and write operations on the IXGBE_RXCSUM register in the VF.
The VF has no access to the IXGBE_RXCSUM register; reads or writes of this register were causing a kernel panic.
Here are my updated "forwarding smallest packet size" bench results on two different hardware platforms.
I'm using a fresh head (r325763) and the latest diff (35190) of this review.
First, on an 8-core Atom with an Intel 82599ES 10-Gigabit:
x head r325763: inet4 packets-per-second
+ head r325763 with D11727: inet4 packets-per-second
[ministat distribution plot]
    N           Min           Max        Median           Avg        Stddev
x   5       2885856       2937099     2899727.5     2908672.7     23606.055
+   5       2728207       2768623       2757406       2748964     17584.618
Difference at 95.0% confidence
        -159709 +/- 30356.4
        -5.49078% +/- 1.00717%
        (Student's t, pooled s = 20814.2)
Second, on a dual-CPU Xeon E5-2650 (12 cores) with an Intel 82599ES 10-Gigabit (using the default 8 queues):
x head r325763: inet4 packets-per-second
+ head r325763 with D11727: inet4 packets-per-second
[ministat distribution plot]
    N           Min           Max        Median           Avg        Stddev
x   5       4916319       5003793       4989624       4976294     36329.238
+   5       3244804       3392974       3312977     3314548.8     56867.158
Difference at 95.0% confidence
        -1.66175e+06 +/- 69591.5
        -33.3932% +/- 1.28076%
        (Student's t, pooled s = 47716.3)
@olivier Can you post the sysctl/loader.conf settings you are using from both iterations of the test?
Sure, here is the /boot/loader.conf:
### Use next-gen MRSAS drivers in place of MFI for device supporting it
# This solve lot's of [mfi] COMMAND 0x... TIMEOUT AFTER ## SECONDS
hw.mfi.mrsas_enable="1"
net.fibs="16" # Numbers of FIB
hw.usb.no_pf="1" # Disable USB packet filtering
# Disable HyperThreading (no benefit on a router)
# https://bsdrp.net/documentation/technical_docs/performance#disabling_hyper_threading
machdep.hyperthreading_allowed="0"
# Don't limit the maximum of number of received packets to process at a time
hw.ix.rx_process_limit="-1"
# Allow unsupported SFP
hw.ix.unsupported_sfp="1"
hw.ix.allow_unsupported_sfp="1"
# Avoid message netisr_register: epair requested queue limit 430080 capped to net.isr.maxqlimit 1024
net.isr.maxqlimit=430080
and /etc/sysctl.conf:
# Do not generate core file
kern.coredump=0
#Power save: Disable power fo
And the relevant settings in /etc/rc.conf (TSO and LRO disabled):
ifconfig_ix0="inet 198.18.0.12/24 -tso4 -tso6 -lro -vlanhwtso"
ifconfig_ix1="inet 198.19.0.12/24 -tso4 -tso6 -lro -vlanhwtso"
# Disable INTERRUPT and NET_ETHER from entropy sources
harvest_mask="351"
Do you want the flame graphs of these benches too?
Hrm... is that two E5-2651s for 24 cores total, or two 6-core E5s for 12 cores total?
Yes: there are 2 CPUs with 12 cores each (HT is disabled), and by default the ixgbe driver uses a maximum of 8 MSI-X queues (I've kept the default because I don't want to fall into the NUMA trap).
Are you sure about the number of queues? For the iflib version of the driver, the default number is based on the number of cores. Are you setting the dev.ix.X.override_n(r|t)xqs sysctls?
Let me check on my /var/run/dmesg:
FreeBSD/SMP: Multiprocessor System Detected: 24 CPUs
FreeBSD/SMP: 2 package(s) x 12 core(s) x 2 hardware threads
FreeBSD/SMP Online: 2 package(s) x 12 core(s)
(...)
ix0: <Intel(R) PRO/10GbE PCI-Express Network Driver> port 0x2020-0x203f mem 0x91d00000-0x91dfffff,0x91e04000-0x91e07fff irq 44 at device 0.0 numa-domain 0 on pci5
ix0: using 2048 tx descriptors and 2048 rx descriptors
ix0: msix_init qsets capped at 32
ix0: pxm cpus: 12 queue msgs: 63 admincnt: 1
ix0: using 24 rx queues 24 tx queues
ix0: Using MSIX interrupts with 25 vectors
ix0: allocated for 24 queues
ix0: allocated for 24 rx queues
ix0: Ethernet address: 24:6e:96:5b:92:80
ix0: PCI Express Bus: Speed 5.0GT/s Width x8
ix0: netmap queues/slots: TX 24/2048, RX 24/2048
And vmstat agrees too:
[root@r630]~# vmstat -ai | grep ix0
irq288: ix0:rxq0                  303008         12
irq289: ix0:rxq1                   60786          2
irq290: ix0:rxq2                  172906          7
irq291: ix0:rxq3                   13460          1
irq292: ix0:rxq4                 2479751        101
irq293: ix0:rxq5                 2390323         97
irq294: ix0:rxq6                 2030967         82
irq295: ix0:rxq7                 1917268         78
irq296: ix0:rxq8                 3642724        148
irq297: ix0:rxq9                 3612677        146
irq298: ix0:rxq10                3625123        147
irq299: ix0:rxq11                3617915        147
irq300: ix0:rxq12                1363843         55
irq301: ix0:rxq13                1346608         55
irq302: ix0:rxq14                1760021         71
irq303: ix0:rxq15                2232016         90
irq304: ix0:rxq16                      0          0
irq305: ix0:rxq17                      0          0
irq306: ix0:rxq18                      0          0
irq307: ix0:rxq19                      0          0
irq308: ix0:rxq20                      0          0
irq309: ix0:rxq21                      0          0
irq310: ix0:rxq22                      0          0
irq311: ix0:rxq23                      0          0
irq312: ix0:aq                         2          0
Oops, you're right… it uses all 24 cores :-(
Note that I'm using 2000 UDP flows (20 source IPs and 100 destination IPs), and queues 16 to 23 seem to be unused.
Should I then force it to use only 12 queues for my use case, to avoid the NUMA mess?
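One possible explanation for the idle queues, not confirmed anywhere in this thread, is the width of the RSS redirection table entries. The sketch below assumes a 128-entry table whose entries hold a 4-bit queue index, which is my understanding of the 82599 and should be checked against the datasheet. The names `build_reta` and `reachable_queues` are hypothetical, and the round-robin fill is a simplification rather than the driver's actual mapping code.

```python
RETA_ENTRIES = 128     # assumed redirection-table size
RETA_ENTRY_BITS = 4    # assumed width of the queue index stored in each entry


def build_reta(num_queues):
    """Fill the table round-robin, truncating each index to the entry width."""
    mask = (1 << RETA_ENTRY_BITS) - 1
    return [(i % num_queues) & mask for i in range(RETA_ENTRIES)]


def reachable_queues(num_queues):
    """Return the sorted set of queue indices that RSS can ever select."""
    return sorted(set(build_reta(num_queues)))


# With 24 configured queues, only the low 4 bits survive in each entry.
print(reachable_queues(24))
print(reachable_queues(12))
```

Under these assumptions, configuring more than 16 RX queues cannot spread RSS traffic any further, which would be consistent with rxq16 through rxq23 showing zero interrupts.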
We're seeing instability on X520-QDA1 (QSFP+):
- Occasionally, ifconfig will report "ifconfig: ix0: no media types?" message instead of the media type.
- System will "partially" hang on kldunload after the interface has been configured. After this occurs, you can ssh into the system through another interface and the FS is accessible, but ifconfig, top etc. are hung.
Created D13096 to fix the overallocation of queues. However, even with 24 queues, there should have been two per core (one per thread with HTT), not one per core across the two sockets.
With D13096 I've got only 12 queues assigned, no more NUMA mess :-)
ix0: <Intel(R) PRO/10GbE PCI-Express Network Driver> port 0x2020-0x203f mem 0x91d00000-0x91dfffff,0x91e04000-0x91e07fff irq 44 at device 0.0 numa-domain 0 on pci5
ix0: using 2048 tx descriptors and 2048 rx descriptors
ix0: msix_init qsets capped at 32
ix0: pxm cpus: 12 queue msgs: 63 admincnt: 1
ix0: using 12 rx queues 12 tx queues
ix0: Using MSIX interrupts with 13 vectors
ix0: allocated for 12 queues
ix0: allocated for 12 rx queues
ix0: Ethernet address: 24:6e:96:5b:92:80
ix0: PCI Express Bus: Speed 5.0GT/s Width x8
ix0: netmap queues/slots: TX 12/2048, RX 12/2048
But the queues don't seem to be bound to cores (they jump to other cores, creating a large standard deviation in the bench results), so I had to use an affinity script for the best results:
[root@r630]~# service ix_affinity onestart
Bind ix0 IRQ 288 to CPU 0
Bind ix0 IRQ 289 to CPU 1
Bind ix0 IRQ 290 to CPU 2
Bind ix0 IRQ 291 to CPU 3
Bind ix0 IRQ 292 to CPU 4
Bind ix0 IRQ 293 to CPU 5
Bind ix0 IRQ 294 to CPU 6
Bind ix0 IRQ 295 to CPU 7
Bind ix0 IRQ 296 to CPU 8
Bind ix0 IRQ 297 to CPU 9
Bind ix0 IRQ 298 to CPU 10
Bind ix0 IRQ 299 to CPU 11
Bind ix1 IRQ 301 to CPU 0
Bind ix1 IRQ 302 to CPU 1
Bind ix1 IRQ 303 to CPU 2
Bind ix1 IRQ 304 to CPU 3
Bind ix1 IRQ 305 to CPU 4
Bind ix1 IRQ 306 to CPU 5
Bind ix1 IRQ 307 to CPU 6
Bind ix1 IRQ 308 to CPU 7
Bind ix1 IRQ 309 to CPU 8
Bind ix1 IRQ 310 to CPU 9
Bind ix1 IRQ 311 to CPU 10
Bind ix1 IRQ 312 to CPU 11
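The contents of the ix_affinity script are not shown in this thread. As a rough illustration of the idea only, the sketch below parses `vmstat -ai` style lines and emits one `cpuset -l <cpu> -x <irq>` command per RX queue interrupt, pinning queue N to CPU N modulo the CPU count. The helper name `affinity_commands` and the one-queue-per-CPU policy are assumptions, not the actual script.

```python
import re

# Matches lines such as "irq288: ix0:rxq0 303008 12"; the admin queue
# ("ix0:aq") is intentionally not matched.
IRQ_LINE = re.compile(r"^irq(\d+):\s+(\w+):rxq(\d+)\b")


def affinity_commands(vmstat_output, ncpus):
    """Build cpuset commands binding RX queue N's interrupt to CPU N % ncpus."""
    cmds = []
    for line in vmstat_output.splitlines():
        m = IRQ_LINE.match(line.strip())
        if m:
            irq, queue = m.group(1), int(m.group(3))
            cmds.append(f"cpuset -l {queue % ncpus} -x {irq}")
    return cmds


sample = """\
irq288: ix0:rxq0 303008 12
irq289: ix0:rxq1 60786 2
irq312: ix0:aq 2 0
"""
for cmd in affinity_commands(sample, ncpus=12):
    print(cmd)
```

The commands are only printed here; running them requires root, and the bindings do not persist across reboots unless reapplied at startup.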
The new bench results are now:
x head r325763: inet4 packets-per-second
+ head r325763 with D11727: inet4 packets-per-second
* head r325763 with D11727 and D13096 : inet4 packets-per-second
% head r325763 with D11727 and D13096 and irq/cpu affinity script : inet4 packets-per-second
[ministat distribution plot]
    N           Min           Max        Median           Avg        Stddev
x   5       4928969       5055535       4972662     4982311.4     46150.706
+   5       3294637       3409112       3362706       3355796     41123.949
Difference at 95.0% confidence
        -1.62652e+06 +/- 63748
        -32.6458% +/- 1.06702%
        (Student's t, pooled s = 43709.6)
*   5       2304760       4947464       4651496     4233081.8       1086987
No difference proven at 95.0% confidence
%   5       4020134       4596013       4547350     4433036.8     238408.52
Difference at 95.0% confidence
        -549275 +/- 250429
        -11.0245% +/- 5.00741%
        (Student's t, pooled s = 171710)
-> Only an 11% performance drop with D11727 + D13096 + forced affinity
Hrm, I would have expected some affinity-related errors in the dmesg in this case. I've updated D12446 to HEAD, and it may help with the affinity issues (or it may log errors pointing to the issue).
Adding a temporary block until Monday to give us time to do live fire tests in production.