We originally left this out because iflib modulates interrupts and accomplishes some level of batching versus the custom queues in the older driver. Upon more detailed study of the Linux driver which has a newer implementation, it finally became clear to me this is actually a holdoff timer and not an interrupt limit as it is conventionally (statically) programmed and displayed as an interrupt rate. The data sheets also make this somewhat clear. Thus, AIM accomplishes two beneficial things for a wide variety of workloads[1]: 1. At low throughput/packet rates, it will significantly lower latency (by counter-intuitively "increasing" the interrupt rate.. better thought of as decreasing the holdoff timer because you will modulate down before coming anywhere near these interrupt rates). 2. At bulk data rates, it is tuned to achieve a lower interrupt rate (by increasing the holdoff timer) than the current static 8000/s. This decreases processing overhead and yields more headroom for other work such as packet filters or userland. For a single NIC this might be worth a few sys% on common CPUs, but may be meaningful when multiplied such as if_lagg, if_bridge and forwarding setups. The AIM algorithm was re-introduced from the older igb or out of tree driver, and then modernized with permission to use Intel code from other drivers. I have retroactively added it to lem(4) and em(4) where the same concept applies, albeit to a single ITR register. [1]: http://iommu.com/datasheets/ethernet/controllers-nics/intel/e1000/gbe-controllers-interrupt-moderation-appl-note.pdf
Details
- Reviewers
markj grehan cc gallatin imp olivier - Group Reviewers
Intel Networking Restricted Owners Package (Owns No Changed Paths) pfsense - Commits
- rGa527aa7a7f62: e1000: Re-add AIM
rG49f12d5b38f6: e1000: Re-add AIM
rG3e501ef89667: e1000: Re-add AIM
Tested on a variety of em(4) and igb(4)
Diff Detail
- Repository
- rG FreeBSD src repository
- Lint
Lint Skipped - Unit
Tests Skipped
Event Timeline
Thanks for adding me as one of the reviewers. I will look at this patch and more likely test it in one of the machines in Emulab.
Rebase on main and some small improvements and bug fixes. Upon more testing the reimported algorithm is tuned for igb and less governed than intended on lem/em due to a different unit of measure on the ITR register. Need to think a little on how I would like to handle that.
@cc this code works well in my testing. There are now some quality of life improvements, at runtime you can now switch in the middle of a test. I run a tmux session with three splits, one of systat -vmsat, one of the benchmark (iperf3 or whatever), and one to either toggle sysctl dev.{em,igb}.<interface number>.enable_aim=<N> where <N> description which follows. You can also do something like sysctl dev.igb.0 | grep _rate to see the current queue values.
Existing static 8000 int/s behavior (how the driver is in main):
sysctl dev.igb.0.enable_aim=0
Suggested new default, you will boot in this mode with this patch:
sysctl dev.igb.0.enable_aim=1
Low latency option of above algorithm (up to 70k ints/s):
sysctl dev.igb.0.enable_aim=2
ixl(4) algorithm bodged in that would need to be cleaned up:
sysctl dev.igb.0.enable_aim=3
I would be curious to know what you find with these different options in an array of testing and I will use the results to ready this for actual use.
sys/dev/e1000/if_em.c | ||
---|---|---|
1510 | This is a bodge just for testing purposes to see if is a reasonable starting point if the other algorithm cannot be obtained. Enable with: sysctl dev.igb.0.enable_aim=3 | |
1550 | @erj the tuning values here are Intel code, presumably we'd need permission to use it? |
sys/dev/e1000/if_em.c | ||
---|---|---|
1550 | Authors: |
I didn't find any rate change by the sysctl. Please let me know if the hardware does not support this new change.
root@s1:~ # sysctl dev.em.2.enable_aim=0
dev.em.2.enable_aim: 0 -> 0
root@s1:~ # sysctl dev.em.2 | grep _rate
dev.em.2.queue_rx_0.interrupt_rate: 20032
dev.em.2.queue_tx_0.interrupt_rate: 20032
root@s1:~ # sysctl dev.em.2.enable_aim=1
dev.em.2.enable_aim: 0 -> 1
root@s1:~ # sysctl dev.em.2 | grep _rate
dev.em.2.queue_rx_0.interrupt_rate: 20032
dev.em.2.queue_tx_0.interrupt_rate: 20032
root@s1:~ # sysctl dev.em.2.enable_aim=2
dev.em.2.enable_aim: 1 -> 2
root@s1:~ # sysctl dev.em.2 | grep _rate
dev.em.2.queue_rx_0.interrupt_rate: 20032
dev.em.2.queue_tx_0.interrupt_rate: 20032
root@s1:~ #
root@s1:~ # ifconfig em2
em2: flags=1008843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST,LOWER_UP> metric 0 mtu 1500
options=4e504bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,LRO,VLAN_HWFILTER,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6,HWSTATS,MEXTPG>
ether 00:04:23:b7:42:4e
inet 10.1.1.2 netmask 0xffffff00 broadcast 10.1.1.255
media: Ethernet 1000baseT <full-duplex>
status: active
nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
root@s1:~ # pciconf -lv
...
em2@pci0:9:4:0: class=0x020000 rev=0x01 hdr=0x00 vendor=0x8086 device=0x1010 subvendor=0x8086 subdevice=0x1012
vendor = 'Intel Corporation' device = '82546EB Gigabit Ethernet Controller (Copper)' class = network subclass = ethernet
This looks to me like it is working, the algorithm is dynamic and 20k would be latency reducing idle queue. At enable_aim=0, you would see 8000. 20k looks right for an idle queue, what happens if you place a bulk load through it like iperf3? It should drop down to 4k.
I see. During the iperf traffic, I see interrupt rate 8k@enable_aim=0, 4k@enable_aim=1 or 2 dynamically. I see idle 20k@enable_aim=1 and 71k@enable_aim=2. However, none of the enable_aim=0 or 1 or 2 helps improve the iperf performance (570 Mbits/sec out of 1Gbps line rate) under loaded siftr module.
If I recall these machines are Pentium 4 era and pretty CPU constrained. You can try the tunable 'hw.em.unsupported_tso=1' and then enable TSO on the interface to get some more bulk bandwidth, they are stable with TSO.
Are you able to detect any improvements or regressions otherwise? ping-pong time at low packet rate between two systems both set with enable_aim=0,1,2 would be interesting.
If I recall these machines are Pentium 4 era and pretty CPU constrained. You can try the tunable 'hw.em.unsupported_tso=1' and then enable TSO on the interface to get some more bulk bandwidth, they are stable with TSO.
Are you able to detect any improvements or regressions otherwise? ping-pong time at low packet rate between two systems both set with enable_aim=0,1,2 would be interesting.
Here is my test result in my wiki page testD46768.
- the round trip latency from ping shows significant improvement between enable_aim value 1 and 2
- the bulk bandwidth using iperf3 under loaded siftr shows some insignificant improvement (+2.7%)
- no regression found
I have no problem with this patch after testing it in Emulab. The test result is in my above comment.
sys/dev/e1000/if_em.c | ||
---|---|---|
1550 | Intel Networking ping. I need your help. |
sys/dev/e1000/if_em.c | ||
---|---|---|
1550 | It looks like this algorithm originated from Intel so it's safe to include here as long as the Intel copyright is included in this file. Which it should've been in the first place...I'm not sure how/why it got removed. |