Details

Reviewers

markj
grehan
cc
gallatin
imp
olivier

Group Reviewers

Intel Networking
Restricted Owners Package	(Owns No Changed Paths)
pfsense

Commits

rGa527aa7a7f62: e1000: Re-add AIM
rG49f12d5b38f6: e1000: Re-add AIM
rG3e501ef89667: e1000: Re-add AIM

Summary

We originally left this out because iflib modulates interrupts and
accomplishes some level of batching versus the custom queues in the
older driver. Upon more detailed study of the Linux driver, which has a
newer implementation, it finally became clear to me that this is
actually a holdoff timer and not an interrupt limit, even though it is
conventionally (statically) programmed and displayed as an interrupt
rate. The data sheets also make this somewhat clear.

Thus, AIM accomplishes two beneficial things for a wide variety of
workloads[1]:

1. At low throughput/packet rates, it will significantly lower latency
(by counter-intuitively "increasing" the interrupt rate; this is better
thought of as decreasing the holdoff timer, because you will modulate
down before coming anywhere near these interrupt rates).
2. At bulk data rates, it is tuned to achieve a lower interrupt rate
(by increasing the holdoff timer) than the current static 8000/s. This
decreases processing overhead and yields more headroom for other work
such as packet filters or userland.

For a single NIC this might be worth a few sys% on common CPUs, but it
may be meaningful when multiplied across several NICs, as in if_lagg,
if_bridge and forwarding setups.

The AIM algorithm was re-introduced from the older igb or out of tree
driver, and then modernized with permission to use Intel code from other
drivers.

I have retroactively added it to lem(4) and em(4) where the same concept
applies, albeit to a single ITR register.
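
As a rough illustration of the holdoff-timer view, the sketch below converts between an interrupt rate and an em(4)-style ITR value. It assumes one ITR count equals 256 ns of holdoff, which comes from Intel's 8254x documentation rather than from this review; igb(4) uses a different unit and is not covered. The function names are made up for this example.

```shell
#!/bin/sh
# Sketch only: rate <-> ITR conversion for an em(4)-style ITR register.
# ASSUMPTION: one ITR count is 256 ns of holdoff (not stated in this review).
ITR_UNIT_NS=256

# rate_to_itr RATE: ITR value whose holdoff caps interrupts at about RATE/s.
rate_to_itr() {
    echo $((1000000000 / ($1 * $ITR_UNIT_NS)))
}

# itr_to_rate ITR: interrupt-rate ceiling implied by a given ITR value.
itr_to_rate() {
    echo $((1000000000 / ($1 * $ITR_UNIT_NS)))
}

# itr_to_holdoff_ns ITR: the holdoff interval itself, in nanoseconds.
itr_to_holdoff_ns() {
    echo $(($1 * $ITR_UNIT_NS))
}
```

Under that assumption, `rate_to_itr 8000` prints 488 and `itr_to_rate 195` prints 20032, so a smaller register value means a shorter holdoff and a higher possible interrupt rate.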

[1]: http://iommu.com/datasheets/ethernet/controllers-nics/intel/e1000/gbe-controllers-interrupt-moderation-appl-note.pdf

Test Plan

Tested on a variety of em(4) and igb(4) hardware.

Diff Detail

Repository

rG FreeBSD src repository

Event Timeline

kbowling created this revision. Sep 24 2024, 8:28 AM

Owners added a reviewer: Restricted Owners Package. Sep 24 2024, 8:28 AM

kbowling requested review of this revision. Sep 24 2024, 8:28 AM

kbowling added a reviewer: olivier. Sep 24 2024, 6:15 PM

Thanks for adding me as one of the reviewers. I will look at this patch and will most likely test it on one of the machines in Emulab.

Rebased on main, with some small improvements and bug fixes. Upon more testing, the reimported algorithm is tuned for igb and is less governed than intended on lem/em due to a different unit of measure in the ITR register. I need to think a little about how I would like to handle that.

OK. I will delay my testing on em until you have made further progress. Thanks.

kbowling updated this revision to Diff 143862. Sep 28 2024, 11:27 AM

@cc this code works well in my testing. There are now some quality-of-life improvements: you can switch modes at runtime, in the middle of a test. I run a tmux session with three splits: one running systat -vmstat, one running the benchmark (iperf3 or whatever), and one to toggle sysctl dev.{em,igb}.<interface number>.enable_aim=<N>, where the values of <N> are described below. You can also run something like sysctl dev.igb.0 | grep _rate to see the current queue values.

Existing static 8000 int/s behavior (how the driver is in main):

sysctl dev.igb.0.enable_aim=0

Suggested new default, you will boot in this mode with this patch:

sysctl dev.igb.0.enable_aim=1

Low latency option of above algorithm (up to 70k ints/s):

sysctl dev.igb.0.enable_aim=2

ixl(4) algorithm bodged in that would need to be cleaned up:

sysctl dev.igb.0.enable_aim=3

I would be curious to know what you find with these different options in an array of testing and I will use the results to ready this for actual use.
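
The mode comparison described above could be scripted roughly as follows. This is only a sketch: aim_sweep, run, DRY_RUN and SETTLE are hypothetical names, the settle time is an arbitrary choice, and mode 3 is left out because it is a test-only bodge. It relies only on the sysctl invocations already shown.

```shell
#!/bin/sh
# Sketch only: step one interface through enable_aim modes 0, 1 and 2 and
# print its per-queue interrupt rates after each change.
# aim_sweep, run, DRY_RUN and SETTLE are hypothetical, not part of the patch.

# run CMD...: execute a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# aim_sweep DRIVER UNIT: e.g. "aim_sweep igb 0" or "aim_sweep em 2".
aim_sweep() {
    dev="dev.$1.$2"
    for mode in 0 1 2; do
        run sysctl "$dev.enable_aim=$mode"
        # Give the adaptive algorithm time to settle under the current load.
        run sleep "${SETTLE:-5}"
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "+ sysctl $dev | grep _rate"
        else
            sysctl "$dev" | grep _rate
        fi
    done
}
```

Setting DRY_RUN=1 before calling aim_sweep em 2 prints the commands without touching the driver. Without DRY_RUN, it is meant to run in one pane while a benchmark such as iperf3 is active in another.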

sys/dev/e1000/if_em.c
1512	This is a bodge just for testing purposes, to see if it is a reasonable starting point if the other algorithm cannot be obtained. Enable with: sysctl dev.igb.0.enable_aim=3
1552	@erj the tuning values here are Intel code, presumably we'd need permission to use it?

kbowling added inline comments. Sep 28 2024, 10:56 PM

sys/dev/e1000/if_em.c
1552	Authors: Auke Kok <auke-jan.h.kok@intel.com>, Bruce Allan <bruce.w.allan@intel.com>, Jesse Brandeburg <jesse.brandeburg@intel.com>

I did not see any rate change from the sysctl. Please let me know if this hardware does not support the new change.

root@s1:~ # sysctl dev.em.2.enable_aim=0
dev.em.2.enable_aim: 0 -> 0
root@s1:~ # sysctl dev.em.2 | grep _rate
dev.em.2.queue_rx_0.interrupt_rate: 20032
dev.em.2.queue_tx_0.interrupt_rate: 20032
root@s1:~ # sysctl dev.em.2.enable_aim=1
dev.em.2.enable_aim: 0 -> 1
root@s1:~ # sysctl dev.em.2 | grep _rate
dev.em.2.queue_rx_0.interrupt_rate: 20032
dev.em.2.queue_tx_0.interrupt_rate: 20032
root@s1:~ # sysctl dev.em.2.enable_aim=2
dev.em.2.enable_aim: 1 -> 2
root@s1:~ # sysctl dev.em.2 | grep _rate
dev.em.2.queue_rx_0.interrupt_rate: 20032
dev.em.2.queue_tx_0.interrupt_rate: 20032
root@s1:~ #

root@s1:~ # ifconfig em2
em2: flags=1008843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST,LOWER_UP> metric 0 mtu 1500
	options=4e504bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,LRO,VLAN_HWFILTER,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6,HWSTATS,MEXTPG>
	ether 00:04:23:b7:42:4e
	inet 10.1.1.2 netmask 0xffffff00 broadcast 10.1.1.255
	media: Ethernet 1000baseT <full-duplex>
	status: active
	nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
root@s1:~ # pciconf -lv
...
em2@pci0:9:4:0: class=0x020000 rev=0x01 hdr=0x00 vendor=0x8086 device=0x1010 subvendor=0x8086 subdevice=0x1012
    vendor     = 'Intel Corporation'
    device     = '82546EB Gigabit Ethernet Controller (Copper)'
    class      = network
    subclass   = ethernet

This looks to me like it is working. The algorithm is dynamic, and 20k is the latency-reducing rate for an idle queue. At enable_aim=0, you would see 8000. What happens if you place a bulk load through it, such as iperf3? It should drop down to 4k.

I see. During the iperf traffic, the interrupt rate is 8k with enable_aim=0 and dynamically settles at 4k with enable_aim=1 or 2. When idle, it is 20k with enable_aim=1 and 71k with enable_aim=2. However, none of enable_aim=0, 1, or 2 improves the iperf performance (570 Mbits/sec out of the 1 Gbps line rate) with the siftr module loaded.
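
The rates reported here can also be read as holdoff intervals, which may make the idle and bulk figures easier to compare. The sketch below annotates interrupt_rate lines with the interval each rate implies (1e6 / rate, in microseconds). It assumes the interrupt_rate sysctl reflects the ceiling set by the current holdoff, as the discussion suggests; rates_to_holdoff is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: read "sysctl dev.<drv>.<unit> | grep _rate" output on stdin
# and append the holdoff interval each interrupt rate implies.
# rates_to_holdoff is a hypothetical helper, not part of the patch.
rates_to_holdoff() {
    awk -F': ' '/interrupt_rate/ && $2 > 0 {
        # holdoff in microseconds = 1,000,000 / interrupts per second
        printf "%s: %d ints/s (~%.1f us holdoff)\n", $1, $2, 1000000 / $2
    }'
}
```

It could be used as sysctl dev.em.2 | grep _rate | rates_to_holdoff. A rate of 8000 corresponds to a 125.0 us holdoff and 4000 to 250.0 us.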

If I recall correctly, these machines are Pentium 4 era and pretty CPU constrained. You can try the tunable 'hw.em.unsupported_tso=1' and then enable TSO on the interface to get some more bulk bandwidth; they are stable with TSO.

Are you able to detect any improvements or regressions otherwise? Ping-pong time at a low packet rate between two systems, both set to each of enable_aim=0, 1, and 2, would be interesting.

My test results are on my wiki page, testD46768.

The round-trip latency from ping shows a significant improvement between enable_aim values 1 and 2.
The bulk bandwidth using iperf3 with siftr loaded shows a small, insignificant improvement (+2.7%).
No regressions found.

I have no problem with this patch after testing it in Emulab. The test results are in my comment above.

kbowling added inline comments. Oct 9 2024, 9:59 PM

sys/dev/e1000/if_em.c
1552	Intel Networking ping. I need your help.

erj added inline comments. Oct 10 2024, 9:38 PM

sys/dev/e1000/if_em.c
1552	It looks like this algorithm originated from Intel, so it's safe to include here as long as the Intel copyright is included in this file. Which it should've been in the first place... I'm not sure how/why it got removed.