e1000: Re-add AIM
ClosedPublic

Authored by kbowling on Sep 24 2024, 8:28 AM.

Details

Summary
We originally left this out because iflib modulates interrupts and
accomplishes some level of batching versus the custom queues in the
older driver. Upon more detailed study of the Linux driver, which has a
newer implementation, it finally became clear to me that the ITR setting
is actually a holdoff timer and not an interrupt limit, even though it
is conventionally (statically) programmed and displayed as an interrupt
rate. The data sheets also make this somewhat clear.
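
To make the holdoff-timer reading concrete, here is a minimal C sketch of how a holdoff interval maps to the number that gets displayed as an interrupt rate. It assumes the em(4)-style ITR register counts in 256 ns units, which is an assumption taken from my reading of the datasheets and is not stated in this review. The macro and function names are hypothetical and are not the driver's.

```c
#include <stdint.h>

/* Assumption: one ITR tick is 256 ns (em(4)-class parts). */
#define ITR_TICK_NS	256u
#define NS_PER_SEC	1000000000u

/*
 * Upper bound on interrupts per second implied by a holdoff of
 * `ticks` ITR units. A value of 0 is treated here as "no moderation"
 * and reported as 0 rather than as an infinite rate.
 */
static inline uint32_t
itr_ticks_to_rate(uint32_t ticks)
{
	if (ticks == 0)
		return (0);
	return (NS_PER_SEC / (ticks * ITR_TICK_NS));
}

/* Holdoff, in ITR units, that caps interrupts at `rate` per second. */
static inline uint32_t
rate_to_itr_ticks(uint32_t rate)
{
	if (rate == 0)
		return (0);
	return (NS_PER_SEC / (rate * ITR_TICK_NS));
}
```

Under that assumption, a static 8000/s setting corresponds to a holdoff of 488 ticks (about 125 us), and a holdoff of 195 ticks (about 50 us) reads back as 20032 interrupts/s. Shortening the timer raises the displayed rate without forcing that many interrupts to occur.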

Thus, AIM accomplishes two beneficial things for a wide variety of
workloads[1]:

1. At low throughput/packet rates, it will significantly lower latency
(by counter-intuitively "increasing" the interrupt rate; this is better
thought of as decreasing the holdoff timer, because you will modulate
down before coming anywhere near these interrupt rates).
2. At bulk data rates, it is tuned to achieve a lower interrupt rate
(by increasing the holdoff timer) than the current static 8000/s. This
decreases processing overhead and yields more headroom for other work
such as packet filters or userland.

For a single NIC this might be worth a few sys% on common CPUs, but it
may be meaningful when multiplied across if_lagg, if_bridge, and
forwarding setups.

The AIM algorithm was re-introduced from the older igb or out of tree
driver, and then modernized with permission to use Intel code from other
drivers.

I have retroactively added it to lem(4) and em(4) where the same concept
applies, albeit to a single ITR register.

[1]: http://iommu.com/datasheets/ethernet/controllers-nics/intel/e1000/gbe-controllers-interrupt-moderation-appl-note.pdf

Test Plan

Tested on a variety of em(4) and igb(4) devices.

Diff Detail

Repository
rG FreeBSD src repository

Event Timeline

Owners added a reviewer: Restricted Owners Package. (Sep 24 2024, 8:28 AM)

Thanks for adding me as one of the reviewers. I will look at this patch and will most likely test it on one of the machines in Emulab.

Rebased on main, with some small improvements and bug fixes. Upon more testing, the reimported algorithm turns out to be tuned for igb and is less governed than intended on lem/em, because the ITR register there uses a different unit of measure. I need to think a little about how I would like to handle that.

OK. I will delay my testing on em until you have made further progress. Thanks.

@cc this code works well in my testing. There are now some quality of life improvements: you can switch modes at runtime, in the middle of a test. I run a tmux session with three splits: one running systat -vmstat, one running the benchmark (iperf3 or whatever), and one for toggling sysctl dev.{em,igb}.<interface number>.enable_aim=<N>, where the values of <N> are described below. You can also do something like sysctl dev.igb.0 | grep _rate to see the current queue values.

Existing static 8000 int/s behavior (how the driver is in main):

sysctl dev.igb.0.enable_aim=0

Suggested new default, you will boot in this mode with this patch:

sysctl dev.igb.0.enable_aim=1

Low latency option of above algorithm (up to 70k ints/s):

sysctl dev.igb.0.enable_aim=2

ixl(4) algorithm bodged in that would need to be cleaned up:

sysctl dev.igb.0.enable_aim=3

I would be curious to know what you find with these different options in an array of testing and I will use the results to ready this for actual use.
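
As a reading aid, the first three settings can be summarised in a small C sketch. This is not the driver's code: the enum, the function name, and the idle/bulk split are hypothetical, and the rates are the approximate figures quoted in this review (8000/s static, roughly 20k/s idle and 4k/s under bulk load, up to 70k/s in the low latency mode). Mode 3, the ixl(4) algorithm, is deliberately left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names mirroring the enable_aim values listed above. */
enum aim_mode {
	AIM_OFF = 0,		/* static rate, as the driver is in main */
	AIM_DEFAULT = 1,	/* suggested new default */
	AIM_LOW_LATENCY = 2	/* same algorithm, higher idle ceiling */
};

/*
 * Illustrative only: return an approximate target interrupt rate.
 * `bulk` stands in for whatever traffic classification the real
 * algorithm performs; the numbers are rounded from this discussion.
 */
static inline uint32_t
aim_target_rate(enum aim_mode mode, bool bulk)
{
	switch (mode) {
	case AIM_DEFAULT:
		return (bulk ? 4000 : 20000);
	case AIM_LOW_LATENCY:
		return (bulk ? 4000 : 70000);
	case AIM_OFF:
	default:
		return (8000);
	}
}
```

The point of the sketch is only that modes 1 and 2 differ in how short the holdoff may get on a quiet queue, while both fall back to the same lower rate under bulk load.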

sys/dev/e1000/if_em.c
1510

This is a bodge just for testing purposes, to see if it is a reasonable starting point if the other algorithm cannot be obtained.

Enable with:

sysctl dev.igb.0.enable_aim=3
1550

@erj the tuning values here are Intel code, presumably we'd need permission to use it?

sys/dev/e1000/if_em.c
1550

Authors:
Auke Kok <auke-jan.h.kok@intel.com>
Bruce Allan <bruce.w.allan@intel.com>
Jesse Brandeburg <jesse.brandeburg@intel.com>

I didn't see any rate change when setting the sysctl. Please let me know if the hardware does not support this new change.

root@s1:~ # sysctl dev.em.2.enable_aim=0
dev.em.2.enable_aim: 0 -> 0
root@s1:~ # sysctl dev.em.2 | grep _rate
dev.em.2.queue_rx_0.interrupt_rate: 20032
dev.em.2.queue_tx_0.interrupt_rate: 20032
root@s1:~ # sysctl dev.em.2.enable_aim=1
dev.em.2.enable_aim: 0 -> 1
root@s1:~ # sysctl dev.em.2 | grep _rate
dev.em.2.queue_rx_0.interrupt_rate: 20032
dev.em.2.queue_tx_0.interrupt_rate: 20032
root@s1:~ # sysctl dev.em.2.enable_aim=2
dev.em.2.enable_aim: 1 -> 2
root@s1:~ # sysctl dev.em.2 | grep _rate
dev.em.2.queue_rx_0.interrupt_rate: 20032
dev.em.2.queue_tx_0.interrupt_rate: 20032
root@s1:~ #

root@s1:~ # ifconfig em2
em2: flags=1008843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST,LOWER_UP> metric 0 mtu 1500
options=4e504bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,LRO,VLAN_HWFILTER,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6,HWSTATS,MEXTPG>
ether 00:04:23:b7:42:4e
inet 10.1.1.2 netmask 0xffffff00 broadcast 10.1.1.255
media: Ethernet 1000baseT <full-duplex>
status: active
nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
root@s1:~ # pciconf -lv
...
em2@pci0:9:4:0: class=0x020000 rev=0x01 hdr=0x00 vendor=0x8086 device=0x1010 subvendor=0x8086 subdevice=0x1012

vendor     = 'Intel Corporation'
device     = '82546EB Gigabit Ethernet Controller (Copper)'
class      = network
subclass   = ethernet

This looks to me like it is working: the algorithm is dynamic, and 20k is the latency-reducing rate I would expect for an idle queue. At enable_aim=0, you would see 8000. What happens if you place a bulk load through it, such as iperf3? It should drop down to 4k.

I see. During iperf traffic, the interrupt rate is 8k with enable_aim=0 and drops dynamically to 4k with enable_aim=1 or 2. When idle, it is 20k with enable_aim=1 and 71k with enable_aim=2. However, none of the enable_aim settings (0, 1, or 2) improves the iperf performance (570 Mbits/sec out of the 1 Gbps line rate) with the siftr module loaded.

If I recall correctly, these machines are Pentium 4 era and pretty CPU constrained. You can try the tunable 'hw.em.unsupported_tso=1' and then enable TSO on the interface to get some more bulk bandwidth; they are stable with TSO.

Are you able to detect any improvements or regressions otherwise? Ping-pong time at a low packet rate between two systems, both set to enable_aim=0, 1, and 2 in turn, would be interesting.

My test results are on my wiki page, testD46768.

  • the round trip latency from ping shows a significant improvement between enable_aim values 1 and 2
  • the bulk bandwidth using iperf3 with the siftr module loaded shows a small, insignificant improvement (+2.7%)
  • no regressions found

I have no problem with this patch after testing it in Emulab. The test results are in my comment above.

sys/dev/e1000/if_em.c
1550

Intel Networking ping. I need your help.

sys/dev/e1000/if_em.c
1550

It looks like this algorithm originated from Intel so it's safe to include here as long as the Intel copyright is included in this file.

Which it should've been in the first place...I'm not sure how/why it got removed.

kbowling added a reviewer: pfsense.

Remove unnecessary ixl(4) algorithm. Add Intel copyright.

This revision was not accepted when it landed; it landed in state Needs Review. (Oct 11 2024, 5:38 AM)
This revision was automatically updated to reflect the committed changes.