Would anyone be able to sanity-check this with iperf?
Due to constraints with my testing environment (a Xeon D under my desk, with two bhyve VMs, each with a passthru X552) I cannot provide great statistical results, but I can do a basic functionality test.
Checking interrupt rates: when booting with the setting disabled I see around 30k interrupts a second (vmstat -i 1). After setting sysctl dev.ix.0.enable_aim=1 on the sender, I see 80k interrupts a second. This behavior is consistent across reboots and interface resets with sysctl dev.ix.0.enable_aim=1. After sysctl dev.ix.0.enable_aim=0 and resetting the interface (ifconfig ix0 down; ifconfig ix0 up), the interrupt rate returns to 30k/s due to the initialization of EITR.
On the receiver (the iperf3 -s system), the setting limits the que interrupts to around 5k per second with sysctl dev.ix.0.enable_aim=1.
Whatever this was trying to do, it has a big impact on the sender side. @stallamr_netapp.com can you please look at this and share your results?
Kevin,
Summarizing: AIM basically decides the value of the ITR. Without AIM, the ITR value always stays at 128. With AIM enabled, the ITR value now moves between 32 and 8192. Below is the histogram on a NetApp platform (12 processors, Intel Xeon D-1557 @ 1.50GHz) from running iperf3 for 1 min.
I do not know exactly how the ITR value translates to interrupt moderation, but to the best of my knowledge this value determines a minimum gap between two consecutive interrupts. From the histogram below you can see that, with more frequent data, the ITR value concentrates around 32 and 128.
That also implies the gap between interrupts is smaller than with AIM disabled (i.e., a fixed 128). This falls in line with your observation too.
AIM Initial ITR Values:
0-7: 0 each; <16: 0; <24: 0
**<32: 369501**; <40: 1346; <48: 411; <56: 179; <64: 60
**<128: 340558**; **<192: 10331**; <256: 851; <320: 562; <384: 479; <448: 188; <512: 78
<1024: 330; <1536: 16; <2048: 11; <2560: 8; <3072: 1; <3584: 3; <4096: 6
**<8192: 101**; <12288: 5
<16384 through >=262144: 0
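To put rough numbers on the "minimum gap" interpretation above: if the EITR interval field has ~2.048 µs granularity (my assumption, based on the 82599/X550 datasheets; the exact scaling may differ), an ITR value converts into a per-queue interrupt ceiling like so:

```c
/*
 * Back-of-the-envelope translation from an ITR/EITR interval value to a
 * per-queue interrupt ceiling.  The 2.048 us granularity is an assumption,
 * not something confirmed against the driver.
 */
#include <stdio.h>

#define EITR_GRANULARITY_US	2.048	/* assumed interval granularity */

static double
itr_to_max_irq_rate(unsigned itr)
{
	double gap_us = itr * EITR_GRANULARITY_US;	/* min gap between IRQs */

	return (1e6 / gap_us);		/* max interrupts per second */
}

int
main(void)
{
	unsigned samples[] = { 32, 128, 8192 };	/* dominant histogram buckets */

	for (size_t i = 0; i < sizeof(samples) / sizeof(samples[0]); i++)
		printf("ITR %5u -> min gap %8.1f us, max ~%8.0f irq/s\n",
		    samples[i], samples[i] * EITR_GRANULARITY_US,
		    itr_to_max_irq_rate(samples[i]));
	return (0);
}
```

By that math, ITR 128 caps a queue at roughly 3.8k interrupts/s and ITR 32 at roughly 15k/s, which is at least consistent with the rates reported above once multiplied across queues.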
Mark,
Thinking more on this, let's cancel/ignore this change. In other words, keep the default setting FALSE. Anyone who wants AIM can enable it on their platform.
For instance, at NetApp we enable AIM and then fine-tune the ITR value (e.g., setting minimum and maximum bounds) so that the number of IRQs best fits our platforms. In short, beyond enabling AIM, NetApp has proprietary code to further fine-tune per platform, so we achieve the best overall performance for a given platform.
Thanks,
Sai.
@stallamr_netapp.com Sai, do you see the high rate of interrupts on the sender? If we can figure that out (which I think is important for you too), this patch to enable it by default will be ok.
When this was originally implemented, Limelight (sbruno) experienced issues with this feature, as it was conflicting with iflib in some way, so it was then set to off by default.
Yeah, I don't think it is correct anywhere in its current form, and if we can't figure that out it needs to be removed again. There are a couple of places where iflib will re-enable interrupts, and I think that is having an unexpected interaction in the combined TXRX handler: https://cgit.freebsd.org/src/tree/sys/net/iflib.c#n1557
@stallamr_netapp.com @jeffrey.e.pieper_intel.com @erj I've looked pretty hard at the committed AIM code for a couple of days and haven't figured it out yet, just a bunch of false starts/dead ends. It's not really a problem with the committed AIM code as far as I can tell; there seem to be a couple of issues in iflib that somehow interact poorly with the driver, as it is re-arming way too frequently in iflib_fast_intr_rxtx with AIM enabled on the sender (DBG_COUNTER_INC(rx_intr_enables) captures this).
I think we are also re-arming "tx" way too frequently in the same iflib_fast_intr_rxtx function, with or without AIM. For instance, if I create a null Tx queue interrupt enable function like:
```c
static int
ixgbe_if_tx_queue_intr_enable(if_ctx_t ctx, uint16_t txqid)
{
	/* Deliberately a no-op: never let iflib re-arm the Tx interrupt. */
	return (0);
}
```

and register it in the device methods table:

```c
DEVMETHOD(ifdi_tx_queue_intr_enable, ixgbe_if_tx_queue_intr_enable),
```
With that in place, I get a very low interrupt rate and smooth performance on the receiver even without moderation, and UDP PPS goes up while loss goes down. That isn't a fix, but it does illustrate that fixing this logic will address the desired problems (lower interrupt rates, better performance, larger LROs), so I'm still trying to learn how to do this better. I think fixing this would also allow the AIM code to work as expected.
These issues are a lot more complex than they appear at first glance; unless someone spots the issue quickly, can we arrange a call on it?
Some other random notes:
- I have been away from this problem space for 3 years, and a lot of my knowledge and assumptions have become outdated. Lesson learned: test the assumptions again. For instance, when I put this into service on 82574 (MSI-X em) and ixgbe, I believe we used separate TX and RX MSI-X vectors, which makes the logic a lot simpler: disable interrupts (in ixgbe_msix_que) and then re-enable them via the callback in iflib's _task_fn_tx and _task_fn_rx. That worked out very well in both synthetic and real user monitoring for a very demanding CDN workload.
- At some point (81be655266fa) this scheme was revised to share an interrupt vector and handle both tx and rx. This halves the number of MSI-X vectors needed, which can be important on systems that run out of them, but it has the effect of making the data path significantly harder to understand.
- The goal of iflib is to handle interrupts somewhat like NAPI (https://wiki.linuxfoundation.org/networking/napi); it's worth reading through that page if you're not familiar with it.
- I don't think iflib_fast_intr_rxtx is working correctly in this paradigm of not enabling interrupts while the tasks are doing useful work. The enables directly in iflib_fast_intr_rxtx are designed to handle spurious interrupts, or to enable interrupts on the queue if nothing else is going on, but they seem to be overzealous. I'm not really sure what to do about that at the moment (a minimal sketch of the intended pattern follows this list).
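For reference, here is a minimal sketch of the NAPI-style contract described above. All names are hypothetical; this is not the actual iflib code, just the shape of the pattern it is trying to follow:

```c
#include <stdbool.h>

struct queue {
	bool	task_active;	/* taskqueue is already draining this queue */
};

/* Hypothetical device/taskqueue hooks. */
void hw_disable_queue_intr(struct queue *);
void hw_enable_queue_intr(struct queue *);
void schedule_task(struct queue *);
bool process_some_descriptors(struct queue *);	/* returns true while busy */

/* Fast interrupt handler: mask the queue and hand off to the task. */
static int
fast_intr(struct queue *q)
{
	hw_disable_queue_intr(q);	/* mask at the device first */
	if (!q->task_active) {		/* real code needs atomics here */
		q->task_active = true;
		schedule_task(q);
	}
	/*
	 * Note what is NOT here: no re-enable while the task is active.
	 * The complaint about iflib_fast_intr_rxtx is that it re-arms in
	 * more cases than this, defeating the moderation setting.
	 */
	return (0);
}

/* Deferred task: poll until idle, then (and only then) re-arm. */
static void
task_fn(struct queue *q)
{
	while (process_some_descriptors(q))
		;			/* drain while there is work */
	q->task_active = false;
	hw_enable_queue_intr(q);	/* single re-arm at idle */
}
```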
Thanks Kevin. Your observations seem to fall in line with my understanding of the code. I have been digging into this to understand the performance impact NetApp is seeing after migrating to iflib-based drivers. Here are my observations so far (mainly with respect to the Intel 10G/40G drivers):
- Rx and Tx queues are still tied together. Only the processing has been separated into different taskqueues; i.e., the que->msix value is the same for RxqN and TxqN, and so on.
- IFLIB creates the IRQ only on the Rx side, but behind the scenes the Tx side gets it implicitly too. Please note: the que->msix value is the same.
- If you enable the interrupt on Rx or Tx, you are basically enabling it on both queues (Rx and Tx) implicitly (see the sketch after this list).
- With AIM enabled, we alter the ITR (interrupt throttle rate) based on RXQ data alone. So if I have heavy Rx data, my ITR value stays on the low side and I receive more interrupts, i.e., a minimal gap between interrupts.
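A tiny illustration of the third point, with invented names and a placeholder register (not the actual driver code): because RxqN and TxqN sit behind the same MSI-X vector, there is only a single enable bit for the pair.

```c
#include <stdint.h>

struct hw;				/* opaque device handle (hypothetical) */
void write_reg(struct hw *, uint32_t reg, uint32_t val);

#define REG_INTR_ENABLE_MASK	0	/* placeholder for the EIMS-style register */

static void
queue_intr_enable(struct hw *hw, uint32_t msix_vector)
{
	/* One mask bit re-arms the whole vector: Rx AND Tx together. */
	write_reg(hw, REG_INTR_ENABLE_MASK, (uint32_t)1 << msix_vector);
}
```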
Maybe:
- We modify AIM to consider Tx data too, and then take max(rx_data, tx_data). This way we can average out the ITR based on both Rx and Tx data (a rough sketch follows this list).
- In true spirit, if the real intention of IFLIB is to split the Rx and Tx queues, then we had better have unique IRQs for the Rx and Tx queues, so one cannot influence the other. But this would again increase the need/demand for IRQs on a given node.
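A rough sketch of the first suggestion. The function and parameter names are invented; the mapping from average frame size to an ITR value and the 32/8192 bounds are taken from the discussion above, not from the driver:

```c
#include <stdint.h>

/*
 * Average frame size drives the interrupt gap; let the busier direction
 * win, so heavy Tx traffic influences moderation the same way Rx does.
 */
static uint32_t
aim_newitr(uint32_t rx_bytes, uint32_t rx_packets,
    uint32_t tx_bytes, uint32_t tx_packets)
{
	uint32_t rx_avg = 0, tx_avg = 0, newitr;

	if (rx_packets != 0)
		rx_avg = rx_bytes / rx_packets;		/* avg Rx frame size */
	if (tx_packets != 0)
		tx_avg = tx_bytes / tx_packets;		/* avg Tx frame size */

	/* max(rx_data, tx_data): whichever side is busier sets the gap. */
	newitr = (rx_avg > tx_avg) ? rx_avg : tx_avg;
	if (newitr == 0)
		return (0);	/* queue idle; keep the previous setting */

	/* Clamp to platform-tunable bounds, as NetApp does downstream. */
	if (newitr < 32)
		newitr = 32;
	else if (newitr > 8192)
		newitr = 8192;
	return (newitr);
}
```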
And coming to your question, Kevin: yes, I do see a higher number of interrupts on both the sender and receiver sides. And the total number of _task_fn_tx() calls is almost the same as the number of IRQs.
Thanks,
Sai.
> In true spirit, if the real intention of IFLIB is to split the Rx and Tx queues, then we had better have unique IRQs for the Rx and Tx queues, so one cannot influence the other. But this would again increase the need/demand for IRQs on a given node.
It seems like it would be simplest to split the tx/rx interrupts and see if the moderation rate on the rx IRQ is as expected. We recently bumped the max number of MSI-X vectors in an amd64 system, so perhaps whatever decision led to tying tx/rx together could be revisited.
You would also have to contend with the device MSI-X vector limit, however. I think it's just 64 per port for ixgbe (including the link/admin interrupt), so you'd be limited to 31/32 queues.
Reading what Sai wrote, I think the best approach from an Intel perspective is to account for Tx data in the AIM calculation; we shouldn't be splitting the Tx/Rx interrupts.
I'm just not convinced splitting Tx/Rx interrupts is the answer, given how this hardware and more recent hardware are generally geared toward queues sharing interrupts. However, the reasoning for that sharing is probably different from what's being discussed here: these cards are supposed to support lots of VMs, not just a single PF that doesn't need to use all of the available queues.
@erj OK, I agree with both of you about that change to the AIM algorithm (see D30155). I just tested, and by my reading it will be necessary, but by testing it is not sufficient to fix the issue: we are still over-eager in iflib_fast_intr_rxtx with the change applied.
I checked with someone who knows the amd64 hardware well, and you are right: we should try to fit into a single MSI-X vector, since the limits can be reached on the types of systems FreeBSD is used on.
So the goal at this point, from my perspective, is to figure out how to do iflib_fast_intr_rxtx without the gratuitous re-arms. Then we can all do some testing with and without AIM and make a decision to finally merge this patch or leave it off.