AIM (adaptive interrupt moderation) was part of the BSD11 driver, but the feature was lost during the IFLIB migration. This change re-introduces AIM into the IFLIB-based IXGBE driver.
One caveat: in the BSD11 driver, a queue comprised both an Rx and a Tx ring. Starting with BSD12, Rx and Tx have their own queues and rings, and the IRQ is now configured only on the Rx side. So when AIM is re-enabled, we should consider only Rx stats when programming the EITR register, in contrast to BSD11, where both Rx and Tx stats were used to manipulate EITR.
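For illustration, here is a minimal sketch of the Rx-only AIM update, modeled on the BSD11 logic but fed only by Rx statistics as described above. Function and field names (ixgbe_perform_aim(), que->rxr, que->msix) are illustrative and may differ from the actual patch.

```c
static void
ixgbe_perform_aim(struct adapter *sc, struct ix_rx_queue *que)
{
	struct rx_ring *rxr = &que->rxr;
	uint32_t newitr;

	/* Nothing received since the last interrupt: leave EITR alone. */
	if (rxr->bytes == 0 || rxr->packets == 0)
		goto reset;

	/* Average bytes per packet over the last interrupt interval. */
	newitr = rxr->bytes / rxr->packets;
	newitr += 24;			/* frame and CRC overhead */
	newitr = min(newitr, 3000);	/* clamp the interval */

	/* Be gentler in the mid range, more aggressive elsewhere. */
	if (newitr > 300 && newitr < 1200)
		newitr /= 3;
	else
		newitr /= 2;

	/* Newer MACs take the interval with the write-disable bit set. */
	newitr |= IXGBE_EITR_CNT_WDIS;

	/* Program the per-vector throttle register. */
	IXGBE_WRITE_REG(&sc->hw, IXGBE_EITR(que->msix), newitr);

reset:
	/* Restart the averaging window. */
	rxr->bytes = 0;
	rxr->packets = 0;
}
```

Note the divide is guarded by a zero check; that guard comes up again in the inline review comments below.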
Details
Unit-tested with the AIM feature enabled on NetApp platforms with physical NICs.
Diff Detail

- Lint: Lint Passed
- Unit: No Test Coverage
- Build Status: Buildable 34988, Build 31981: arc lint + arc unit
Event Timeline
Can you provide a justification in terms of performance and performance stability? This was dropped because it was believed to be unnecessary with the newer driver software queuing architecture.
At NetApp, in a 64K sequential-write load test with one of our proprietary applications, we noticed that the average LRO packet/segment size gets shorter. We also observed that the average LRO segment size is governed by the interrupt rate and throttling. For our application to run well, we need the average LRO segment size to be on the higher side. With AIM enabled (and with another Intel patch), the average LRO size now comes close to the BSD11 average LRO segment size for the same application.
So we are introducing AIM back with the default being disabled; it is business as usual for the rest of the community.
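For reference, the BSD11 driver spelled the opt-in knob roughly like this; whether this revision reuses the same tunable name is an assumption.

```c
/* hw.ix sysctl node, as declared by the driver. */
static SYSCTL_NODE(_hw, OID_AUTO, ix, CTLFLAG_RD, 0,
    "IXGBE driver parameters");

/* AIM is off by default; opt in via hw.ix.enable_aim. */
static int ixgbe_enable_aim = FALSE;
SYSCTL_INT(_hw_ix, OID_AUTO, enable_aim, CTLFLAG_RWTUN,
    &ixgbe_enable_aim, 0, "Enable adaptive interrupt moderation");
```

With CTLFLAG_RWTUN the knob can be set both at runtime via sysctl and at boot from loader.conf, so the default-off behavior is easy to override per machine.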
I would rather see this fixed by implementing LRO mbuf sorting in iflib, a la cxgbe and mlx5en. It is a small code change, like https://freshbsd.org/commit/freebsd/src/317041. Would you be willing to look at this, or to discuss with Intel doing it on your behalf? A sketch of the pattern follows.
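For context, here is roughly what that pattern could look like in an iflib Rx loop. The ifr_lc field name and the surrounding loop are assumptions about iflib internals, not the referenced commit verbatim; tcp_lro_queue_mbuf() and tcp_lro_flush_all() are the real tcp_lro(4) entry points the mlx5en change uses.

```c
#if defined(INET) || defined(INET6)
	/*
	 * RSS-assisted LRO ("mbuf sorting"): instead of calling
	 * tcp_lro_rx() per packet, queue each mbuf and let the flush
	 * sort the whole batch by flow before merging.
	 */
	if (lro_enabled && rxq->ifr_lc.lro_mbuf_max != 0) {
		tcp_lro_queue_mbuf(&rxq->ifr_lc, m);
		continue;
	}
	/* Fallback: classic per-packet LRO. */
	if (lro_enabled && tcp_lro_rx(&rxq->ifr_lc, m, 0) == 0)
		continue;
#endif
	ifp->if_input(ifp, m);

	/* ...and after the loop: sort, merge, and pass up everything queued. */
	tcp_lro_flush_all(&rxq->ifr_lc);
```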
Note that I tried to implement RSS-assisted LRO in iflib ("mbuf sorting") a while back. The first step is undoing some of the mbuf queueing that happens in iflib. When I did so, @olivier's fleet of terrible machines showed regressions in packet-forwarding performance, and that is where I paused.
Also note that NetApp's problem is likely different from the problems experienced by a CDN. In the CDN case, mbuf sorting is helpful, as it allows aggregation of packets from the same TCP session that have many intermediate packets between them. This is common for a CDN, where you have tens or hundreds of thousands of connections. My assumption is that NetApp has just a few active connections, so mbuf sorting won't help them at all; all they need is simply a longer interrupt timer.
Approving for just myself for now until the discussion regarding AIM / LRO is wrapped up.
A few more thoughts based on a conversation I had with somebody: in order for mbuf sorting to be effective, you need to receive a large number of packets per IRQ. Sorting a dozen mbufs is not useful when they are spread over, say, 64 different connections; that averages less than one packet per flow, so there is nothing to merge. So AIM (or very large IRQ coalescing timer/packet settings) is required to get enough packets per IRQ to make mbuf sorting worthwhile.
To be clear, I think mbuf sorting is something we want and something that could be helpful. I just think it may not be the answer to NetApp's problem, and it's not fair to suggest it as an alternative to AIM. In fact, the proposed restoration of AIM can almost be viewed as a prerequisite for mbuf sorting.
@gallatin I don't have the resources to fix it, so I'll just say that one of the things we were trying to do is get away from expert knob-turning for common use cases, which I would lump both a NetApp filer and a CDN into for FreeBSD. iflib is supposed to not needlessly rearm interrupts while it is processing and there is no more work, so we should already have some moderation of interrupts, or we need to add a callback into the framework so hardware like this can further modulate if necessary. I don't like bringing back a driver-only hack instead of solving this holistically. Nonetheless, if I am the only person who cares about that kind of thing, my assumptions need to be retuned and I'll go away.
At least on the real hardware I've used with iflib, with both ix and ixl, one needs to adjust IRQ moderation manually in order to get a decent number of packets per IRQ. I honestly think AIM might be the best path toward fewer tuning knobs, if it were the default with a wider range of minimum and maximum IRQ rates.
And I do think that we need to enable Hans's RSS-assisted LRO. I just think it's not the solution here.
sys/dev/ixgbe/if_ix.c

| Line | Comment |
| --- | --- |
| 2119 | Rather than a goto, I think it would be nicer to lift this code into a separate function that gets called here. |
| 2139 | newitr is always 0 here. I am nervous about the lack of atomicity with respect to rxr->packets and rxr->bytes. They are 32-bit values and could conceivably overflow, resulting in a division by zero. I would suggest loading the values with atomic_load_int() before doing any math with them, to ensure that you're operating on a consistent snapshot of the values. |
| 2164 | I believe these counters may be modified concurrently. If so, these updates may be lost. Is that acceptable? If so, I think it's worth a comment. |
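For concreteness, the suggestion at 2139 would look something like the following sketch; the counter field types are assumptions.

```c
/*
 * Snapshot the counters once before doing any math so a concurrent
 * update cannot change the divisor between the check and the divide.
 */
uint32_t bytes = atomic_load_int(&rxr->bytes);
uint32_t packets = atomic_load_int(&rxr->packets);

if (packets != 0)
	newitr = bytes / packets;
```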
sys/dev/ixgbe/if_ix.c

| Line | Comment |
| --- | --- |
| 2119 | This is old BSD11 code, and in my opinion there is no harm in keeping it. Wouldn't a new function called in interrupt context be a little unwarranted overhead? |
| 2139 | Why do you think newitr will be 0? In every possible case we have bytes >= packets, so at worst newitr is 1; I do not see how it can be 0. Also, these variables are u32, and we disable the IRQ, process the rxq, and then re-enable the IRQ. The bytes/packets counters are incremented while we process under a disabled IRQ, so the fact that we are in the IRQ routine means they are a stable snapshot covering the interval up until the previous IRQ. I do not see a possible race between IRQs and Rx processing, since we do Rx processing with the IRQ disabled. Also, there was no atomicity in BSD11 either. |
| 2164 | No, these counters do not get modified concurrently. As said in the comment above, we disable the IRQ, process the Rx queue, and re-enable the IRQ. |
This is the patch Sai mentioned: https://reviews.freebsd.org/D27465. The problem is that, at least for some adapters, we have no influence on the number of RX descriptors written back at once by the HW. Using interrupts with moderation allows aggregating more fragments in a single LRO transaction and makes processing less CPU-intensive. It still requires more testing with different HW and workloads, so we made it configurable with a sysctl.
In this case interrupts are not rearmed: iflib_rxd_avail always finds at least one ready descriptor, so the RX task is re-scheduled. But on the next run it finds only 2-3 descriptors to process, which makes LRO quite ineffective.
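A simplified view of the behavior being described, paraphrasing iflib's Rx task; this is a sketch under assumed iflib internals, not the exact code.

```c
/*
 * If any descriptor is ready, the task is re-enqueued instead of the
 * interrupt being re-enabled, so with continuous slow HW write-back
 * each pass may see only a handful of descriptors.
 */
static void
_task_fn_rx(void *context)
{
	iflib_rxq_t rxq = context;
	int budget = 16;		/* illustrative budget */

	if (iflib_rxeof(rxq, budget))	/* more work pending? */
		GROUPTASK_ENQUEUE(&rxq->ifr_task);
	else
		IFDI_RX_QUEUE_INTR_ENABLE(rxq->ifr_ctx, rxq->ifr_id);
}
```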
> ...or we need to add a callback into the framework so hardware like this can further modulate if necessary. I don't like bringing back a driver-only hack instead of solving this holistically.
I like the idea, but I think we should first test more with different HW and, if possible, with real-life workloads.
sys/dev/ixgbe/if_ix.c

| Line | Comment |
| --- | --- |
| 2119 | Most of the code in this function is for dealing with AIM, an optional feature. IMO it is simply easier to read the interrupt handler code if the AIM code is in a separate function. It will be inlined by the compiler, so there is no runtime difference. I don't think "it was this way before" is a good argument against making changes. |
| 2139 | I mean, we have: newitr = 0; ... newitr = max(newitr, ...); so the max() is redundant. If it is possible for rxr->bytes / rxr->packets to be negative (I presume not), then this should be using imin() instead. With respect to the counters, I see that rxr->packets and rxr->bytes are incremented from iflib_rxeof(), which indeed disables queue interrupts, so I think you're right. This deserves a comment IMO. |
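Given that conclusion, the fix is mostly documentation plus dropping the redundant max(); roughly as below, with the comment wording being illustrative.

```c
/*
 * rxr->bytes and rxr->packets are only advanced from iflib_rxeof(),
 * which runs with this queue's interrupt disabled, so plain reads
 * here see a consistent snapshot; no atomics are needed.
 */
if (rxr->bytes != 0 && rxr->packets != 0)
	newitr = rxr->bytes / rxr->packets;	/* redundant max() dropped */
```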
Review https://reviews.freebsd.org/D27465 helps improve latency overall and has no dependency on this revision.
For NetApp, reviews D27465 and D27344 together helped reduce latency by ~1.2 ms.