Instead, collect a batch of the received mbufs and push them into upper stack with temporary unlocked rq. PR: 281469 Reviewed by: Ariel Ehrenberg <aehrenberg@nvidia.com>
Details
Diff Detail
- Repository
- rG FreeBSD src repository
- Lint
Lint Skipped - Unit
Tests Skipped
Event Timeline
I'm very afraid there will be performance implications due to new cache misses here from queing mbufs twice. On tens of thounsands of interfaces running over 8 years, we've never hit a deadlock from this lock, and I don't think fixing this is important enough to hurt performance for.
Instead of doing this, can you remove the lock entirely, change it to a sleepable lock, or reduce its scope instead? From some reading I did last week at the bug session, I gather that it exists only to guard against 2 rare conditions (doing work later because something failed to restock,the ring, and bringing the interface down).
No, the lock cannot be sleepable because the processing occurs in the context of the interrupt thread.
I would implemented something with blockcount_t or even epoch, but then I realized that it would not help. blockcount cannot be used because rx memory is not type-stable. Driver-private epoch might work, but then note that the second reported backtrace in the PR 281368 shows ip stack acquiring sleepable lock. So even if I try to fix driver, the stack still tries to sleep in ip_input().
I suspect you (Netflix) did not see the deadlock because you either do not use ipv6 or use it in situation with static network configuration. The problems are visible when multicast group membership is changed, at least this is what I see in the PR.
sys/dev/mlx5/mlx5_en/mlx5_en_rx.c | ||
---|---|---|
516 | Can ether_input() be combined with software lro? |
We do have a static network config in most places.
At any rate, let's see what we can do to remove the lock entirely. What is this mutex protecting? In looking at the code, the only thing that can be called outside the interrupt context is mlx5e_post_rx_wqes() which is done in a callout if it fails in the interrupt context, so I guess its protecting the state for that?
Rather than than holding a mutex for this rare condition (I have not seen the rq callout called in hours of 100GbE load on a test box), why not just:
- If we posted *any* wqes:
- continue as normal and do not schedule the callout, as we'll be called again
- If we did not post any wqes:
- disable the irq and schedule the callout at EOI
- the callout function becomes a wrapper that calls mlx5e_post_rx_wqes() and re-enables irqs if it succeeds