Page MenuHomeFreeBSD

LinuxKPI: 802.11: make sure we can send DISASSOC or DEAUTH frames
ClosedPublic

Authored by bz on Jun 5 2024, 11:06 PM.
Tags
None
Referenced Files
Unknown Object (File)
Sep 23 2024, 5:36 PM
Unknown Object (File)
Sep 15 2024, 9:38 PM
Unknown Object (File)
Sep 12 2024, 7:23 AM
Unknown Object (File)
Aug 15 2024, 7:21 PM
Unknown Object (File)
Aug 12 2024, 9:58 PM
Unknown Object (File)
Aug 6 2024, 5:05 AM
Unknown Object (File)
Jul 30 2024, 1:15 AM
Unknown Object (File)
Jul 29 2024, 6:08 AM

Details

Summary

The "Invalid TXQ" error from iwlwifi seems to be triggered by a
frame being sent for a sta which is no longer known to the driver/fw.

While we make sure to trigger the sending of the frame in net80211
early enough (by calling (*iv_newstate)() early on rather than at
the end), TX in LinuxKPI is run in a deferred task. When we drop the
net80211 ic lock again and re-acquire the LHW lock the packet may not
yet have made it to the driver.
Work around this between the locks by making sure
(a) no new packets get queued after this, and
(b) the TX task has run or gets cancelled and we manually push any

remaining packets out (or let lsta_free() clean them up).

The disabled packet queuing now also needs to be re-enabled in
scan_to_auth() in case an lsta is staying in service or gets re-used.

Also make sure that any following lkpi_wake_tx_queues() calls no
longer ignore queues which have not seen a prior dequeue.
This former workaround "feature" (ltxq->seen_dequeue) should be
fully garbage collected in a later change on its own.

Sponsored by: The FreeBSD Foundation
MFC after: 3 days
PR: 274382

Test Plan

At first this needs to be compiled by itself.
It was extracted from a larger (totally unrelated work in progress)
given I kept hitting the problem and had the opportunity to debug
it.

I checked over the air that the disassociate frame goes out now.

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 58079
Build 54967: arc lint + arc unit

Event Timeline

bz requested review of this revision.Jun 5 2024, 11:06 PM
adrian added a subscriber: adrian.

Ah, crap. This stuff. Yeah, I hit this in ath(4) and iwn(4) way back when. It's a pain in the butt. I even think ath(4) has a workaround to drop frames in this instance because of exactly what you found.

So, we MOSTLY have the pieces to do this cleanly - one of the reasons why I went through and did the whole ieee80211_tx_complete() thing to ensure ALL frames were completed this way in all drivers was so we could eventually do stuff like this cleaner. IIRC the challenge here is the machinery to mark the VAP and/or node down is not async / split up in pieces and when it goes through iv_newstate() it happens async to the actual TX, like you found.

This is fine, it kinda sucks but I get it. This may be a good point in time to start talking about perhaps talking about pushing net80211 TX into a deferred queue with the state change callbacks in order to properly serialise all of this gunk and get rid of some of the TX locking net80211 has to do to serialise stuff. There are/were so many places where things like newstate causes frames to be sent but expecting them to happen immediately in that context and just HOPING they are queued fast enough, or vap bss nodes changing, that it drove me bonkers when I was last trying to make this all stable.

(And for context context, a lot of this stuff made sense on a non-preempt, not SMP kernel, but .. well, now we don't have that.)

(I'd write more about this, but I'm actually cleaning up key management right now, lord help me...)

This revision is now accepted and ready to land.Jun 5 2024, 11:14 PM

The actual problem here (and in a lot of places else) is a the ic lock and that we cannot sleep. But that's a separate story as well.

The comlock is also a problem. But we shouldn't need to hold the lock whilst we do things like transmit; we're only required to do that because of how the state changes happen. The early net80211 drivers had almost no locking outside of ieee80211com. I've seen what vendors have done to net80211 to "fix locking" by changing the lock types, making some sleepable, etc ... it's just terrible.

If we split up a lot of the work into pre and post phases and defer stuff into a separate task on the same taskqueue, then a lot of these problems should go away.

Initial experience with this patch is looking good. Previously I experienced the message on boot every time and occasionally afterwards; with the patch I did not get the message on boot (and haven't seen it again since).

sys/compat/linuxkpi/common/src/linux_80211.c
1092

I don't understand this comment vis a vis the lock in the next line.

Initial experience with this patch is looking good. Previously I experienced the message on boot every time and occasionally afterwards; with the patch I did not get the message on boot (and haven't seen it again since).

I am also testing this patch now, the Invalid TXQ error message doesn't show up so far.

sys/compat/linuxkpi/common/src/linux_80211.c
1092

In theory with the current code nothing else can/should queue or dequeue to/from lsta->txq at this point so we could do the operation there lockless. But I cannot write an assertion for that to catch it.
So instead of doing it lockless I decided for consistency to leave the lock (which may or may not) avoid problems in the future in case INVARIANTs change.
I should probably just remove the comment?

sys/compat/linuxkpi/common/src/linux_80211.c
1092

Or add to the comment, e.g. We could do this lockless, but currently do not because... It was just confusing to me to have a comment that says we could do it lockless and then have a lock.