Page MenuHomeFreeBSD

rtw89: Fix TX panics
ClosedPublic

Authored by ashafer_badland.io on Tue, Oct 8, 6:16 AM.
Tags
None
Referenced Files
Unknown Object (File)
Thu, Oct 17, 9:53 PM
Unknown Object (File)
Mon, Oct 14, 4:46 PM
Unknown Object (File)
Mon, Oct 14, 4:41 PM
Unknown Object (File)
Sat, Oct 12, 11:00 AM
Unknown Object (File)
Sat, Oct 12, 11:00 AM
Unknown Object (File)
Sat, Oct 12, 11:00 AM
Unknown Object (File)
Sat, Oct 12, 10:59 AM
Unknown Object (File)
Sat, Oct 12, 10:52 AM

Details

Summary

This fixes rtw89's TX path, which was previously faulting constantly
in linuxkpi_ieee80211_next_txq. Our "double scheduling" check was
slightly incorrect, checking if the next pointer was valid. The
next pointer may be NULL if the element is the last in the tailq,
we should instead check tqe_prev.

Because of this we were queuing an element to the tailq twice, and
because it was the last element and had a NULL tqe_next, that NULL
value would get propogated into another node's tqe_prev on removal,
and other such nastiness.

With this rtw89 seems to perform quite nicely, I can actually get
a very stable connection with decent 802.11a speeds.

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped
Build Status
Buildable 59843
Build 56728: arc lint + arc unit

Event Timeline

What's preventing rtw89 from getting enabled in the build? With this fix I could comfortably test youtube, iperf3, ssh. Speed was ~20mbps but stable. I think it could make sense to open this up to more people.

I think the comment could use some clarification. "tqe_prev is always valid, tqe_next may be NULL." is confusing when we then check tqe_prev for NULL.

Thanks updated, does this look more clear?

What's preventing rtw89 from getting enabled in the build? With this fix I could comfortably test youtube, iperf3, ssh. Speed was ~20mbps but stable. I think it could make sense to open this up to more people.

I suspect the only thing that prevents enabling it is the "faulting constantly in linuxkpi_ieee80211_next_txq." In other words, if it's stable with this change I imagine it should be built after this goes in.

How can we be sure tqe_prev = NULL when it isn't scheduled?

Also, this reminds me, I gotta make progress on firmware loading

linuxkpi_ieee80211 has a function it calls to zero the next/prev fields called TAILQ_ELEM_INIT, so that it could do this check.

linuxkpi_ieee80211 has a function it calls to zero the next/prev fields called TAILQ_ELEM_INIT, so that it could do this check.

Ah, cool. And there's no re-use issues? Eg, it was schedule, then we re-used the txq for some reason, being exposed to a stale prev pointer?

What's preventing rtw89 from getting enabled in the build? With this fix I could comfortably test youtube, iperf3, ssh. Speed was ~20mbps but stable. I think it could make sense to open this up to more people.

Blockers like what @misha fixed and possibly thins one. I've got drivers updates pending which may or may not restart some of this process; I haven't tested but I have locally enabled rtw89.

In D47006#1071711, @bz wrote:

What's preventing rtw89 from getting enabled in the build? With this fix I could comfortably test youtube, iperf3, ssh. Speed was ~20mbps but stable. I think it could make sense to open this up to more people.

Blockers like what @misha fixed and possibly thins one. I've got drivers updates pending which may or may not restart some of this process; I haven't tested but I have locally enabled rtw89.

I was told by @misha that rtw89 in current state (with this patch applied) stops to work for him after about 1 minute of inactivity so you have to run background ping to keep it living

Huh I think I had seen the long idle disconnect thing earlier today but didn't realize that was what it was.

Anyway, not trying to distract the review and get off in the weeds, I was just curious what the latest status was.

sys/compat/linuxkpi/common/src/linux_80211.c
6082

In theory this check is right; are you sure the panics were not related to races due to missing locking? (see IMPROVE)
So far no one really exercised this code so I am not surprised.

I am happy we have three parties now at least looking into rtw89.

In D47006#1071703, @imp wrote:

How can we be sure tqe_prev = NULL when it isn't scheduled?

Also, this reminds me, I gotta make progress on firmware loading

@imp would be nice if we could MFC the plain file firmware loading to stable/14. The other thing this reminds me is that I need to use the "gap" in the release schedule to sort per-shipped firmware with releng

sys/compat/linuxkpi/common/src/linux_80211.c
6082

Yes, I could trigger it by single stepping in kgdb to see why it faulted. It took some digging through the TAILQ_REMOVE code to figure out why things would suddenly change. Because these next/prev entries are double pointers it would cause the first/last entries in the head to change, as alluded to in the change description.

Maybe there are locking issues as well but I haven’t observed them (yet).

For reference I reproduced this after a few minutes w/o the patch:

this hits in linuxkpi_ieee80211_next_txq() at:

TAILQ_REMOVE(&lhw->scheduled_txqs[ac], ltxq, txq_entry);
Fatal trap 12: page fault while in kernel mode
cpuid = 1; apic id = 01
fault virtual address   = 0x0
fault code              = supervisor read data, page not present
instruction pointer     = 0x20:0xffffffff80de89b8
stack pointer           = 0x28:0xfffffe0068e47d10
frame pointer           = 0x28:0xfffffe0068e47d20
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 0 (rtw89_tx_wq_2)
rdi: fffffe0086c70440 rsi: fffff800060e6200 rdx: 0000000000000020
rcx: 0000000000000000  r8: ffffffff818d39f0  r9: 0000000000000000
rax: 0000000000000002 rbx: fffffe0086c70200 rbp: fffffe0068e47d20
r10: 0000000000010000 r11: 0000000000000001 r12: fffffe0086c70440
r13: fffffe0086c70500 r14: 0000000000000002 r15: 0000000000000000
trap number             = 12
panic: page fault
cpuid = 1
time = 1728512884
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0068e479e0
vpanic() at vpanic+0x13f/frame 0xfffffe0068e47b10
panic() at panic+0x43/frame 0xfffffe0068e47b70
trap_fatal() at trap_fatal+0x40b/frame 0xfffffe0068e47bd0
trap_pfault() at trap_pfault+0xa0/frame 0xfffffe0068e47c40
calltrap() at calltrap+0x8/frame 0xfffffe0068e47c40
--- trap 0xc, rip = 0xffffffff80de89b8, rsp = 0xfffffe0068e47d10, rbp = 0xfffffe0068e47d20 ---
linuxkpi_ieee80211_next_txq() at linuxkpi_ieee80211_next_txq+0x88/frame 0xfffffe0068e47d20
rtw89_core_txq_work() at rtw89_core_txq_work+0xa6/frame 0xfffffe0068e47df0
linux_work_fn() at linux_work_fn+0xe3/frame 0xfffffe0068e47e40
taskqueue_run_locked() at taskqueue_run_locked+0x1c2/frame 0xfffffe0068e47ec0
taskqueue_thread_loop() at taskqueue_thread_loop+0xd3/frame 0xfffffe0068e47ef0
fork_exit() at fork_exit+0x82/frame 0xfffffe0068e47f30
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe0068e47f30
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
KDB: enter: panic
[ thread pid 0 tid 100125 ]
Stopped at      kdb_enter+0x33: movq    $0,0x1054732(%rip)

I'll commit it, likely tomorrow; I haven't been able to reproduce the panic anymore; Thanks.

If I were to work on the "IMPROVE" bits, would you be willing to test/review some? Are you in the wireless group by any chance?

This revision is now accepted and ready to land.Thu, Oct 10, 12:08 AM

Sure, what IMPROVING file are you referring to? I'd be happy to test things, I'm trying to get a modern laptop setup on this lenovo legion laptop I have so it would be good to use rtw89 for that.

I'm not in the wireless group but I suppose I could be. I also lurk in the kernel/desktop/gaming channels in the freebsd discord if you use that, feel free to ping or email me.

Sure, what IMPROVING file are you referring to? I'd be happy to test things, I'm trying to get a modern laptop setup on this lenovo legion laptop I have so it would be good to use rtw89 for that.

There are IMPROVE macros around sysctl enabled logging in that file (and some others for wireless). There are also TODO()s.