Page MenuHomeFreeBSD

net80211(4): fix the connection-fail problem in wpa_supplicant(8)
AbandonedPublic

Authored by enweiwu on Feb 11 2023, 5:59 PM.
Referenced Files
Unknown Object (File)
Fri, Dec 27, 5:34 PM
Unknown Object (File)
Nov 7 2024, 8:49 PM
Unknown Object (File)
Sep 24 2024, 10:28 PM
Unknown Object (File)
Sep 24 2024, 10:19 PM
Unknown Object (File)
Sep 24 2024, 8:33 PM
Unknown Object (File)
Sep 17 2024, 11:57 AM
Unknown Object (File)
Sep 17 2024, 8:45 AM
Unknown Object (File)
Sep 17 2024, 2:54 AM

Details

Reviewers
lwhsu
bz
adrian
cy
Summary

Below is the planned commit message:

Someone may continuously see CTRL-EVENT-SCAN-FAILED and not be able to connect to an AP
when directly using wpa_supplicant(8) in the command line, and someone specifies 
ifconfig_wlan0="WPA in /etc/rc.conf will not encounter this problem because commit 5fcdc19a81115 and d06d7eb09131 use a weird way to address the problem. Even though the weird way may not
successfully connect to AP every time.

The problem is that sometimes net80211(4) doesn't pass the RTM_IEEE80211_SCAN to the routing socket
when scan results are available. So wpa_supplicant(8) will never issue
IEEE80211_IOC_SCAN_RESULTS to fetch the scan results since it is not notified by the routing socket.

I have added ieee80211_notify_scan_done() (which will send RTM_IEEE80211_SCAN to user processes
listened on a routing socket) in the early return part of scan_end(),
and removed the weird part of libexec/rc/rc.d/wpa_supplicant.

Short links for commits mentioned above: commit 5fcdc19a81115 and d06d7eb09131

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

Has this always been a problem or did something in wpa_supplicant change that it now is a problem?

There's another patch floating around which tracks the IFF_UP in wpa_supplicant (also related to this issue) but nothing to do with the scan result, which has proven to be void (non-effective).

Given scan happens after if up/down I am not entirely sure why one and the other problems are related?

I am just trying to better understand if your change indeed fixes "problems" people where seeing due to different things which weren't quite understood at that time in the past?

PS: for simplicity, can we make the git hashes links to the commits?

There may have been one or two hotels that I've stayed out where I hit this problem...
I'm not at them right now, or I'd test to see if this fixes things.

In D38508#876443, @bz wrote:

Has this always been a problem or did something in wpa_supplicant change that it now is a problem?

I traced wpa_supplicant 2.4 but the mechanism (use routing socket to wait for RTM_IEEE80211_SCAN) seems the same as 2.10.

I wrote a simple program waiting for an RTM_IEEE80211_SCAN sent from a wireless interface, and in FreeBSD 14.0-CURRENT, it never receive any RTM_IEEE80211_SCAN event until I IF down/up the interface. It should be a bug that net80211(4) doesn't tell the user process the scan results are available. I will do the same test on FreeBSD 10.3-RELEASE to find out whether net80211(4) changed.

Given scan happens after if up/down I am not entirely sure why one and the other problems are related?

I am just trying to better understand if your change indeed fixes "problems" people where seeing due to different things which weren't quite understood at that time in the past?

It seems that when IF is down, net80211(4) will call scan_done() to cancel the scan (I'm not very sure, but I marked the possible lines here: ieee80211_scan_sw.c), and in scan_done, it notifies RTM_IEEE80211_SCAN.

I have found that Linux drivers don't always notify the scan result event to nl80211, so in wpa_supplicant, Linux code just registers timeout on the scan function (wpa_driver_nl80211_scan()). Once the timeout fire, it reports the EVENT_SCAN_RESULTS to wpa_supplicant_event().

I don't know if we can modify driver_bsd.c in wpa_supplicant to do the same thing as Linux code, or if we actively send the scan result event to wpa_supplicant via net80211 (the way this patch does).

Phabricator should automatically make likes, at least for hashes in plain text:
5fcdc19a81115
d06d7eb09131

Phabricator should automatically make likes, at least for hashes in plain text:
5fcdc19a81115
d06d7eb09131

Thanks; I guess it was the `quoted text` which prevented it then.

Can you check that what you are seeing is not caused by an early return in ieee80211_scan_sw.c::scan_end() in the block which has a debug message containing "done, restart"?
It would probably be interesting to see a trace from your investigation when the problem of not reporting the scan end occurs with wlandebug +scan turned on. That'll likely give us an idea why we don't make it to either of the two current ieee80211_notify_scan_done() calls.

Reading the description of EVENT_SCAN_RESULTS in wpa/src/drivers/driver.h it sounds like the current (in theory) net80211 of notifying the driver (driver_bsd.c in wpa) at the end of a completed (or aborted) scan seem right. If not seeing that message is indeed the problem, then we should try and find out when we finish a scan and not send it and how we run into that (hence asking for the trace above). Also what driver are you testing with? The LinuxKPI once are probably the best but I just checked with an iwlwifi and the down/up calls in the rc.d script removed:

root@iwl-bhyve:~ # dmesg -a | egrep -E '(wlan99|scan)'
[86.444263] wlan99: bpf attached
[86.447664] wlan99: bpf attached
[86.447898] wlan99: Ethernet address: 84:c5:a6:82:d6:56
[86.606102] wlan99: got interface identifier from itself
[86.606104] wlan99: ifid: 86:c5:a6:ff:fe:82:d6:56
[86.606280] wlan99: starting DAD for fe80:3::86c5:a6ff:fe82:d656
[86.710184] wlan99: ieee80211_scanreq: flags 0x20052 duration 0x7fffffff mindwell 0 maxdwell 0 nssid 1
[86.713497] wlan99: ieee80211_swscan_cancel_scan: called; F_SCAN=0, vap=nomatch, signal=0
[86.714420] wlan99: ieee80211_check_scan: active scan, append, nojoin, once
[86.715170] wlan99: ieee80211_scan_update_locked: current scanner is <none:none>, switch to <wlan99:STA>
[86.716261] wlan99: ieee80211_swscan_start_scan_locked: active scan, duration 2147483647 mindwell 0 maxdwell 0, desired mode auto, flush, nojoin, once
[86.717901] wlan99: scan set 1g, 6g, 11g, 7g, 13g, 52a, 56a, 60a, 64a, 36a, 40a, 44a, 48a, 2g, 3g, 4g, 5g, 8g, 9g, 10g, 12g, 100a, 104a, 108a, 112a, 116a, 120a, 124a, 128a, 132a, 136a, 140a dwell min 20ms max 200ms
[86.720221] LKPI_80211_TRACE_MO lkpi_80211_mo_hw_scan:232: 0 100118 2147432219_CALLING hw 0xfffffe005759b340 vif 0xfffffe008e2b0cc0 sr 0xfffff80001860000
[86.724718] LKPI_80211_TRACE_MO lkpi_80211_mo_hw_scan:234: 0 100118 2147432220_RETURNING hw 0xfffffe005759b340 vif 0xfffffe008e2b0cc0 sr 0xfffff80001860000 error 0
[86.726376] wlan99: scan_curchan_task: loop start; scandone=0, scanstop=0, ss_iflags=0x2, ss_next=0, ss_last=32
[86.727479] wlan99: scan_curchan_task: chan   1b ->   1g [active, dwell min 20ms max 200ms]
[86.728360] wlan99: scan_curchan_task: waiting
[86.852667] wlan99: ieee80211_swscan_add_scan: chan   1g min dwell met (2147432232 > 2147432222)
[88.124279] wlan99: DAD complete for fe80:3::86c5:a6ff:fe82:d656 - no duplicates found
[89.276523] wlan99: ieee80211_scan_done: called
[89.277291] wlan99: scan_curchan_task: loop start; scandone=1, scanstop=1, ss_iflags=0x21, ss_next=32, ss_last=32
[89.278703] wlan99: scan_end: out
[89.279009] wlan99:  macaddr          bssid         chan  rssi  rate flag  wep  essid
[89.284695] wlan99: scan_end: done, [ticks 2147432476, dwell min 2 scanend 4294915866]
[89.285598] wlan99: scan_done: Dropping out of scan; ss_next=32, ss_last=32
[89.286372] wlan99: notify scan done
[89.314378] lkpi_sta_scan_to_auth:1109 lsta 0xfffff8000185f800 ni 0xfffffe008e344000 sta 0xfffff8000185f8c0
[89.378319] wlan99: ieee80211_swscan_cancel_scan: called; F_SCAN=0, vap=match, signal=0
[89.423409] wlan99: ieee80211_swscan_cancel_scan: called; F_SCAN=0, vap=match, signal=0
[89.424222] wlan99: [9a:9d:7e:76:6f:fa] ieee80211_scan_assoc_success
[89.424885] wlan99: link state changed to UP
sys/net80211/ieee80211_scan_sta.c
420 ↗(On Diff #116992)

So sta_add is called by ieee80211_swscan_add_scan() which is called by ieee80211_add_scan(), which (in STA mode) is called from sta_recv_mgmt(). This all operates on a single frame and not an entire scan if I am not mistaken.

Phabricator should automatically make likes, at least for hashes in plain text:
5fcdc19a81115
d06d7eb09131

I have edited the summary and now it's work :)

Can you check that what you are seeing is not caused by an early return in ieee80211_scan_sw.c::scan_end() in the block which has a debug message containing "done, restart"?

Sure! You're right. After tracing the output and net80211(4), it's very sure that we match the condition in the block you mentioned and the early return. Here is my output of dmesg(8):

ricky@wtap:~ $ dmesg -a | egrep -E '(wlan0|scan)'
/* initilization, skip */ 
wlan0: ieee80211_check_scan: active scan, append
wlan0: ieee80211_scan_update_locked: current scanner is <none:none>, switch to <wlan0:STA>
wlan0: ieee80211_swscan_start_scan_locked: active scan, duration 2147483647 mindwell 0 maxdwell 0, desired mode auto, flush
wlan0: scan set
wlan0: scan_curchan_task: loop start; scandone=0, scanstop=0, ss_iflags=0x2, ss_next=0, ss_last=11
wlan0: scan_curchan_task: chan   8g ->   1g [active, dwell min 20ms max 200ms]
wlan0: scan_curchan: calling; maxdwell=20
wlan0: scan_curchan_task: waiting

/* Repeatedly scan different channels, skip */

wlan0: scan_curchan_task: chan   9g ->  10g [active, dwell min 20ms max 200ms]
wlan0: scan_curchan: calling; maxdwell=20
wlan0: scan_curchan_task: waiting
wlan0: scan_curchan_task: loop start; scandone=1, scanstop=1, ss_iflags=0x20, ss_next=11, ss_last=11
wlan0: scan_end: out
wlan0:  macaddr          bssid         chan  rssi  rate flag  wep  essid
- 34:60:f9:cd:af:9a 34:60:f9:cd:af:9a    6    39  54M   ess  wep! "Oxy-"
 - b6:95:75:08:ca:78 b6:95:75:08:ca:78    2    15  54M   ess  wep! "YAO-atelier"
 - b0:95:75:08:ca:77 b0:95:75:08:ca:77    2    15  54M   ess  wep! "YAO-atelier-1F"
 ^ 42:74:e0:fb:66:5c 42:74:e0:fb:66:5c   11    50  54M   ess  wep! "nckuhub"
wlan0: scan_end: done, restart [ticks 2147552909, dwell min 2 scanend 68975]
wlan0: scan_curchan_task: loop start; scandone=0, scanstop=0, ss_iflags=0x0, ss_next=0, ss_last=11
wlan0: scan_curchan_task: chan  10g ->   1g [active, dwell min 20ms max 200ms]
wlan0: scan_curchan: calling; maxdwell=20
wlan0: scan_curchan_task: waiting

As we can see, it didn't reach scan_done() after scan_end() (early return) and therefore no ieee80211_notify_scan_done() is called.

Also what driver are you testing with

My NIC is rtl8188eu, and the driver is rtwn(4).

/* Lines about an experiment, one can skips theses lines and jump to the last paragraph to see the conclusion :) */

If not seeing that message is indeed the problem, then we should try and find out when we finish a scan and not send it and how we run into that (hence asking for the trace above)

I surely ran into the early return part of scan_end(). Here is the most critical condition to indicate why it early return in scan_end(): https://cgit.freebsd.org/src/tree/sys/net80211/ieee80211_scan_sw.c?h=stable/13#n846. In STA mode, ss->ss_ops->scan_end(ss, vap) is pointed to ieee80211_scan_sta::sta_pick_bss(), and this function returned 0 (so the condition is true). It didn't pick any BSS (return 0) because the ieee80211_scan_sta::match_bss() returns 0 if we further trace more codes. And I found that the lines show BSS in the dmesg(8) output above all contains "wep!", and after tracing the codes, the "!" means the BSS didn't match our requirements of wireless setting. In summary, after the scan finished, it find that no suitable BSS is there (in match_bss()'s thoughts) so didn't call scan_done() and directly restart the scanning process.

But I saw that your dmesg(8) output has scan_done() called, so one thought came to me: maybe I could try to scan more BSS, and perhaps there is one BSS that match_bss() will think it matches. So I started to walk around with my laptop and collected as many BSS as possible. And I finally found one BSS that match_bss() thinks it's match, here is the dmesg(8) output:

wlan0:  macaddr          bssid         chan  rssi  rate flag  wep  essid
 /* skip ... */
 - f4:f2:6d:f1:e1:ae f4:f2:6d:f1:e1:ae    9    25  54M   ess  wep! "R3474"
 - 14:4d:67:63:85:38 14:4d:67:63:85:38   10    25  54M   ess  wep! "MBELAB"
 + cc:b2:55:ca:e8:2b cc:b2:55:ca:e8:2b    2    23  54M   ess   no  "dlink"
wlan0: ieee80211_vap_set_shortslottime: called; onoff=1
wlan0: sta_pick_bss: terminate scan; return 1
wlan0: scan_end: done, [ticks 2147778613, dwell min 2 scanend 294673]
wlan0: scan_done: Dropping out of scan; ss_next=11, ss_last=11
wlan0: notify scan done
wlan0: [cc:b2:55:ca:e8:2b] send auth on channel 2
wlan0: ni 0xfffffe00557e1000 vap 0xfffff8006446d000 mode STA state AUTH m 0xfffff800643d9700 status 0
wlan0: ni 0xfffffe00557e1000 mode STA state AUTH ostate 2 arg 0x2 status 0
wlan0: vap 0xfffff8006446d000 mode STA state AUTH flags 0x2400 & 0x80
wlan0: ieee80211_swscan_cancel_scan: called; F_SCAN=0, vap=match, signal=0
wlan0: [cc:b2:55:ca:e8:2b] ieee80211_scan_assoc_fail: reason 1
wlan0: [cc:b2:55:ca:e8:2b] sta_assoc_fail: reason 1 fails 1
wlan0: ieee80211_check_scan: active scan, append
wlan0: cache hot; ic_lastscan=-2147188689, scanvalid=6000, ticks=-2147188472
wlan0:  macaddr          bssid         chan  rssi  rate flag  wep  essid
/* ... */

As we can see, the "dlink" has no "!" in the wep field, and match_bss() thinks it's a match, so finally it called scan_done() since it didn't early return in scan_end(). This result is more like yours I thought.

/* Skip until here */

So sta_add is called by ieee80211_swscan_add_scan() which is called by ieee80211_add_scan(), which (in STA mode) is called from sta_recv_mgmt(). This all operates on a single frame and not an entire scan if I am not mistaken.

Yes, you're right. I found that the place I put the notification call is not suitable after seeing your comment. The ieee80211_notify_scan_done() is more reasonable to be called after the whole scanning process is done rather than receiving a probe response or beacon.

I think it's reasonable to call ieee80211_notify_scan_done() at the end of the early return part of scan_end(), since we indeed have some scan results, and wpa_supplicant(8) may think they're suitable BSS and do the further auth/assoc. So do you think appending one line (ieee80211_notify_scan_done()) in the early return part of scan_end() is reasonable?

iirc, and don't quote me on this, the "scan done + restart" stuff can happen when the software scanner is doing things like background scans (which break softmac 11n/11ac/etc devices, sigh, someone needs to employ me for 6 months to properly fix that for softmac devices already) it can start and stop a scan, push up scan results, but not complete it. background scans can be interrupted by active traffic, which means it'll try to do a piecemeal scan before pushing things up.

adding scan done after sta add is definitely the wrong solution :-) bz's on the right track. we need to go find when the scan done at the end of the channel list is hit in the scan_sta state machine and /then/ ensure we're always pushing that scan routing message up.

(also I thought we had a program somewhere in tools/tools/net80211 that would listen to the routing socket for all net80211 messages, to make debugging this easier? if not, we should write one :-)

enweiwu edited the summary of this revision. (Show Details)

Move ieee80211_notify_scan_done() to a more reasonable place after seeing the comments of @bz.

I surely ran into the early return part of scan_end(). Here is the most critical condition to indicate why it early return in scan_end(): https://cgit.freebsd.org/src/tree/sys/net80211/ieee80211_scan_sw.c?h=stable/13#n846. In STA mode, ss->ss_ops->scan_end(ss, vap) is pointed to ieee80211_scan_sta::sta_pick_bss(), and this function returned 0 (so the condition is true). It didn't pick any BSS (return 0) because the ieee80211_scan_sta::match_bss() returns 0 if we further trace more codes. And I found that the lines show BSS in the dmesg(8) output above all contains "wep!", and after tracing the codes, the "!" means the BSS didn't match our requirements of wireless setting. In summary, after the scan finished, it find that no suitable BSS is there (in match_bss()'s thoughts) so didn't call scan_done() and directly restart the scanning process.

Question: were any of the results supposed to match? Was there a BSSID you may have joined despite match_bss() thinking you should not?

In wpa_driver_bsd_scan() we call wpa_driver_bsd_set_wpa(,1) which enables PRIVACY and sets WPA1|WPA2 (the magic number 3 there).
To my understanding IEEE80211_IOC_SCAN_MAX_SSID should be set, which means we set

sr.sr_flags = IEEE80211_IOC_SCAN_ACTIVE | IEEE80211_IOC_SCAN_ONCE |
        IEEE80211_IOC_SCAN_NOJOIN;
sr.sr_duration = IEEE80211_IOC_SCAN_FOREVER;

[Note: should validate this for the default build, not just my local one]

Now that would mean SCAN_ONCE should not be unset and we should not enter in the early case above even if match_bss et al decide they couldn't find a result.
The [86.710184] wlan99: ieee80211_scanreq: flags 0x20052 duration 0x7fffffff mindwell 0 maxdwell 0 nssid 1 line could confirm the flags set (CHECK|NOJOIN|ONCE|ACTIVE).

Now we both run into check_scan but with different arguments, which I do not understand yet.

[86.714420] wlan99: ieee80211_check_scan: active scan, append, nojoin, once
             wlan0: ieee80211_check_scan: active scan, append

Which supplicant and version are you using? I currently cannot find a place which would result in ACTIVE|~FLUSH as a set,

Now that would mean SCAN_ONCE should not be unset and we should not enter in the early case above even if match_bss et al decide they couldn't find a result.

Sorry for not telling you that the dmesg(8) output I put earlier is IF up without wpa_supplicant, so the IEEE80211_IOC_SCAN_ONCE is unset due to the default setting of scan flag by net80211. Why I posted the output without wpa_supplicant is because directly-calling wpa_supplicant (without IF down/up) will become no effect, since wpa_supplicant issues the IEEE80211_IOC_SCAN_REQ with flag = IEEE80211_SCAN_ACTIVE | IEEE80211_SCAN_NOJOIN | IEEE80211_SCAN_ONCE but ieee80211_swscan_start_scan_locked() will complain that scan is already in progress, so the flag seems to have no effect.

I have to apologize again for the messy output I put in earlier. Although the directly-calling wpa_supplicant does not affect most part of the scanning process, IEEE80211_IOC_PRIVACY and IEEE80211_IOC_WPA do have an effect so the output I put earlier shouldn't have wep with "!" in it if I run wpa_supplicant.

I do the experiment carefully againg, and I split the experiment into three parts:

  1. IF up
  2. directly calling wpa_supplicant (no effect on scanning)
  3. wpa_supplicant with IF down/up (by weird two lines in /etc/rc.d/wpa_supplicant) (has an effect on scanning)

Please, see the dmesg(8) output below, in which I have put many comments to describe the whole thing in more details.

/* ------------ Part 1 ------------------
 * IF up (before wpa_supplicant)
 * $ ifconfig wlan5 create wlandev rtwn0
 * $ wlandebug -i wlan5 +scan +state
 * $ ifconfig wlan5 up
 */

wlan6: Ethernet address: 50:3e:aa:8e:fe:32
wlan6: start running, 0 vaps running
wlan6: ieee80211_start_locked: up parent rtwn0
wlan6: start running, 1 vaps running
wlan6: ieee80211_new_state_locked: INIT -> SCAN (arg 0) (nrunning 0 nscanning 0)
wlan6: ieee80211_newstate_cb: INIT -> SCAN arg 0
wlan6: sta_newstate: INIT -> SCAN (0)
wlan6: ieee80211_swscan_cancel_scan: called; F_SCAN=0, vap=nomatch, signal=0

// sta_newstate() finds that (vap->iv_flags_ext & IEEE80211_FEXT_SCANREQ == 0),
// so ieee80211_check_scan_current() is called, with flags = IEEE80211_SCAN_ACTIVE

wlan6: ieee80211_check_scan: active scan, append
wlan6: ieee80211_scan_update_locked: current scanner is <none:none>, switch to <wlan6:STA>
wlan6: ieee80211_swscan_start_scan_locked: active scan, duration 2147483647 mindwell 0 maxdwell 0, desired mode auto, flush
wlan6: scan set
1g, 6g, 11g, 7g, 2g, 3g, 4g, 5g, 8g, 9g, 10g dwell min 20ms max 200ms
wlan6: scan_curchan_task: loop start; scandone=0, scanstop=0, ss_iflags=0x2, ss_next=0, ss_last=11
wlan6: scan_curchan_task: chan  11n ->   1g [active, dwell min 20ms max 200ms]
wlan6: scan_curchan: calling; maxdwell=20
wlan6: scan_curchan_task: waiting

/* skiped, scan_curchan_task on different channels */

wlan6: scan_curchan_task: loop start; scandone=1, scanstop=1, ss_iflags=0x20, ss_next=11, ss_last=11
wlan6: scan_end: out
wlan6:  macaddr          bssid         chan  rssi  rate flag  wep  essid
 - 34:60:f9:cd:af:9a 34:60:f9:cd:af:9a    6    35  54M   ess  wep! "Oxy-"
 - 42:74:e0:fb:66:5c 42:74:e0:fb:66:5c   11    50  54M   ess  wep! "nckuhub"
wlan6: scan_end: done, restart [ticks 2147877238, dwell min 2 scanend 393273]
wlan6: scan_curchan_task: loop start; scandone=0, scanstop=0, ss_iflags=0x0, ss_next=0, ss_last=11
wlan6: scan_curchan_task: chan  11n ->   1g [active, dwell min 20ms max 200ms]
wlan6: scan_curchan: calling; maxdwell=20
wlan6: scan_curchan_task: waiting

/* skiped, scan_curchan_task on different channels */

wlan6: scan_curchan_task: loop start; scandone=0, scanstop=0, ss_iflags=0x20, ss_next=6, ss_last=11
wlan6: scan_curchan_task: chan   3g ->   4g [active, dwell min 20ms max 200ms]
wlan6: scan_curchan: calling; maxdwell=20
wlan6: scan_curchan_task: waiting

/* ------------ Part 2 ------------------
 * wpa_supplicant without IF down/up
 * $ wpa_supplicant -i wlan6 -c wpa.conf (run wpa_supplicant without IF down/up)
 */

// wpa_supplicant issues IEEE80211_IOC_SCAN_REQ with
// flags = IEEE80211_SCAN_ACTIVE | IEEE80211_SCAN_NOJOIN | IEEE80211_SCAN_ONCE

wlan6: ieee80211_scanreq: flags 0x20052 duration 0x7fffffff mindwell 0 maxdwell 0 nssid 1
wlan6: ieee80211_check_scan: active scan, append, nojoin, once

// failed, since scan already in progress, so flags has no effect

wlan6: ieee80211_swscan_start_scan_locked: active scan already in progress

// scan as usual, flags = IEEE80211_SCAN_ACTIVE

wlan6: scan_curchan_task: loop start; scandone=0, scanstop=0, ss_iflags=0x20, ss_next=7, ss_last=11
wlan6: scan_curchan_task: chan   4g ->   5g [active, dwell min 20ms max 200ms]
wlan6: scan_curchan: calling; maxdwell=20
wlan6: scan_curchan_task: waiting

// wpa_supplicant issued IEEE80211_IOC_SCAN_REQ and failed again

wlan6: ieee80211_scanreq: flags 0x20052 duration 0x7fffffff mindwell 0 maxdwell 0 nssid 1
wlan6: ieee80211_check_scan: active scan, append, nojoin, once
wlan6: ieee80211_swscan_start_scan_locked: active scan already in progress

/* skiped, many scan_curchan_task() calls as usual, with flags = IEEE80211_SCAN_ACTIVE */

/* ------------ Part 3 ------------------
 * wpa_supplicant with IF down/up
 * $ /etc/rc.d/wpa_supplicant start wlan5
 */

Feb 18 07:18:38 wtap wpa_supplicant[3109]: ioctl[SIOCS80211, op=20, val=0, arg_len=7]: Invalid argument
wlan6: stop running, 1 vaps running
wlan6: ieee80211_new_state_locked: SCAN -> INIT (arg -1) (nrunning 0 nscanning 0)
wlan6: ieee80211_swscan_cancel_scan: cancel active scan
wlan6: down parent rtwn0
wlan6: scan_curchan_task: loop start; scandone=1, scanstop=1, ss_iflags=0x28, ss_next=11, ss_last=11
wlan6: scan_end: out
wlan6: scan_end: done, [ticks 2147878737, dwell min 2 scanend 394782]
wlan6: scan_done: Dropping out of scan; ss_next=11, ss_last=11
wlan6: notify scan done
wlan6: ieee80211_newstate_cb: SCAN -> INIT arg -1
wlan6: sta_newstate: SCAN -> INIT (-1)
wlan6: ieee80211_swscan_cancel_scan: called; F_SCAN=0, vap=match, signal=0
wlan6: ieee80211_scan_flush
wlan6: start running, 0 vaps running
wlan6: ieee80211_start_locked: up parent rtwn0
wlan6: ieee80211_scanreq: flags 0x20052 duration 0x7fffffff mindwell 0 maxdwell 0 nssid 1
wlan6: ieee80211_new_state_locked: INIT -> SCAN (arg 0) (nrunning 0 nscanning 0)
wlan6: start running, 1 vaps running
wlan6: ieee80211_newstate_cb: INIT -> SCAN arg 0
wlan6: sta_newstate: INIT -> SCAN (0)
wlan6: ieee80211_swscan_cancel_scan: called; F_SCAN=0, vap=match, signal=0

// ieee80211_check_scan() is called by sta_newstate(),
// flags = IEEE80211_SCAN_ACTIVE | IEEE80211_SCAN_NOJOIN | IEEE80211_SCAN_ONCE
// do has effect, since no scan is processing

wlan6: ieee80211_check_scan: active scan, append, nojoin, once
wlan6: cache hot; ic_lastscan=-2147088565, scanvalid=6000, ticks=-2147087931
wlan6: sta_pick_bss: no scan candidate, join=1, return 0
wlan6: ieee80211_swscan_check_scan: scan_end returned 0
wlan6: ieee80211_swscan_start_scan_locked: active scan, duration 2147483647 mindwell 0 maxdwell 0, desired mode auto, append, nojoin, once
wlan6: scan set
1g, 6g, 11g, 7g, 2g, 3g, 4g, 5g, 8g, 9g, 10g dwell min 20ms max 200ms
wlan6: scan_curchan_task: loop start; scandone=0, scanstop=0, ss_iflags=0x2, ss_next=0, ss_last=11
wlan6: scan_curchan_task: chan  11n ->   1g [active, dwell min 20ms max 200ms]
wlan6: scan_curchan: calling; maxdwell=20
wlan6: scan_curchan_task: waiting
wlan6: scan_curchan_task: loop start; scandone=0, scanstop=0, ss_iflags=0x20, ss_next=10, ss_last=11
wlan6: scan_curchan_task: chan   9g ->  10g [active, dwell min 20ms max 200ms]
wlan6: scan_curchan: calling; maxdwell=20
wlan6: scan_curchan_task: waiting
wlan6: scan_curchan_task: loop start; scandone=1, scanstop=1, ss_iflags=0x20, ss_next=11, ss_last=11
wlan6: scan_end: out
wlan6:  macaddr          bssid         chan  rssi  rate flag  wep  essid
 - 34:60:f9:cd:af:9a 34:60:f9:cd:af:9a    6    35  54M   ess  wep  "Oxy-"!
 - 42:74:e0:fb:66:5c 42:74:e0:fb:66:5c   11    53  54M   ess  wep  "nckuhub"!
wlan6: scan_end: done, [ticks 2147879645, dwell min 2 scanend 395716]

// Didn't find appropriate BSS, but it's OK, 
// since flag has IEEE80211_SCAN_ONCE, so scan_done() is called

wlan6: scan_done: Dropping out of scan; ss_next=11, ss_last=11
wlan6: notify scan done

wlan6: [42:74:e0:fb:66:5c] station assoc via MLME
wlan6: ieee80211_new_state_locked: SCAN -> AUTH (arg 192) (nrunning 0 nscanning 0)
wlan6: ieee80211_newstate_cb: SCAN -> AUTH arg 192
wlan6: sta_newstate: SCAN -> AUTH (192)
wlan6: ni 0xfffffe0094a5b000 vap 0xfffff80044de9000 mode STA state AUTH m 0xfffff8001be85e00 status 0
wlan6: ni 0xfffffe0094a5b000 mode STA state AUTH ostate 2 arg 0x2 status 0
wlan6: vap 0xfffff80044de9000 mode STA state AUTH flags 0x2400 & 0x80
wlan6: ieee80211_new_state_locked: AUTH -> SCAN (arg 1) (nrunning 0 nscanning 0)
wlan6: ieee80211_newstate_cb: AUTH -> SCAN arg 1
wlan6: sta_newstate: AUTH -> SCAN (1)
wlan6: ieee80211_swscan_cancel_scan: called; F_SCAN=0, vap=match, signal=0
wlan6: [42:74:e0:fb:66:5c] ieee80211_scan_assoc_fail: reason 1
wlan6: [42:74:e0:fb:66:5c] sta_assoc_fail: reason 1 fails 1

Why the directly-calling wpa_supplicant has no effect is probably because the scanning process is already fired when IF up in the earlier by net80211, and the worse thing is the scanning process fired earlier is likely to run infinitely since it didn't successfully pick a BSS, and also the IEEE80211_IOC_SCAN_ONCE is unset, so it runs into the early return part in scan_end() we discuss before. And why the wpa_supplicant with IF down/up will work is probably (I'm not very sure) because sta_newstate() finds that vap->iv_flags_ext & IEEE80211_FEXT_SCANREQ == 1 so the scan is fired with scan_req flag passed by the wpa_supplicant.

I found that adding ieee80211_notify_scan_done() in the early return part in scan_end() is not very reasonable, since someone may just want net80211 to deal with scanning and auth process (I think the case is without WPA). And the early return part of scan_end() means no BSS is suitable (match_bss() thinks), so net80211 doesn't need to notify the user process.

So I think the question is wpa_supplicant cannot interrupt the scanning process already fired so the scan_req flag cannot pass into net80211. Does "calling ieee80211_scan_sw.c::scan_signal() to pass scan_req flag in the case that scanning process is already in progress" make sense?

Question: were any of the results supposed to match? Was there a BSSID you may have joined despite match_bss() thinking you should not?

I'm not sure, but if there is no scan_ssid=1 in wpa_supplicant.conf seems that match_bss() will fail since the ssid wpa_supplicant passed is "". But it seems OK because the scanning process will finally call scan_done() because there is IEEE80211_SCAN_ONCE set in the flag, so wpa_supplicant will finally get the scan results despite fail of match_bss().

I have to apologize again for the messy output I put in earlier.

No need to apologize! This is fantastic work tracing this down.

Why the directly-calling wpa_supplicant has no effect is probably because the scanning process is already fired when IF up in the earlier by net80211, and the worse thing is the scanning process fired earlier is likely to run infinitely since it didn't successfully pick a BSS, and also the IEEE80211_IOC_SCAN_ONCE is unset, so it runs into the early return part in scan_end() we discuss before.

And why the wpa_supplicant with IF down/up will work is probably (I'm not very sure) because sta_newstate() finds that vap->iv_flags_ext & IEEE80211_FEXT_SCANREQ == 1 so the scan is fired with scan_req flag passed by the wpa_supplicant.

I think the reason might be (though I haven't traced this yet) is that we exploit a race with ic_parent and the 2d time we come through ieee80211_start_locked() though this time with the interface UP. This then calls ieee80211_new_state_locked() which will then kick off another task. If we manage to call the ioctl for scanning in between we'll still be in INIT.

wlan6: start running, 0 vaps running
wlan6: ieee80211_start_locked: up parent rtwn0
wlan6: ieee80211_scanreq: flags 0x20052 duration 0x7fffffff mindwell 0 maxdwell 0 nssid 1
wlan6: ieee80211_new_state_locked: INIT -> SCAN (arg 0) (nrunning 0 nscanning 0)
wlan6: start running, 1 vaps running
wlan6: ieee80211_newstate_cb: INIT -> SCAN arg 0
wlan6: sta_newstate: INIT -> SCAN (0)

Basically the IFF_DOWN kicks us back to INIT, no scan running and then on IFF_UP it's a simple race for the 1st VAP (only).

I am still contemplating though what caused the problem to appear at some point and when was that point (possibly long before user reports)?
I keep wondering (though I cannot see the change of state yet) if d807e289db96366da9682505d93ab899fe52739b in the hostap.git is the cause as we used to down the interface actively during setup and then during scan made sure it would be up again. But I have a local modification of that (and no ifconfig down/up in the rc script) and I see no difference in the requested IFF state:

/var/log/debug.log:Feb 14 23:58:55 iwl-bhyve wpa_supplicant[1099]: bsd_ctrl_iface: if wlan99 (no change) enable 0 IFF_UP 0 
/var/log/debug.log:Feb 14 23:58:55 iwl-bhyve wpa_supplicant[1100]: bsd_ctrl_iface: if wlan99 (no change) enable 1 IFF_UP 1 
/var/log/debug.log:Feb 14 23:58:57 iwl-bhyve wpa_supplicant[1100]: bsd_ctrl_iface: if wlan99 (no change) enable 1 IFF_UP 1

But then I only do this automatically on boot but not manually. I'll email the diff out to you to see (hopefully tomorrow).

I found that adding ieee80211_notify_scan_done() in the early return part in scan_end() is not very reasonable, since someone may just want net80211 to deal with scanning and auth process (I think the case is without WPA). And the early return part of scan_end() means no BSS is suitable (match_bss() thinks), so net80211 doesn't need to notify the user process.

ACK

So I think the question is wpa_supplicant cannot interrupt the scanning process already fired so the scan_req flag cannot pass into net80211. Does "calling ieee80211_scan_sw.c::scan_signal() to pass scan_req flag in the case that scanning process is already in progress" make sense?

I think the larger question is: do we want to adjust net80211 or do we want to do what others are also doing and in case we get an EINPROGRESS back in wpa_supplicant, start a timer and then simply manually query the scan results?

If we want to adjust net80211 we need to either
(1) find a way to get out of the initial scan (or better said to allow a new scan to be started, or as you say the scan to be adjusted) and start the next round of scanning with adjusted parameters. Then a further question will be -- what if multiple supplicants will start fighting over the scan; think wpa_supplicant vs. a user typing ifconfig wlan6 scan (which would also include NOPICK)?
or (2) as adrian had said find the appropriate moment to (unconditionally) report scan results.

Or a mix of user-space and net80211 changes for double stability?

I think the reason might be (though I haven't traced this yet) is that we exploit a race with ic_parent and the 2d time we come through ieee80211_start_locked() though this time with the interface UP. This then calls ieee80211_new_state_locked() which will then kick off another task. If we manage to call the ioctl for scanning in between we'll still be in INIT. Basically the IFF_DOWN kicks us back to INIT, no scan running and then on IFF_UP it's a simple race for the 1st VAP (only).

Do you mean that why the IF down/up can work is because that the ioctl for scanning win the race of queueing vap->iv_nstate_task to ic->ic_tq, so it can run ieee80211_newstate_cb() (has the chance to turn vap from INIT -> SCAN) first and the one call ieee80211_start_locked() (IF up) have no chance to run since it queue the task secondly? If yes, it's like a hack!

I am still contemplating though what caused the problem to appear at some point and when was that point (possibly long before user reports)?
I keep wondering (though I cannot see the change of state yet) if d807e289db96366da9682505d93ab899fe52739b in the hostap.git is the cause as we used to down the interface actively during setup and then during scan made sure it would be up again. But I have a local modification of that (and no ifconfig down/up in the rc script) and I see no difference in the requested IFF state:

Your inference seems right. Before the commit d807e289db96366da9682505d93ab899fe52739b in the hostap.git, by interfering with your statement and tracing some codes, I think wpa_supplicant's scan request may win the race we mentioned above since the scan request happens right behind the IF up. If what I said is true, even the code before commit d807e289db96366da9682505d93ab899fe52739b is not very stable since one firing scanning request have a little possibility to lose to one firing IF up (maybe if ic_parent runs fast).

But then I only do this automatically on boot but not manually. I'll email the diff out to you to see (hopefully tomorrow).

Sure!

I think the larger question is: do we want to adjust net80211 or do we want to do what others are also doing and in case we get an EINPROGRESS back in wpa_supplicant, start a timer and then simply manually query the scan results?

It seems that adding a timer for scan results on a scan request in wpa_supplicant and not modifying net80211(4) is the most harmless way.

If we want to adjust net80211 we need to either
(1) find a way to get out of the initial scan (or better said to allow a new scan to be started, or as you say the scan to be adjusted) and start the next round of scanning with adjusted parameters. Then a further question will be -- what if multiple supplicants will start fighting over the scan; think wpa_supplicant vs. a user typing ifconfig wlan6 scan (which would also include NOPICK)?
Or a mix of user-space and net80211 changes for double stability?

For (1), if we manage to adjust the scan state (flags, ssid, and duration) at the end of the scan, it seems to be safe if we ifconfig wlan0 scan in between a scan firing by wpa_supplicant, since wpa_supplicant will continuously fire scan request, so even if the flags ifconfig passed have no IEEE80211_IOC_SCAN_ONCE in it, it's OK because the latter scan request firing by wpa_supplicant will adjust the scan state after scan end. But I don't think I have configured all the potential problems :)

Or a mix of user-space and net80211 changes for double stability?

Sure! Furthermore, this may help someone who doesn't use wpa_supplicant. (But I don't know does there exist the person)