When the code triggering the panic has not yet entered the network epoch, dumping the kernel will fail, since the code assumes that the network epoch is entered when performing software LRO. Therefore disable software LRO during dumping.
Details
Use sudo sysctl debug.kdb.panic=1 to trigger a panic, then run the dump command at the db> prompt to save a core to a remote server.
Diff Detail
- Repository: rG FreeBSD src repository
- Lint: Lint Skipped
- Unit: Tests Skipped
Event Timeline
db> dump
debugnet: overwriting mbuf zone pointers
debugnet_connect: searching for gateway MAC...
panic: Assertion in_epoch(net_epoch_preempt) failed at /root/freebsd-src/sys/netinet/tcp_lro.c:1502
cpuid = 0
time = 1649045641
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0xc7/frame 0xfffffe00897da010
kdb_backtrace() at kdb_backtrace+0xd3/frame 0xfffffe00897da170
vpanic() at vpanic+0x2b8/frame 0xfffffe00897da250
panic() at panic+0xb5/frame 0xfffffe00897da310
tcp_lro_flush_all() at tcp_lro_flush_all+0x48f/frame 0xfffffe00897da390
vtnet_rxq_eof() at vtnet_rxq_eof+0x17c0/frame 0xfffffe00897da570
vtnet_debugnet_poll() at vtnet_debugnet_poll+0xa3/frame 0xfffffe00897da5b0
debugnet_arp_gw() at debugnet_arp_gw+0x53c/frame 0xfffffe00897da6f0
debugnet_connect() at debugnet_connect+0x904/frame 0xfffffe00897da870
netdump_start() at netdump_start+0x2c5/frame 0xfffffe00897da9b0
dump_start() at dump_start+0x2ac/frame 0xfffffe00897dab30
cpu_minidumpsys() at cpu_minidumpsys+0x10ff/frame 0xfffffe00897dacb0
dumpsys_generic() at dumpsys_generic+0x160/frame 0xfffffe00897daea0
doadump() at doadump+0xe8/frame 0xfffffe00897daed0
db_dump() at db_dump+0x4a/frame 0xfffffe00897daef0
db_command() at db_command+0x441/frame 0xfffffe00897db090
db_command_loop() at db_command_loop+0x82/frame 0xfffffe00897db0b0
db_trap() at db_trap+0x27f/frame 0xfffffe00897db1f0
kdb_trap() at kdb_trap+0x2c3/frame 0xfffffe00897db2f0
trap() at trap+0x506/frame 0xfffffe00897db4e0
calltrap() at calltrap+0x8/frame 0xfffffe00897db4e0
--- trap 0x3, rip = 0xffffffff817737db, rsp = 0xfffffe00897db5b0, rbp = 0xfffffe00897db5d0 ---
kdb_enter() at kdb_enter+0x6b/frame 0xfffffe00897db5d0
vpanic() at vpanic+0x324/frame 0xfffffe00897db6b0
panic() at panic+0xb5/frame 0xfffffe00897db770
__rw_wlock_hard() at __rw_wlock_hard+0x1179/frame 0xfffffe00897db8d0
_rw_wlock_cookie() at _rw_wlock_cookie+0x1d7/frame 0xfffffe00897db9a0
cc_deregister_algo() at cc_deregister_algo+0x2e/frame 0xfffffe00897db9e0
cc_modevent() at cc_modevent+0x16e/frame 0xfffffe00897dba10
module_unload() at module_unload+0x4e/frame 0xfffffe00897dba30
linker_file_unload() at linker_file_unload+0x46b/frame 0xfffffe00897dbb40
kern_kldunload() at kern_kldunload+0x340/frame 0xfffffe00897dbb90
vfs_byname_kld() at vfs_byname_kld+0x151/frame 0xfffffe00897dbc50
sys_mount() at sys_mount+0x1de/frame 0xfffffe00897dbd30
amd64_syscall() at amd64_syscall+0x40c/frame 0xfffffe00897dbf30
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe00897dbf30
--- syscall (198, FreeBSD ELF64, nosys), rip = 0x2ad12a, rsp = 0x825db4f08, rbp = 0x825db4f80 ---
Uptime: 1m20s
The problem only shows up if the panic happens without being in the network epoch. For panics where you are already in the network epoch, the dump works.
I suspect the right solution is to somehow ensure that LRO is not in use when dumping, either by checking dumping in vtnet_software_lro() or (probably better) by clearing the software LRO flag in the vtnet softc in vtnet_debugnet_event().
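The second option can be illustrated with a small userland model. This is only a sketch of the idea, not the driver change itself: every name prefixed with mock_ or MOCK_ is a hypothetical stand-in, the in_net_epoch field merely models in_epoch(net_epoch_preempt), and the real if_vtnet.c structures and flag names may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins; these are NOT the real if_vtnet.c definitions. */
#define MOCK_FLAG_SW_LRO 0x0001u

enum mock_debugnet_ev { MOCK_DEBUGNET_START, MOCK_DEBUGNET_END };

struct mock_softc {
	uint32_t flags;        /* feature flags, including software LRO */
	bool     in_net_epoch; /* models in_epoch(net_epoch_preempt) */
	unsigned delivered;    /* packets handed to the input path */
	unsigned lro_flushes;  /* times the LRO flush path ran */
};

/* Models vtnet_software_lro(): is software LRO currently enabled? */
static bool
mock_software_lro(const struct mock_softc *sc)
{
	return ((sc->flags & MOCK_FLAG_SW_LRO) != 0);
}

/* Models the debugnet event hook: turn software LRO off when a dump starts. */
static void
mock_debugnet_event(struct mock_softc *sc, enum mock_debugnet_ev ev)
{
	if (ev == MOCK_DEBUGNET_START)
		sc->flags &= ~MOCK_FLAG_SW_LRO;
}

/* Models the receive path: the LRO flush asserts the network epoch. */
static void
mock_rxq_eof(struct mock_softc *sc)
{
	sc->delivered++;
	if (mock_software_lro(sc)) {
		/* This is the assertion that fires in the backtrace above. */
		assert(sc->in_net_epoch);
		sc->lro_flushes++;
	}
}
```

In this model, calling mock_debugnet_event() with MOCK_DEBUGNET_START before polling means mock_rxq_eof() never reaches the epoch assertion, even when in_net_epoch is false. Without that call, the same poll would trip the assertion, which mirrors the failure reported here.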
BTW, I'm a little confused by vtnet_rxq_input(): doesn't it pass all input packets to vtnet_lro_rx(), not just TCP packets?
Hmm. The proposed fix is similar to what is done in iflib.c.
> BTW, I'm a little confused by vtnet_rxq_input(): doesn't it pass all input packets to vtnet_lro_rx(), not just TCP packets?
Not sure. Would need to look at the code...
Well, the commit that added that just slapped net_epoch sections around all iflib_rxeof() calls. In the debugnet case it doesn't make sense, since we won't call ether_input() when dumping: debugnet swaps out the if_input pointer.
We talked about this at the FreeBSD transport call. glebius@ suggested following markj@'s suggestion to disable TCP LRO when dumping the kernel. tuexen@ will look at it.
To sum up what we discussed on the call:
- If we really want to enter the epoch for network dumping, that should be done in the netdumper code, not copy-pasted into every driver.
- Another option is to disable assertions when we are in the dumper.
- Disabling LRO is a good idea. In general for a dumper we want to execute as little code as possible and prefer simple code over high performance.
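The first point, entering the epoch once in the dumper rather than in each driver, can be sketched in userland as follows. All mock_ names are hypothetical and the global flag only models the epoch state; in the kernel the corresponding primitives are NET_EPOCH_ENTER() and NET_EPOCH_EXIT() with a struct epoch_tracker, and where exactly the netdump code would place them is an assumption not settled in this discussion.

```c
#include <assert.h>
#include <stdbool.h>

/* Models whether the current context is inside the network epoch. */
static bool mock_in_net_epoch;

/* Hypothetical driver poll callback type. */
typedef void (*mock_poll_fn)(void *);

static void mock_epoch_enter(void) { mock_in_net_epoch = true; }
static void mock_epoch_exit(void)  { mock_in_net_epoch = false; }

/*
 * A driver poll routine that, like the LRO path, assumes the caller
 * already holds the network epoch. It contains no epoch handling itself.
 */
static void
mock_driver_poll(void *arg)
{
	unsigned *polls = arg;

	assert(mock_in_net_epoch);
	(*polls)++;
}

/*
 * The dumper wraps every driver poll in a single epoch section, so no
 * driver needs its own copy of the enter/exit pair.
 */
static void
mock_dumper_poll(mock_poll_fn poll, void *arg)
{
	mock_epoch_enter();
	poll(arg);
	mock_epoch_exit();
}
```

The trade-off noted in the list still applies: this keeps drivers simple, but disabling LRO avoids running the extra code at all, which is the preference stated for a dumper.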
| sys/dev/virtio/network/if_vtnet.c | |
|---|---|
| 4414 | I think this comment is too narrow: the real reason to disable LRO is that we simply don't want to use features not strictly required for netdump's functionality. |
| sys/dev/virtio/network/if_vtnet.c | |
|---|---|
| 4414 | But if we write that we want to disable all features not strictly required for dumping, shouldn't the code then also disable TSO, checksum offloading and possibly more? |
| sys/dev/virtio/network/if_vtnet.c | |
|---|---|
| 4414 | Yes, in general features that we don't strictly need should be off when possible. At this point the system has panicked, so we also want to avoid reconfiguration operations which involve executing lots of driver code, so there's a tradeoff. To be clear, I'm ok with the change. |