Looking into a table corruption while pf_get_addrs() is running
this can be triggering a message like pf_get_astats():
pfr_get_addrs: corruption detected (-226857)
Differential D49214
pf: add two asserts and align runtime debug for pf_get_addrs() franco_opnsense.org on Mon, Mar 3, 12:03 PM. Authored by Tags None Referenced Files
Details
Looking into a table corruption while pf_get_addrs() is running pfr_get_addrs: corruption detected (-226857)
Diff Detail
Event TimelineComment Actions The description of the change doesn't make much sense. What is the race which causes us to copy fewer address than expected? Is it a bug, and here we're extending a band-aid? Or is the race benign, and then should we just remove the print? Comment Actions I'm tracking down what looks to be a memory corruption of some sort: Fatal trap 12: page fault while in kernel mode cpuid = 2; apic id = 02 fault virtual address = 0x28 fault code = supervisor read data, page not present instruction pointer = 0x20:0xffffffff80d052f4 stack pointer = 0x28:0xfffffe00b1080690 frame pointer = 0x28:0xfffffe00b10806b0 code segment = base 0x0, limit 0xfffff, type 0x1b = DPL 0, pres 1, long 1, def32 0, gran 1 processor eflags = interrupt enabled, resume, IOPL = 0 current process = 85006 (pfctl) rdi: fffff8017b23ce60 rsi: fffffe00b10806c8 rdx: 0000000000000000 rcx: fffff8017b23ce60 r8: 0000000000000000 r9: 8080808080808080 rax: 0000000000000000 rbx: fffffe00b10806c8 rbp: fffffe00b10806b0 r10: fffff80173a54000 r11: ffffffffffffffff r12: 0000000000000000 r13: fffffe00c8482000 r14: ffffffff821edeb0 r15: fffff8017b23ce60 trap number = 12 panic: page fault cpuid = 2 time = 1740080523 KDB: stack backtrace: db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00b1080380 vpanic() at vpanic+0x131/frame 0xfffffe00b10804b0 panic() at panic+0x43/frame 0xfffffe00b1080510 trap_fatal() at trap_fatal+0x40b/frame 0xfffffe00b1080570 trap_pfault() at trap_pfault+0x46/frame 0xfffffe00b10805c0 calltrap() at calltrap+0x8/frame 0xfffffe00b10805c0 --- trap 0xc, rip = 0xffffffff80d052f4, rsp = 0xfffffe00b1080690, rbp = 0xfffffe00b10806b0 --- rn_walktree() at rn_walktree+0x54/frame 0xfffffe00b10806b0 pfr_get_addrs() at pfr_get_addrs+0x122/frame 0xfffffe00b1080710 pfioctl() at pfioctl+0x221e/frame 0xfffffe00b1080bf0 devfs_ioctl() at devfs_ioctl+0xcb/frame 0xfffffe00b1080c40 vn_ioctl() at vn_ioctl+0xce/frame 0xfffffe00b1080cb0 devfs_ioctl_f() at devfs_ioctl_f+0x1e/frame 0xfffffe00b1080cd0 kern_ioctl() at kern_ioctl+0x255/frame 0xfffffe00b1080d40 sys_ioctl() at sys_ioctl+0xff/frame 0xfffffe00b1080e00 amd64_syscall() at amd64_syscall+0x100/frame 0xfffffe00b1080f30 fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe00b1080f30 --- syscall (54, FreeBSD ELF64, ioctl), rip = 0x14d360b785fa, rsp = 0x14d35c65b028, rbp = 0x14d35c65b4c0 --- KDB: enter: panic Systems exhibiting this problem crash frequently (up to multiple times a day). Sometimes they happen in random locations in the vm subsystem, but this particular issue sticks out and reoccurs at a stable frequently. All vmcores we gathered are defunct so we have to resort to printfs and check locking paths. We already bumped the pfr_get_addrs() path to a write lock and tried to debug radix.c a bit, but it's crashing without any obvious failure which means this is coming from somewhere else. This counter assert was the first evidence we got after changing this to a runtime value. Apparently this line is what causes the panic returning to the stored "next" pointer after traversing the subtree: Comment Actions Have you tried testing a GENERIC-KASAN kernel? I suspect that bugzilla is a better place to continue debugging this, as the root cause is as-yet unknown. Please cc me and we can continue debugging. Comment Actions Not yet but I can. The fact of the matter is that this only happens for less than 1‰ of installations, but these can reproduce it. Not all the time but sometimes within minutes suggesting some form of traffic being involved here.
I removed the asserts from this commit. It would still be beneficial to align the error handling. The returned result is garbled anyway. I'll add a bug report later. Comment Actions In my local testing KASAN allows me to push 100 mbit max, which is going to be a challenge asking the user to deploy this in production and to trigger the bug which only happens under traffic. I can try, but not optimistic. Comment Actions Something is filling the table while we read it with the lock held in pfr_get_addrs() is what this situation tells me, which points to a missing lock somewhere else, maybe introduced as far back as https://cgit.freebsd.org/src/commit/?id=890612bbeb69f Other ideas to approach this? |