pf: do a lock dance in pf_unlink_state()
Changes PlannedPublic
Actions

Authored by franco_opnsense.org on Fri, Mar 21, 10:24 AM.

Details

Reviewers

kp
vegeta_tuxpowered.net
markj

Summary

Both pf_test() and pf_test6() can end up in a panic while
executing PF_UNLOCK_STATE which points to the state being
removed while it is in use.

The PF_LOCK_STATE in the removal subroutine makes sure
that pf_test/pf_test6 are no longer holding the state
and we can safely test and set PFTM_UNLINK.

The other bits of the OpenBSD commit could apply as well
but for now make sure that this particular panic comes to
and end.

Based on: https://github.com/openbsd/src/commit/9d9f4dc6c83

Diff Detail

Repository

rG FreeBSD src repository

Lint

Lint Passed

Unit

No Test Coverage

Build Status

Buildable 63020
Build 59904: arc lint + arc unit

Event Timeline

franco_opnsense.org created this revision.Fri, Mar 21, 10:24 AM

Herald added subscribers: vegeta_tuxpowered.net, glebius, melifaro and 2 others. · View Herald TranscriptFri, Mar 21, 10:24 AM

franco_opnsense.org requested review of this revision.Fri, Mar 21, 10:24 AM

Harbormaster completed remote builds in B63020: Diff 152491.Fri, Mar 21, 10:24 AM

franco_opnsense.org added reviewers: kp, vegeta_tuxpowered.net, markj.Fri, Mar 21, 10:24 AM

glebius added inline comments.Fri, Mar 21, 10:43 AM

sys/netpfil/pf/pf.c
2733	Can you please explain how does this work? The above assertion assumes that the lock is already held. This lock also cannot recurse. I bet if you apply this patch to FreeBSD, it would compile, but the function would panic on first execution for the attempt to recursively lock non-recursive mutex. You can't just take revisions from OpenBSD and make them compilable on FreeBSD. OpenBSD decided to move pf out of kernel Giant lock 12 years after FreeBSD and they are doing it in a different way.

This comment has been deleted.

sys/netpfil/pf/pf.c

2733

Can you please explain how does this work? The above assertion assumes that the lock is already held. This lock also cannot recurse. I bet if you apply this patch to FreeBSD, it would compile, but the function would panic on first execution for the attempt to recursively lock non-recursive mutex.

The state is currently locked by pf_test() or pf_test6(). Once the code there hits PF_STATE_UNLOCK(s) it panics. I have hundreds of panics that attest to that spanning over many years:

db:0:kdb.enter.default>  bt
Tracing pid 0 tid 100007 td 0xfffff8000356a740
kdb_enter() at kdb_enter+0x33/frame 0xfffffe000857a700
panic() at panic+0x43/frame 0xfffffe000857a760
trap_fatal() at trap_fatal+0x40b/frame 0xfffffe000857a7c0
trap_pfault() at trap_pfault+0x46/frame 0xfffffe000857a810
calltrap() at calltrap+0x8/frame 0xfffffe000857a810
--- trap 0xc, rip = 0xffffffff80c27140, rsp = 0xfffffe000857a8e0, rbp = 0xfffffe000857a8f0 ---
turnstile_broadcast() at turnstile_broadcast+0x40/frame 0xfffffe000857a8f0
__mtx_unlock_sleep() at __mtx_unlock_sleep+0x73/frame 0xfffffe000857a920
pf_test() at pf_test+0xcf5/frame 0xfffffe000857aad0
pf_check_in() at pf_check_in+0x27/frame 0xfffffe000857aaf0
pfil_mbuf_in() at pfil_mbuf_in+0x38/frame 0xfffffe000857ab20
ip_input() at ip_input+0x5d5/frame 0xfffffe000857ab80
netisr_dispatch_src() at netisr_dispatch_src+0x9e/frame 0xfffffe000857abd0
ether_demux() at ether_demux+0x149/frame 0xfffffe000857ac00
ether_nh_input() at ether_nh_input+0x36a/frame 0xfffffe000857ac60
netisr_dispatch_src() at netisr_dispatch_src+0x9e/frame 0xfffffe000857acb0
ether_input() at ether_input+0x56/frame 0xfffffe000857ad00
iflib_rxeof() at iflib_rxeof+0xc1e/frame 0xfffffe000857ae00
_task_fn_rx() at _task_fn_rx+0x72/frame 0xfffffe000857ae40
gtaskqueue_run_locked() at gtaskqueue_run_locked+0x14e/frame 0xfffffe000857aec0
gtaskqueue_thread_loop() at gtaskqueue_thread_loop+0xc2/frame 0xfffffe000857aef0
fork_exit() at fork_exit+0x7f/frame 0xfffffe000857af30
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe000857af30
--- trap 0x55a72375, rip = 0x9406d6bd95b4e553, rsp = 0x5fe874e54796690f, rbp = 0x9ce39b38a5c54c71 ---

You can't just take revisions from OpenBSD and make them compilable on FreeBSD. OpenBSD decided to move pf out of kernel Giant lock 12 years after FreeBSD and they are doing it in a different way.

I can and I will just like others here are doing it daily it seems. And I've tested this with a customer who had constant panics before deploying this change. :)

Can you please explain how does this work? The above assertion assumes that the lock is already held. This lock also cannot recurse. I bet if you apply this patch to FreeBSD, it would compile, but the function would panic on first execution for the attempt to recursively lock non-recursive mutex.

The state is currently locked by pf_test() or pf_test6(). Once the code there hits PF_STATE_UNLOCK(s) it panics. I have hundreds of panics that attest to that spanning over many years:

Where exactly does it hit PF_STATE_UNLOCK(), what line of pf.c at what FreeBSD version?

__mtx_unlock_sleep() at __mtx_unlock_sleep+0x73/frame 0xfffffe000857a920
pf_test() at pf_test+0xcf5/frame 0xfffffe000857aad0

If you say 'list *pf_test+0xcf5' in debugger that may help to reveal the line.

Do we have anything in FreeBSD bugzilla for those hundreds panics?

You can't just take revisions from OpenBSD and make them compilable on FreeBSD. OpenBSD decided to move pf out of kernel Giant lock 12 years after FreeBSD and they are doing it in a different way.

I can and I will just like others here are doing it daily it seems. And I've tested this with a customer who had constant panics before deploying this change. :)

Sorry, "you can't" is an English phraseologism that means "it is incorrect to" or "it is not allowed to". I also hate English, and this phraseologism in particular. So literally you of course can do that.

Back to the topic. Given that you refer to customers, I am guessing you are also talking about OPNsense, and not about FreeBSD. So indeed, with your codebase you may have a lost lock, that later ends with a panic on PF_STATE_UNLOCK() somewhere. It also is entirely possible that this patch applied to OPNsense has much more sense than applied to FreeBSD.

But, please let's get back to the topic of the suggested code changes. Rephrasing my initial question: do you understand what PF_HASHROW_ASSERT(ih) at line 2731 does? Please checkout the macro implementation. Same question on PF_STATE_LOCK(), check its implementation especially the INVARIANTS version. After reading those two macro implementations, can you please explain why your patch won't bring recursion on the mutex?

vegeta_tuxpowered.net added inline comments.Fri, Mar 21, 11:14 AM

sys/netpfil/pf/pf.c
2733	PF_HASHROW_ASSERT(ih); + PF_STATE_LOCK(s); You're locking a state which supposedly has been already locked and this lock has been asserted. Have you tested this patch on a kernel with INVARIANTS?

To my understanding PF_HASHROW_LOCK() does not lock the state itself. The state is currently in use and locked by another thread which it acquired by pf_find_state() which basically does:

PF_STATE_LOCK(s);
PF_HASHROW_UNLOCK(kh);
return (s);

I'm happy about the interest here but we have to keep the code reality in mind.

In D49433#1127321, @franco_opnsense.org wrote:

To my understanding PF_HASHROW_LOCK() does not lock the state itself.

In pf_unlink_state() we are asserting lock held on ID hash row, where the state is. PF_HASHROW_LOCK() on the ID hash locks all states in the hash slot.

The state is currently in use and locked by another thread which it acquired by pf_find_state() which basically does:
PF_STATE_LOCK(s);
PF_HASHROW_UNLOCK(kh);
return (s);

The pf_find_state() uses key hash and then interlocks into the ID hash.

On a fully initialized state these two operations would lock the same lock:

PF_STATE_LOCK(s)
PF_HASHROW_LOCK(&V_pf_idhash[PF_IDHASH(s)])

I don't care about INVARIANTS and we don't use it for reasons of not being reliable.

If you want a fix that cares about ifndef INVARIANTS I can do that but your review has to be much more precise and guiding.

FWIW, OPNsense is just a widely used FreeBSD. We have to stop asserting "we" are doing "things" to the code base that don't exist. We're putting a lot of money and effort int getting stable/14 to a point where it works much more reliable than currently, but situations like this don't really help this progress.

In D49433#1127325, @franco_opnsense.org wrote:

I don't care about INVARIANTS and we don't use it for reasons of not being reliable.

If you want a fix that cares about ifndef INVARIANTS I can do that but your review has to be much more precise and guiding.

INVARIANTS is FreeBSD slang for assertions. It is a crucial part of the development process. Without assertions we won't probably be able to maintain the complex program of the FreeBSD kernel. You should care about assertions, Franco. Maybe not testing your changes with INVARIANTS is the source of hundreds of panics?

Anyway, above I mentioned INVARIANTS version of PF_STATE_LOCK() macro to help you understand the locking of pf in FreeBSD. The macro implementation explains a lot. I didn't mean that you need to fix something wrt INVARIANTS.

FWIW, OPNsense is just a widely used FreeBSD. We have to stop asserting "we" are doing "things" to the code base that don't exist. We're putting a lot of money and effort int getting stable/14 to a point where it works much more reliable than currently, but situations like this don't really help this progress.

Can you please point me at repo of OPNsense sources? It is actually your job to track differences between your tree and FreeBSD, but today I have some time to help.

Sorry, I'm just confused: is INVARIANTS in GENERIC? I'm dealing with users and real world issues here and this definitely helps with the situation. If you agree the patch works for non-INVARIANTS build than that is a good step to address the code for INVARIANTS, maybe just hiding the locks in that case if we're sure this is air-tight.

The code is at https://github.com/opnsense/core/tree/stable/25.1 and I'm trying to keep the network stack aligned with stable/14, but my problem is a partial backport strategy and slow turnaround and low interest on reports, also for https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=280701#c95 which fixes what I've previously reported. We see lots of issues with pf in particular with some fixed on main, some not.

In D49433#1127325, @franco_opnsense.org wrote:

I don't care about INVARIANTS and we don't use it for reasons of not being reliable.

INVARIANTS confirms that this patch is wrong with about 30 seconds of effort on my part, i.e., just running pf tests. You should care about it, it will save you lots of time.

panic: _mtx_lock_sleep: recursed on non-recursive mutex pf_idhash @ /home/markj/sb/main/src/sys/netpfil/pf/pf.c:2733

cpuid = 3
time = 1742562709
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0xa5/frame 0xfffffe009ab4a330
kdb_backtrace() at kdb_backtrace+0xc6/frame 0xfffffe009ab4a490
vpanic() at vpanic+0x214/frame 0xfffffe009ab4a630
panic() at panic+0xb5/frame 0xfffffe009ab4a6f0
__mtx_lock_sleep() at __mtx_lock_sleep+0x7fb/frame 0xfffffe009ab4a810
__mtx_lock_flags() at __mtx_lock_flags+0x1bc/frame 0xfffffe009ab4a8e0
pf_unlink_state() at pf_unlink_state+0x195/frame 0xfffffe009ab4a950
pf_clear_all_states() at pf_clear_all_states+0x1a4/frame 0xfffffe009ab4aa30
vnet_pf_uninit() at vnet_pf_uninit+0x1472/frame 0xfffffe009ab4ab40
vnet_destroy() at vnet_destroy+0x233/frame 0xfffffe009ab4ab90
prison_deref() at prison_deref+0x1180/frame 0xfffffe009ab4acf0
sys_jail_remove() at sys_jail_remove+0x9b/frame 0xfffffe009ab4ad10
amd64_syscall() at amd64_syscall+0x3ae/frame 0xfffffe009ab4af30
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe009ab4af30
--- syscall (508, FreeBSD ELF64, jail_remove), rip = 0x1082536f6eda, rsp = 0x108250542a18, rbp = 0x108250542a90 ---

If you want a fix that cares about ifndef INVARIANTS I can do that but your review has to be much more precise and guiding.

Just look at what PF_STATE_LOCK and PF_HASHROW_ASSERT do. The patch can't be right. So if the patch is somehow fixing a race condition, then there is something deeper to be understood.

FWIW, OPNsense is just a widely used FreeBSD. We have to stop asserting "we" are doing "things" to the code base that don't exist. We're putting a lot of money and effort int getting stable/14 to a point where it works much more reliable than currently, but situations like this don't really help this progress.

Please consider how you might better help this progress. There are multiple people including myself (who's nominally on vacation) who are trying to help you for a grand total of $0.

The stack trace you pasted hints that we are unlocking a mutex we don't own. INVARIANTS makes bugs like that easier to track down. Looking through pf_unlink_state() callers, I don't see any obvious problems so far.

In D49433#1127327, @franco_opnsense.org wrote:

Sorry, I'm just confused: is INVARIANTS in GENERIC?

On main yes, on stable branches no.

I'm dealing with users and real world issues here and this definitely helps with the situation. If you agree the patch works for non-INVARIANTS build than that is a good step to address the code for INVARIANTS, maybe just hiding the locks in that case if we're sure this is air-tight.

The patch does not look right. OpenBSD uses a different locking model, and cannot be used as a source here. To understand the underlying cause, a panic report from an INVARIANTS kernel, including the actual panic string, and/or line numbers from a stack trace given by kgdb would be the first thing to look at.

I'm wondering if https://cgit.freebsd.org/src/commit/?id=38cc0bfa261 plays a role here. The move of s->timer assignment is reversed here. I'm happy to fix the locking issue in a next patch iteration. The fact remains that in an edge case the PF_STATE_UNLOCK() fails on an already freed state and this sidesteps the issue.

franco_opnsense.org planned changes to this revision.Fri, Mar 21, 7:59 PM