if_wg: use proper barriers around pkt->p_state
Without appropriate load-synchronization to pair with store barriers in
wg_encrypt() and wg_decrypt(), the compiler and hardware are often
allowed to reorder these loads in wg_deliver_out() and wg_deliver_in()
such that we end up with a garbage or intermediate mbuf that we try to
pass on. The issue is particularly prevalent with the weaker
memory models of !x86 platforms.
Switch from the big-hammer wmb() to more explicit acq/rel atomics to
both make it obvious what we're syncing up with, and to avoid somewhat
hefty fences on platforms that don't necessarily need this.
With this patch, my dual-iperf3 reproducer is dramatically more stable
than it is without on aarch64.
PR: 264115
Reviewed by: andrew, zlei
Approved by: so
Security: FreeBSD-EN-24:06.wireguard
(cherry picked from commit 3705d679a6344c957cae7a1b6372a8bfb8c44f0e)
(cherry picked from commit 806e51f81dbae21feb6e7ddd95d2ed2a28b04f8f)