Background:
A user has an ongoing problem in 13.1 wherein packets occasionally
appear to get "stuck" between the two halves of an epair interface for
hundreds of milliseconds. This trips responsiveness checks in a system
which requires low latency, and tcpdump was used to verify that an
epair is the culprit.
thj@ and dch@ were able to reproduce the problem with a loop that uses
nc(1) to connect to nginx running in a jail, execute a short GET
request, then terminate the connection. It occasionally takes several
hundred milliseconds for the TCP connection to establish. This is on an
otherwise idle 32-core EPYC system; we're nowhere close to saturating
any hardware resources.
A dtrace script which measures the time elapsed between sched:::on-cpu
and sched:::off-cpu probes for the epair task thread shows the following
distribution (the magnitude of the tail is worse on 13.1 than on main):
    value  ------------- Distribution ------------- count
      256 |                                         0
      512 |                                         8586
     1024 |@                                        22289
     2048 |@@                                       74280
     4096 |@@@@@@@@@@@                              427404
     8192 |@@@@@@@@@@@@                             454310
    16384 |@@@@@                                    182130
    32768 |                                         10542
    65536 |                                         16049
   131072 |@                                        29988
   262144 |@@                                       57646
   524288 |@@@@@@                                   226848
  1048576 |                                         43
  2097152 |                                         0
  4194304 |                                         0
  8388608 |                                         1
 16777216 |                                         0
 33554432 |                                         60  <-- waiting for work for over 33ms
 67108864 |                                         1
134217728 |                                         0
Description:
epair_transmit() does little other than to hand an mbuf chain to a
taskqueue thread. Commit 3dd5760aa5f8 ("if_epair: rework") changed the
handoff to use a pair of lockless buf_ring queues. I believe the idea
there is to improve scalability by having producers insert into one
ring while the consumer drains the other. Commit
66acf7685bcd ("if_epair: fix race condition on multi-core systems")
fixed a bug in this scheme which led to lost wakeups, by adding an extra
pair of flags.
I believe that transmitters can fail to wake up the taskqueue thread
even with the bug fix. In particular, it's possible for the queue ridx
to flip twice after a transmitter has set BIT_MBUF_QUEUED and loaded the
current ridx, which I believe can lead to stalled transmission since
epair_tx_start_deferred() only checks one queue at a time. It is also
hard to see whether this scheme is correct on platforms with weakly
ordered memory operations, e.g., arm64.
The transmit path also seems rather expensive: each thread has to
execute at least three atomic instructions per packet.
Rather than complicating the transmit path further, deal with this by
using an mbufq and a mutex. The taskqueue thread can dequeue all pending
packets in an O(1) operation, and a simple state machine lets
transmitters avoid waking up the taskqueue thread more often than
necessary.
This yields a much nicer latency distribution:
  value  ------------- Distribution ------------- count
    256 |                                         0
    512 |                                         4484
   1024 |                                         50911
   2048 |@@                                       204025
   4096 |@@@@@@@@@@                               1351324
   8192 |@@@@@@@@@@@@@                            1711604
  16384 |@@@@@@                                   744126
  32768 |                                         40140
  65536 |                                         51524
 131072 |@                                        82192
 262144 |@                                        153183
 524288 |@@@@@@@                                  896605
1048576 |                                         5
2097152 |                                         0
I did some sanity testing with single-stream iperf3 and saw no
throughput degradation.