
epair: Simplify the transmit path and reduce tail latency
ClosedPublic

Authored by markj on Mar 1 2023, 11:55 PM.

Details

Summary

Background:
A user has an ongoing problem in 13.1 wherein packets occasionally
appear to get "stuck" between two epair interfaces for hundreds of
milliseconds. This trips responsiveness checks in a system that
requires low latency, and tcpdump was used to verify that an epair is
the culprit.

thj@ and dch@ were able to reproduce the problem with a loop that uses
nc(1) to connect to nginx running in a jail, issue a short GET
request, and then terminate the connection. It occasionally takes several
hundred milliseconds for the TCP connection to be established. This is on an
otherwise idle 32-core EPYC system; we're nowhere close to saturating
any hardware resources.

A dtrace script which measures the time elapsed between sched:::on-cpu
and sched:::off-cpu probes for the epair task thread shows the following
distribution (the magnitude of the tail is worse on 13.1 than on main):

    value  ------------- Distribution ------------- count
      256 |                                         0
      512 |                                         8586
     1024 |@                                        22289
     2048 |@@                                       74280
     4096 |@@@@@@@@@@@                              427404
     8192 |@@@@@@@@@@@@                             454310
    16384 |@@@@@                                    182130
    32768 |                                         10542
    65536 |                                         16049
   131072 |@                                        29988
   262144 |@@                                       57646
   524288 |@@@@@@                                   226848
  1048576 |                                         43
  2097152 |                                         0
  4194304 |                                         0
  8388608 |                                         1
 16777216 |                                         0
 33554432 |                                         60 <-- waiting for work for over 33ms
 67108864 |                                         1
134217728 |                                         0

Description:
epair_transmit() does little more than hand an mbuf chain to a
taskqueue thread. Commit 3dd5760aa5f8 ("if_epair: rework") changed the
handoff to use a pair of lockless buf_ring queues. I believe the idea
there is to try to improve scalability by having producers insert into
one ring while the consumer works on the other. Commit
66acf7685bcd ("if_epair: fix race condition on multi-core systems")
fixed a bug in this scheme which led to lost wakeups by adding an extra
pair of flags.

I believe that transmitters can fail to wake up the taskqueue thread
even with the bug fix. In particular, it's possible for the queue ridx
to flip twice after a transmitter has set BIT_MBUF_QUEUED and loaded the
current ridx, which I believe can lead to stalled transmission since
epair_tx_start_deferred() only checks one queue at a time. It is also
hard to see whether this scheme is correct on platforms with weakly
ordered memory operations, i.e., on arm64.

The transmit path also seems rather expensive: each thread has to
execute at least three atomic instructions per packet.

Rather than complicating the transmit path further, deal with this by
using an mbufq and a mutex. The taskqueue thread can dequeue all pending
packets in an O(1) operation, and a simple state machine lets
transmitters avoid waking up the taskqueue thread more often than
necessary.
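
In rough outline, the transmit-side handoff looks something like the
sketch below. This is not the committed code; the structure layout and
the names used here (epair_queue, EPAIR_STATE_*, epair_enqueue(),
EPAIR_QUEUE_LIMIT) are illustrative assumptions.

#include <sys/param.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/mbuf.h>
#include <sys/taskqueue.h>

#define	EPAIR_QUEUE_LIMIT	512	/* arbitrary queue depth for the sketch */

/* Consumer wakeup states. */
#define	EPAIR_STATE_IDLE	0	/* task is idle, a wakeup is needed */
#define	EPAIR_STATE_WAKING	1	/* a wakeup has already been posted */
#define	EPAIR_STATE_RUNNING	2	/* task is currently draining the queue */

struct epair_queue {
	struct mtx		 q_mtx;
	struct mbufq		 q_pending;	/* packets waiting for the task */
	int			 q_state;
	struct task		 q_task;
	struct taskqueue	*q_tq;
};

/* One-time setup of the hypothetical queue structure. */
static void
epair_queue_init(struct epair_queue *q, struct taskqueue *tq, task_fn_t fn)
{
	mtx_init(&q->q_mtx, "epair queue", NULL, MTX_DEF);
	mbufq_init(&q->q_pending, EPAIR_QUEUE_LIMIT);
	q->q_state = EPAIR_STATE_IDLE;
	q->q_tq = tq;
	TASK_INIT(&q->q_task, 0, fn, q);
}

/*
 * Transmit-side handoff: append the packet under the mutex and wake the
 * taskqueue thread only when it is known to be idle, so that a stream of
 * transmitters posts at most one wakeup per idle period.
 */
static int
epair_enqueue(struct epair_queue *q, struct mbuf *m)
{
	bool wake;
	int error;

	mtx_lock(&q->q_mtx);
	error = mbufq_enqueue(&q->q_pending, m);
	wake = (error == 0 && q->q_state == EPAIR_STATE_IDLE);
	if (wake)
		q->q_state = EPAIR_STATE_WAKING;
	mtx_unlock(&q->q_mtx);

	if (error != 0)
		m_freem(m);
	else if (wake)
		taskqueue_enqueue(q->q_tq, &q->q_task);
	return (error);
}

The point of the WAKING state is that once one transmitter has posted a
wakeup, later transmitters only need to enqueue their packet; the
consumer resets the state after it has drained the queue.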

This yields a much nicer latency distribution:

  value  ------------- Distribution ------------- count
    256 |                                         0
    512 |                                         4484
   1024 |                                         50911
   2048 |@@                                       204025
   4096 |@@@@@@@@@@                               1351324
   8192 |@@@@@@@@@@@@@                            1711604
  16384 |@@@@@@                                   744126
  32768 |                                         40140
  65536 |                                         51524
 131072 |@                                        82192
 262144 |@                                        153183
 524288 |@@@@@@@                                  896605
1048576 |                                         5
2097152 |                                         0

I did some sanity testing with single-stream iperf3 and don't see any
throughput degradation.

Diff Detail

Repository
rG FreeBSD src repository

Event Timeline

markj requested review of this revision. Mar 1 2023, 11:55 PM

The intent was also to avoid locking between the sender and receiver side.
Re-running my test setup shows 889.672 Kpps without this change, and with it I see 42.635 Kpps. That's kind of a steep hit.

In D38843#884393, @kp wrote:
> The intent was also to avoid locking between the sender and receiver side.
> Re-running my test setup shows 889.672 Kpps without this change, and with it I see 42.635 Kpps. That's kind of a steep hit.

Wow, indeed. Could you please describe your test setup?

> In D38843#884393, @kp wrote:
>> The intent was also to avoid locking between the sender and receiver side.
>> Re-running my test setup shows 889.672 Kpps without this change, and with it I see 42.635 Kpps. That's kind of a steep hit.
>
> Wow, indeed. Could you please describe your test setup?

It's a slightly convoluted setup, to accommodate my limited hardware. Essentially, two machines connected with two 100G interfaces (Chelsio T62100-LP-CR). The sending machine uses netmap to throw 33Mpps of traffic in (in multiple flows, which is why I think you didn't see the problem). The other machine bridges cc0 to epair0, which sits in a vnet jail, and routes to another epair, which is bridged to cc1. So we also have multiple epair passes (as well as bridging and routing).

The traffic generator/sink does this:

#!/bin/sh

set -e
set -x

ifconfig vcc0 192.168.0.2/24 up
ifconfig vcc1 192.168.1.2/24 up

/home/kp/bin/pkt-gen -i vcc1 -f rx -w 4 -c 20 &
/home/kp/bin/pkt-gen -f tx -i vcc0 -l 100 \
    -p 2 \
    -d 192.168.1.2:50000-192.168.1.250:60000 \
    -D 02:95:fd:1a:50:0b \
    -S 00:07:43:57:a0:21 \
    -s 192.168.0.2-192.168.0.2 \
    -w 2 -c 20

and the other one does:

#!/bin/sh

bridgeA=$(ifconfig bridge create)
bridgeB=$(ifconfig bridge create)
epair_0=$(ifconfig epair create)
epair_0=${epair_0%a}
epair_1=$(ifconfig epair create)
epair_1=${epair_1%a}

jail -c name=test vnet persist vnet.interface=${epair_0}b vnet.interface=${epair_1}a
jexec test ifconfig ${epair_0}b 192.168.0.1/24 up
jexec test ifconfig ${epair_1}a 192.168.1.1/24 up
jexec test sysctl net.inet.ip.forwarding=1

ifconfig ${bridgeA} addm cc0
ifconfig ${bridgeA} addm ${epair_0}a

ifconfig cc0 up
ifconfig ${epair_0}a up
ifconfig ${bridgeA} up

ifconfig ${bridgeB} addm cc1
ifconfig ${bridgeB} addm ${epair_1}b

ifconfig cc1 up
ifconfig ${epair_1}b up
ifconfig ${bridgeB} up

# ARP
for i in `seq 2 250`
do
	jexec test arp -s 192.168.1.${i} 00:07:43:57:a0:29
done

echo "Don't forget to set the destination MAC!"
markj planned changes to this revision. Mar 2 2023, 1:18 PM
In D38843#884507, @kp wrote:
>> In D38843#884393, @kp wrote:
>>> The intent was also to avoid locking between the sender and receiver side.
>>> Re-running my test setup shows 889.672 Kpps without this change, and with it I see 42.635 Kpps. That's kind of a steep hit.
>>
>> Wow, indeed. Could you please describe your test setup?
>
> It's a slightly convoluted setup, to accommodate my limited hardware. Essentially, two machines connected with two 100G interfaces (Chelsio T62100-LP-CR). The sending machine uses netmap to throw 33Mpps of traffic in (in multiple flows, which is why I think you didn't see the problem). The other machine bridges cc0 to epair0, which sits in a vnet jail, and routes to another epair, which is bridged to cc1. So we also have multiple epair passes (as well as bridging and routing).
>
> The traffic generator/sink does this:

Thanks! I'll try setting this up. Was it a plain GENERIC-NODEBUG config, or were you testing with options RSS enabled?

> Thanks! I'll try setting this up. Was it a plain GENERIC-NODEBUG config, or were you testing with options RSS enabled?

That was a plain GENERIC-NODEBUG.

In D38843#884393, @kp wrote:
> The intent was also to avoid locking between the sender and receiver side.
> Re-running my test setup shows 889.672 Kpps without this change, and with it I see 42.635 Kpps. That's kind of a steep hit.

I set up a similar test using two ixgbe interfaces and see a throughput hit, but not as drastic: something like 900Kpps without the change, and 500Kpps with. (The source is transmitting roughly 3Mpps.) However, in my setup the sink is a NUMA system with both interfaces in one domain, and the throughput varies wildly depending on whether the epair task thread is scheduled in the same domain as the ix interfaces. If I pin it to the same domain, I get the numbers above; if I pin it to the "wrong" domain, throughput is significantly worse with or without the patch, but with the patch the penalty is more drastic. I speculate that that's due to bullying, since plain mutexes aren't fair and don't have any NUMA awareness.

So, the patch isn't committable as-is and I'll work on fixing it. But I wonder if your test system also has multiple NUMA domains, and if so, what numbers you get when the epair thread is pinned to the same domain as the receiving interface (e.g., with cpuset -l <cpu range> -t <tid of epair thread>).

> In D38843#884393, @kp wrote:
>> The intent was also to avoid locking between the sender and receiver side.
>> Re-running my test setup shows 889.672 Kpps without this change, and with it I see 42.635 Kpps. That's kind of a steep hit.
>
> I set up a similar test using two ixgbe interfaces and see a throughput hit, but not as drastic: something like 900Kpps without the change, and 500Kpps with. (The source is transmitting roughly 3Mpps.) However, in my setup the sink is a NUMA system with both interfaces in one domain, and the throughput varies wildly depending on whether the epair task thread is scheduled in the same domain as the ix interfaces. If I pin it to the same domain, I get the numbers above; if I pin it to the "wrong" domain, throughput is significantly worse with or without the patch, but with the patch the penalty is more drastic. I speculate that that's due to bullying, since plain mutexes aren't fair and don't have any NUMA awareness.

Oops. I think I see the main reason for the slowdown now.

In the first revision of the patch, epair_tx_start_deferred() would run until the queue is observed to be empty. If producers are busy, this can take a long time. In your benchmark setup there are two epairs, both serviced by the same taskqueue thread. When routing between the epairs, the taskqueue thread is just taking packets out of one epair queue and putting them in another, then signalling itself to process the second epair queue. So with this patch the taskqueue thread can easily starve itself and end up dropping most of the packets.

If I fix that, things look much better, and I get a throughput improvement relative to main: 900Kpps to 980Kpps without RSS, and 1.25Mpps to 1.35Mpps with RSS. In the non-RSS case, if I pin the epair thread to the wrong domain, I get roughly the same throughput with or without the patch, 500Kpps.

Only flush the epair queue once per task. Otherwise a busy producer can cause
all other epair queue consumers to starve.
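
Concretely, the revised consumer might look like the sketch below
(again illustrative, reusing the hypothetical epair_queue fields from
the earlier sketch rather than the committed code): the pending queue
is grabbed once in O(1), processed without the lock held, and the task
re-enqueues itself if more packets arrived in the meantime, instead of
looping until the queue is empty.

static void
epair_tx_task(void *arg, int pending __unused)
{
	struct epair_queue *q = arg;
	struct mbufq batch;
	struct mbuf *m;

	mbufq_init(&batch, EPAIR_QUEUE_LIMIT);

	mtx_lock(&q->q_mtx);
	q->q_state = EPAIR_STATE_RUNNING;
	mbufq_concat(&batch, &q->q_pending);	/* O(1) grab of all pending packets */
	mtx_unlock(&q->q_mtx);

	while ((m = mbufq_dequeue(&batch)) != NULL) {
		/* ... deliver m to the peer interface ... */
		m_freem(m);	/* placeholder so the sketch does not leak */
	}

	mtx_lock(&q->q_mtx);
	if (mbufq_len(&q->q_pending) > 0) {
		/* More work arrived; let other queues run before continuing. */
		q->q_state = EPAIR_STATE_WAKING;
		taskqueue_enqueue(q->q_tq, &q->q_task);
	} else {
		q->q_state = EPAIR_STATE_IDLE;
	}
	mtx_unlock(&q->q_mtx);
}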

Yeah, my test box has vm.ndomains: 2.
This version is much, much faster than the previous one, and it does make the code simpler to reason about than what we had before as well.

This revision is now accepted and ready to land. Mar 4 2023, 12:12 PM

@kp, just out of curiosity: how many packets per second did you get in your test after this patch?

> @kp, just out of curiosity: how many packets per second did you get in your test after this patch?

About 1.016 Mpps, so better than the current code.