Details

Reviewers

rgrimes
tuexen
cc
kbowling
jtl
bz
lstewart
rrs

Group Reviewers

transport

Commits

rS362586: MFC r361347: With RFC3168 ECN, CWR SHOULD only be sent with new data
rS361565: MFS r361436: MFC r361347: With RFC3168 ECN, CWR SHOULD only be sent with new…
rS361436: MFC r361347: With RFC3168 ECN, CWR SHOULD only be sent with new data
rS361347: With RFC3168 ECN, CWR SHOULD only be sent with new data

Summary

RFC3168 has a SHOULD clause to send CWR only on *new* *data* segments.

Overly conservative data receivers check this condition and ignore the
CWR flag on other packets (pure ACK and Control).

If the FreeBSD data sender sends out a pure ACK before new data, with the CWR
this can result in a race to the bottom, where the data receiver keeps
ECE latched on, and the data sender continously reducing cwnd until some
very low value.

This can be observed in particular on request/response RPC sessions, where
only minimal data is exchanged before waiting (after the TCP ACK) on the
response from the other side.

Test Plan

Will attach packetdrill script to validate that CWR is retained
until new data is being sent out.

For testing that "fresh" window probes are sent with ECT0, while repeat probes are not:

// A simple test that validates the transmission of
// new window probe data with the ECT0 codepoint
// while repeated window probes are sent without ECT

// Flush Hostcache
+0.00 `sysctl net.inet.tcp.ecn.enable=1`
+0.02 `sysctl net.inet.tcp.hostcache.purgenow=1`

// Create a connecting TCP socket.
+0.010 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0.005 fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0
+0.005 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0.005 setsockopt(3, SOL_SOCKET, SO_SNDBUF, [1048576], 4) = 0
+0     setsockopt(3, SOL_SOCKET, SO_DEBUG, [1], 4) = 0

// Establish a TCP connection
+0.035 connect(3, ..., ...) = -1 // EINPROGRESS

+0 >[noecn] SEW  0:0(0) win 65535 <mss 1460, nop,wscale 6,sackOK,TS val ... ecr 0>
+0 <[noecn] SE. 0:0(0) ack 1 win 65535 <mss 1000>
+0 >[noecn]  .   1:1(0) ack 1

+0.01 getsockopt(3, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
+0.01 fcntl(3, F_SETFL, O_RDWR) = 0   // set back to blocking

// Send some data from server
+0.1 write(3, ..., 5000) = 5000
+0 >[ect0] .       1:1001(1000) ack 1  // we drop this packet
+0 >[ect0] .    1001:2001(1000) ack 1
+0 >[ect0] .    2001:3001(1000) ack 1
+0 >[ect0] .    3001:4001(1000) ack 1
+0 >[ect0] P.   4001:5001(1000) ack 1

// Shrink receive window to zero
+0 <[noecn] .       1:1(0) ack 5001 win 0

// and send more data
+0 write(3, ..., 1000) = 1000

// check for window probe (persist timer)
// as this is new data, ECT0 should be set
+5~+6 >[ect0] . 5001:5002(1) ack 1
//+0 %{ print "{}\t{}".format(tcpi_last_data_recv,tcpi_rcv_space) }%
+0 <[noecn] .       1:1(0) ack 5001 win 0

// check for repeated window probe
+5~+6 >[noecn] . 5001:5002(1) ack 1
+0 <[noecn] .       1:1(0) ack 5002 win 0

//this probe should be ect0 marked again
+5~+6 >[ect0]  . 5002:5003(1) ack 1
+0 <[noecn] .       1:1(0) ack 5003 win 1

// and a regular 1-byte data segment
+0~+6 >[ect0] .     5003:5004(1) ack 1
+0 <[noecn] .       1:1(0) ack 5004 win 10
+0~+6 >[ect0]  .    5004:5014(10) ack 1
+0 <[noecn] .       1:1(0) ack 5014 win 2000
+0 >[ect0]  P.    5014:6001(987) ack 1
+0 <[noecn] .       1:1(0) ack 6001 win 65535

+0.1 write(3, ..., 1) = 1

+0 >[ect0] P.     6001:6002(1) ack 1
+0 <[noecn] .     1:1(0) ack 6002 win 65535

+0.100 close(3) = 0
+0 <[noecn] F. 1:1(0) ack 6002 win 65535
+0 >[noecn] F. 6002:6002(0) ack 1
+0 <[noecn]  . 2:2(0) ack 6003 win 65535
+0 >[noecn] F. 6002:6002(0) ack 2

// Restore default sysctls
`sysctl net.inet.tcp.ecn.enable=2`

And here the script to validate that CWR is only sent on new data, and no premature CC reaction happens after a full ACK:

--tolerance_usecs=250000

// Flush Hostcache
+0.00 `sysctl net.inet.tcp.ecn.enable=1`
//+0.02 `kldload -n cc_cubic`
+0.02 `sysctl net.inet.tcp.cc.algorithm=cubic`
+0.02 `sysctl net.inet.tcp.hostcache.purgenow=1`

// Create a listening TCP socket.
 0.100 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0.005 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0.005 bind(3, ..., ...) = 0
+0.005 listen(3, 1) = 0

// Establish a TCP connection with ECN
+0.035 <[noecn] SEW  0:0(0) win 65535 <mss 1000, sackOK, eol, eol>
+0.000 >[noecn] SE. 0:0(0) ack 1 win 65535 <...>
+0.000 <[noecn]  . 1:1(0) ack 1 win 65535
+0.000 accept(3, ..., ...) = 4

+0 %{ print "\tcwnd\twnd \tssthresh\tNXT" }%
+0 %{ print "\t{}\t{}\t{}\t{}".format(tcpi_snd_cwnd, tcpi_snd_wnd, tcpi_snd_ssthresh, tcpi_snd_nxt) }%


// Write one Initial Window of Date
+0.1 write(4, ..., 10000) = 10000
+0 >[ect0]  .     1:1001(1000) ack 1
+0.01 <[noecn] E. 1:1(0) ack 1001 win 65535
+0 >[ect0]  .  1001:2001(1000) ack 1
+0.01 <[noecn] E. 1:1(0) ack 2001 win 65535
+0 >[ect0]  .  2001:3001(1000) ack 1
+0.01 <[noecn] E. 1:1(0) ack 3001 win 65535
+0 >[ect0]  .  3001:4001(1000) ack 1
+0.01 <[noecn] E. 1:1(0) ack 4001 win 65535
+0 >[ect0]  .  4001:5001(1000) ack 1
+0.01 <[noecn] E. 1:1(0) ack 5001 win 65535
+0 >[ect0]  .  5001:6001(1000) ack 1
+0.01 <[noecn] E. 1:1(0) ack 6001 win 65535
+0 >[ect0]  .  6001:7001(1000) ack 1
+0.01 <[noecn] E. 1:1(0) ack 7001 win 65535
+0 >[ect0]  .  7001:8001(1000) ack 1
+0.01 <[noecn] E. 1:1(0) ack 8001 win 65535
+0 >[ect0]  .  8001:9001(1000) ack 1
+0.01 <[noecn] E. 1:1(0) ack 9001 win 65535
+0 >[ect0] P.  9001:10001(1000) ack 1
+0 %{ print "IW\t{}\t{}\t{}\t{}\t{}".format(tcpi_snd_cwnd, tcpi_snd_wnd, tcpi_snd_ssthresh, tcpi_snd_nxt, tcpi_snd_rexmitpack) }%

+0.01 <[noecn] E. 1:1(0) ack 10001 win 65535
+0 %{ print "ece\t{}\t{}\t{}\t{}".format(tcpi_snd_cwnd, tcpi_snd_wnd, tcpi_snd_ssthresh, tcpi_snd_nxt) }%

+0.01 <[ect0]  .     1:1001(1000) ack 10001 win 65535
+0 %{ print "d1\t{}\t{}\t{}\t{}".format(tcpi_snd_cwnd, tcpi_snd_wnd, tcpi_snd_ssthresh, tcpi_snd_nxt) }%
+0 <[ect0]    PE.  1001:2001(1000) ack 10001 win 65535
+0 %{ print "d2\t{}\t{}\t{}\t{}".format(tcpi_snd_cwnd, tcpi_snd_wnd, tcpi_snd_ssthresh, tcpi_snd_nxt) }%

+0 >[noecn]    . 10001:10001(0) ack 2001
+0 read(4, ..., 2000) = 2000

+0.01 <[ce]    E.  2001:3001(1000) ack 10001 win 65535
+0 <[ect0]     PE.  3001:4001(1000) ack 10001 win 65535

+0 >[noecn]   E. 10001:10001(0) ack 4001
+0 %{ print "3\t{}\t{}\t{}\t{}".format(tcpi_snd_cwnd, tcpi_snd_wnd, tcpi_snd_ssthresh, tcpi_snd_nxt) }%

// Test if ECE becomes unlatched on PureACK with CWR
+0.01 <[noecn] WE. 4001:4001(0) ack 10001 win 65535
+0 >[noecn]   . 10001:10001(0) ack 4001
+0 %{ print "4\t{}\t{}\t{}\t{}".format(tcpi_snd_cwnd, tcpi_snd_wnd, tcpi_snd_ssthresh, tcpi_snd_nxt) }%

+0 <[ect0]  E. 4001:5001(1000) ack 10001 win 65535
+0 <[ect0]  E. 5001:6001(1000) ack 10001 win 65535

// This pure ACK shouldn't have ECE set
+0 >[noecn]   . 10001:10001(0) ack 6001
+0 %{ print "5\t{}\t{}\t{}\t{}".format(tcpi_snd_cwnd, tcpi_snd_wnd, tcpi_snd_ssthresh, tcpi_snd_nxt) }%

// Write a single byte, validate CWR from IW Ack
+0.01 write(4, ..., 1) = 1
+0 >[ect0]  PW. 10001:10002(1) ack 6001
+0 %{ print "6\t{}\t{}\t{}\t{}".format(tcpi_snd_cwnd, tcpi_snd_wnd, tcpi_snd_ssthresh, tcpi_snd_nxt) }%


+0.100 close(4) = 0
+0 <[noecn] F. 6001:6001(0) ack 10002 win 65535
+0 >[noecn] F. 10002:10002(0) ack 6001
+0 <[noecn] . 6002:6002(0) ack 10003 win 65535
+0 >[noecn] F. 10002:10002(0) ack 6002

// Restore default sysctls
`sysctl net.inet.tcp.ecn.enable=2; sysctl net.inet.tcp.cc.algorithm=newreno`

Diff Detail

Repository

rS FreeBSD src repository - subversion

Lint

Lint Not Applicable

Unit

Tests Not Applicable

Event Timeline

rscheff created this revision.Jan 25 2020, 6:15 PM

Herald added a reviewer: transport. · View Herald TranscriptJan 25 2020, 6:15 PM

Harbormaster completed remote builds in B28914: Diff 67290.Jan 25 2020, 6:15 PM

Merge branch 'master' into fix-3168-cwr

Harbormaster completed remote builds in B28915: Diff 67291.Jan 25 2020, 6:22 PM

rscheff added inline comments.Jan 25 2020, 6:28 PM

sys/netinet/tcp_output.c
1166 ↗	(On Diff #67291)	I don't quite understand this conditional; I assume this is to prevent window probes (which may be retransmissions) to be ECT-marked. However, this also affects persist-timer triggered 1-byte data segments which never have been sent before. While strictly speaking, this is a potential bug (delaying CWR erraneously, running into more than 1 window of ECE and an additional cwnd reduction), the impact should be fairly limited and may not warrant additional checks...

The output in Phabricator GUI looks all messed up, the raw diff looks fine though.

sys/netinet/tcp_output.c
1166 ↗	(On Diff #67291)	Perhaps the block comment above needs to be expanded, but that should be in a seperate change, IMHO.

fix snd_recover for ECN handling

Harbormaster completed remote builds in B28916: Diff 67296.Jan 25 2020, 10:22 PM

While testing this patch in detail, I found another issue: as ECN recovery doesn't really keep the snd_una to the left, the current EXIT_RECOVERY code will leave recovery on a full ack when no data was outstanding at the time, the ece was received.

With this change, the CWR will not have been sent, and the other side - when sending additional data with ECE, will cause the code to re-enter ECN recovery, for 2 back-to-back congestion reactions.

sys/netinet/tcp_output.c
1166 ↗	(On Diff #67291)	I'm trying to get a commit log from Rui Paulo, or some other validation as to what the intent has been around this condition.

add CWR-on-new-data related changes to RACK

Harbormaster completed remote builds in B28917: Diff 67297.Jan 25 2020, 10:33 PM

keep comment in RACK in-sync too

Harbormaster completed remote builds in B28918: Diff 67298.Jan 25 2020, 10:37 PM

What RFC3168 asks to do is to send out the CWR flag together with the byte at [snd_max+1]. Thus setting snd_recover to snd_max (before CWR could have arrived on the other end) can lead to premature exit of congrecovery - while the other end still has the ECE latched. That would result in 2 back to back congestion control reactions.

On top of that, the (new) data segment carrying the CWR also has the ECT codepoint, and could become CE-marked by the network. So, when the ACK to the data segment with the CWR arrives, and still has the ECE flag set, that indicates it was CE marked, and another congestion control reaction is necessary. Thus the explicit EXIT_CONGRECOVERY here, and the additional check (like later on in the regular processing) if the ACK did cover the snd_recover point....

rscheff added inline comments.Jan 26 2020, 6:50 PM

sys/netinet/tcp_output.c
1166 ↗	(On Diff #67291)	So, after some pondering about this: If (for whatever reason) cwnd ends up at zero, and the (persist) timer is clocking out new 1-byte payload segments, those should still be ECT-marked (and allowed to carry a CWR). I propose to add a check like ((tp->t_flags & TF_FORCEDATA) && len == 1 && SEQ_LT(tp->snd_una, tp->snd_max) on the basis that zero window probes don't get ACKed until the receiver window can hold them (and then, the next window probe would be a new data segment). For a normal window probe (when there is additional data to send in the socket), only the first probe would be ECT0 marked; the byte would not get acked, and all subsequent probes would be retransmissions.

also mark fresh window probes to be sent with ECT

Harbormaster completed remote builds in B28932: Diff 67328.Jan 26 2020, 8:56 PM

rscheff edited the test plan for this revision. (Show Details)Jan 26 2020, 9:05 PM

This fixes https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=243590

minor update to comments

Herald added a subscriber: melifaro. · View Herald TranscriptFeb 27 2020, 3:40 PM

Harbormaster completed remote builds in B29643: Diff 68886.Feb 27 2020, 3:40 PM

rscheff added inline comments.Feb 27 2020, 4:01 PM

sys/netinet/tcp_stacks/rack.c
1808 ↗	(On Diff #68886)	At this stage, the congestion episode was not yet finished on an ACK covering snd_recover (both loss or ECN cc-reaction). As per RFC guidance, only a single CC reaction per window is required; Since ECN and loss are tracked by two flags, we may enter (and reduce cwnd) independently. However, exit is done only when th_ack >= snd_recover - and snd_recover is pulled forward by whatever happened last. Most problematic may be non-SACK, NewReno recovery, where an ECN after loss could prolong the fast recovery episode (a subsequent loss episode may not result in another cc reaction, if overlapped by an ECN - however, there already were two cc reactions earlier).
1811 ↗	(On Diff #68886)	As CWR will be sent only on the next new data segment, the congestion epoch is over at snd_max+1 with ECN, rather than when snd_max is cumulative acked as in loss recovery.
9491 ↗	(On Diff #68886)	this is to address the case, where persist window probes are done using new 1-octet data segments.

I would like to see some "area of expertise" comments on this change before I can approve it as a mentor.

jtl added a reviewer: lstewart.Apr 23 2020, 2:36 PM

nicefy addition with whitespaces
merge large RACK commit

Harbormaster completed remote builds in B30965: Diff 71504.May 7 2020, 3:36 PM

rscheff added a subscriber: rrs.May 7 2020, 3:39 PM

rscheff added inline comments.

sys/netinet/tcp_stacks/rack.c
13522–13523 ↗	(On Diff #71504)	rS360639 had removed the simple check for window probe packets (snd_nxt == snd_max, len==1, but no forward progress). @rrs what was the motivation for that change, how do you exclude setting ECT0 on window probes in RACK?