Make After-Idle congestion control work correctly for transactional sessions.
Closed (Public)

Authored by rscheff on May 26 2020, 11:41 AM.

Details

Reviewers

rrs
tuexen
rgrimes
jtl
thj
cc

Group Reviewers

transport

Commits

rS363004: MFC r362577: TCP: make after-idle work for transactional sessions.
rS362577: TCP: make after-idle work for transactional sessions.

Summary

When a certain period has passed since a TCP sender has
received network feedback on its half-connection, the
congestion window is supposed to be reset to the initial
window.
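For reference, RFC 5681 (Section 4.1) sets cwnd to the restart window RW = min(IW, cwnd) when the sender has not sent data for more than one RTO. The minimal C sketch below assumes an RFC 6928 initial window of min(10*MSS, max(2*MSS, 14600)) bytes and times in milliseconds. `initial_window()` and `restart_cwnd()` are hypothetical helpers, not FreeBSD functions, and the sketch leaves out the ssthresh adjustment that a real congestion control module may also perform.

```c
#include <stdint.h>

/* Hypothetical helper: RFC 6928 initial window in bytes. */
static uint32_t
initial_window(uint32_t mss)
{
	uint32_t two = 2 * mss;
	uint32_t lower = two > 14600 ? two : 14600;	/* max(2*MSS, 14600) */
	uint32_t ten = 10 * mss;
	return ten < lower ? ten : lower;		/* min(10*MSS, ...) */
}

/*
 * Hypothetical helper: apply the RFC 5681 restart rule.
 * If the sender was idle for at least one RTO, cwnd becomes
 * RW = min(IW, cwnd); otherwise it is left unchanged.
 */
static uint32_t
restart_cwnd(uint32_t cwnd, uint32_t mss, uint32_t idle_ms, uint32_t rto_ms)
{
	if (idle_ms < rto_ms)
		return cwnd;			/* not idle long enough */
	uint32_t rw = initial_window(mss);
	return cwnd < rw ? cwnd : rw;
}
```

With a 1460-byte MSS, a 100000-byte cwnd that has been idle for 5000 ms against a 1000 ms RTO is cut to 14600 bytes, while a cwnd already below the initial window is kept as-is.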

Until now, t_rcvtime, which is updated for every
incoming segment (pure ACKs as well as data), was
used as a proxy for the time of the last (data)
transmission. This works fine for sessions doing
mostly bulk transfers in a single direction. However,
this approach fails for transactional IO, where the
server repeatedly transmits large chunks of data
after the client requests them, with a variable pause
between requests.

In that case, each incoming request effectively
resets t_rcvtime, so the sender retains the last
value of its congestion window, however large that
may have been. Ultimately, this results in a large
burst of data being transmitted blindly into the
network at wire speed, without considering any
change in network conditions since the last
transmission. This can significantly exacerbate
any induced packet loss.
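The difference between the two idle tests can be sketched with a hypothetical `struct idle_state` (not a FreeBSD structure), assuming a tick counter that wraps modulo 2^32 and an RTO expressed in the same ticks. For the transactional pattern described above, the receive-based test sees almost no idle time because the request has just arrived, while a send-based test sees the whole pause.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-connection timestamps, all in ticks. */
struct idle_state {
	uint32_t last_rcv;	/* like t_rcvtime: any incoming segment */
	uint32_t last_snd;	/* time of the last data transmission */
	uint32_t rto;		/* current retransmission timeout */
};

/* Old heuristic: idle only if nothing was received for >= RTO. */
static bool
idle_by_rcvtime(const struct idle_state *s, uint32_t now)
{
	/* Unsigned subtraction stays correct across tick wraparound. */
	return (uint32_t)(now - s->last_rcv) >= s->rto;
}

/* Send-side view: idle if no data was sent for >= RTO. */
static bool
idle_by_sndtime(const struct idle_state *s, uint32_t now)
{
	return (uint32_t)(now - s->last_snd) >= s->rto;
}
```

For example, with the last response sent at tick 1000, a request arriving at tick 78000, and the reply going out at tick 78001 with an RTO of 300 ticks, `idle_by_rcvtime()` returns false (so cwnd is kept), while `idle_by_sndtime()` returns true.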

In this Diff, the existing RTT sampling mechanism is
used to gather more appropriate timestamps of when
the last data segment was sent, and the check whether
an RTT sample is currently running is moved from
looking at t_rtttime to t_rtseq.

Further, we also slightly adjust these variables in
case they happen to be zero when a new sample is
started.
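A minimal sketch of the sampling logic described above, under stated assumptions: `struct rtt_sampler`, `rtt_on_send()`, `rtt_on_ack()` and `nonzero()` are hypothetical names loosely modeled on t_rtttime/t_rtseq, not the actual FreeBSD code. The model assumes the timed sequence number is the end of the segment and that a sample completes once an ACK reaches it. Because "sample running" is signaled by a nonzero rtseq, rtttime in this model is no longer cleared when a sample ends, so it keeps the send time of the last timed data segment; zero values are nudged to 1 so that 0 can keep meaning "unset".

```c
#include <stdint.h>

/* Modular sequence comparison, in the style of the BSD SEQ_* macros. */
#define SEQ_GEQ(a, b)	((int32_t)((a) - (b)) >= 0)

/* Hypothetical, simplified RTT sampler. */
struct rtt_sampler {
	uint32_t rtttime;	/* ticks when the timed segment was sent */
	uint32_t rtseq;		/* end sequence of timed segment; 0 = none */
};

/* Nudge a value away from 0 so that 0 can keep meaning "unset". */
static uint32_t
nonzero(uint32_t v)
{
	return v == 0 ? 1 : v;
}

/* A data segment ending at end_seq is sent at tick 'now'. */
static void
rtt_on_send(struct rtt_sampler *r, uint32_t end_seq, uint32_t now)
{
	if (r->rtseq == 0) {		/* no sample in progress: start one */
		r->rtseq = nonzero(end_seq);
		r->rtttime = nonzero(now);
	}
}

/* Returns the RTT in ticks if this ACK completes the sample, else -1. */
static int64_t
rtt_on_ack(struct rtt_sampler *r, uint32_t ack, uint32_t now)
{
	if (r->rtseq == 0 || !SEQ_GEQ(ack, r->rtseq))
		return -1;
	r->rtseq = 0;			/* stop the sample, but keep rtttime */
	return (int64_t)(uint32_t)(now - r->rtttime);
}
```

After a sample completes, rtttime still holds the send time of the last timed segment, which is the kind of timestamp an after-idle check can compare against.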

There is a minuscule chance that a dramatically delayed
RTT sample is collected: when a data segment happens
to end with an absolute sequence number of zero (as
that would not stop the RTT sample immediately), and at
that very moment no further data is exchanged until a
much later time. However, this would always be a transient
effect, as sRTT and RTTvar will quickly converge to
appropriate values again, and the excessive timeout
value may not even be used at all.
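The corner case can be reproduced in a simplified, hypothetical model (not the FreeBSD code): the timed sequence number is the segment's end, 0 is the "no sample" sentinel and is therefore bumped to 1, and a sample stops once an ACK reaches the timed sequence number. A segment ending exactly at 0 is acknowledged with ACK 0, which does not reach 1, so the sample keeps running until some later ACK, possibly after a long pause.

```c
#include <stdint.h>

#define SEQ_GEQ(a, b)	((int32_t)((a) - (b)) >= 0)

/* Hypothetical end-sequence model: rtseq == 0 means no sample. */
struct rtt_sampler {
	uint32_t rtttime;	/* ticks when the timed segment was sent */
	uint32_t rtseq;		/* end sequence being timed */
};

static void
rtt_on_send(struct rtt_sampler *r, uint32_t end_seq, uint32_t now)
{
	if (r->rtseq == 0) {
		/* 0 is the sentinel, so an end sequence of 0 becomes 1 */
		r->rtseq = end_seq == 0 ? 1 : end_seq;
		r->rtttime = now == 0 ? 1 : now;
	}
}

/* Returns the RTT in ticks if the sample completes, else -1. */
static int64_t
rtt_on_ack(struct rtt_sampler *r, uint32_t ack, uint32_t now)
{
	if (r->rtseq == 0 || !SEQ_GEQ(ack, r->rtseq))
		return -1;
	r->rtseq = 0;
	return (int64_t)(uint32_t)(now - r->rtttime);
}
```

In this model, a segment ending at sequence 0 sent at tick 1000 and acknowledged at tick 1010 yields no sample; if the next data is only acknowledged at tick 60010, the sample reports 59010 ticks instead of about 10.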

Reported-by: rrs

Test Plan

See the attached packetdrill script.

Diff Detail

Lint

Lint Passed

Unit

No Test Coverage

Build Status

Buildable 31314
Build 28952: arc lint + arc unit

Event Timeline

rscheff created this revision. May 26 2020, 11:41 AM

Herald added a reviewer: transport. May 26 2020, 11:41 AM

Herald added a subscriber: melifaro.

rscheff requested review of this revision. May 26 2020, 11:41 AM

Harbormaster completed remote builds in B31314: Diff 72279. May 26 2020, 11:41 AM

newreno-after-idle-server.pkt (6 KB)

The packetdrill script newreno-after-idle-server.pkt demonstrates the issue of the inactive (receive-only) half-session not resetting cwnd after idle.

While investigating this problem further, I found a more severe implication:

Skipping the after_idle function in transactional sessions running Cubic also retains a (very old) Cubic epoch time. This means that an arbitrarily large jump in cwnd will happen, depending only on how long the client paused between IO requests (and on their intensity, as cwnd is only adjusted when the sending half-session is cwnd-limited rather than application-limited).

In my testing, skipping after-idle in Cubic (after a 77 sec pause in IO requests, which started about 1 sec after a new Cubic epoch), followed by a phase of low-intensity IO (the server being application-limited rather than cwnd-limited, so cwnd remained untouched), resulted in the Cubic cwnd update happening only after 568 sec, with a sudden jump from 36 kB to 573 kB. This caused a line-rate burst of comparable size, followed by self-inflicted loss, loss of fast retransmissions, and RTOs.

These severe implications are unlikely to be as devastating with NewReno, whose congestion avoidance grows cwnd very slowly, but the absolute-time dependency of Cubic emphasises this problem.

I don't mind you tracking t_rtseq here, but please do not change the idle reduction in rack. I will go dig up the right variable and change this to use that. You are welcome to maintain t_rtseq, but please don't change the rack behavior here.