[tcp:] remove incorrect reset of sack variable in prr
ClosedPublic

Authored by rscheff on Mar 5 2021, 3:13 PM.

Details

Reviewers

rrs
tuexen
jtl
gallatin

Group Reviewers

transport

Commits

rGbb60a68985c8: tcp: remove incorrect reset of SACK variable in PRR
rGd90bba73a2e4: tcp: remove incorrect reset of SACK variable in PRR
rG4a8f3aad37dd: tcp: remove incorrect reset of SACK variable in PRR

Summary

Fix a sporadic panic (PR 253848) when using PRR, which can happen
when t_dupack again equals dupthresh while the old window of
loss recovery is not yet completely acknowledged.

Test Plan

In the process of creating a packetdrill script that exercises the
loss recovery pattern necessary to run into this issue.

Diff Detail

Repository

rG FreeBSD src repository

Lint

Lint Not Applicable

Unit

Tests Not Applicable

Event Timeline

rscheff created this revision. Mar 5 2021, 3:13 PM

Herald added 1 blocking reviewer(s): transport. Mar 5 2021, 3:13 PM

Herald added subscribers: melifaro, imp.

rscheff requested review of this revision. Mar 5 2021, 3:13 PM

Harbormaster completed remote builds in B37600: Diff 85203. Mar 5 2021, 3:14 PM

tuexen added inline comments. Mar 5 2021, 3:39 PM

sys/netinet/tcp_sack.c
835 ↗	(On Diff #85203)	Is this change part of the fix, or just an unrelated cleanup (in which case it should be committed separately)?

remove cleanup

Harbormaster completed remote builds in B37606: Diff 85209. Mar 5 2021, 3:42 PM

I'm fine with the change now. Just would like to know if this fixes the issue observed by rrs@.

tuexen accepted this revision as: tuexen. Mar 5 2021, 3:46 PM

Yes. This particular assignment was (effectively) commented out yesterday evening in order to collect detailed logs of when this may occur; this change is the functional equivalent.

The initialization of sack_bytes_rexmit here is not really necessary, and when there are two consecutive loss recovery windows, old SACK holes may carry over into the new recovery window (which is when the KASSERT would trigger).
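The carry-over described above can be sketched with a small toy model. This is not the kernel code: `struct hole`, `hole_retransmit()`, `hole_remove()` and `run_scenario()` are hypothetical stand-ins, and the sketch assumes that the KASSERT in question checks that sack_bytes_rexmit stays non-negative when a hole is removed (the assertion itself is not quoted in this review).

```c
/* Toy model only: hypothetical, simplified stand-ins for the SACK state. */
struct hole {
	int start;	/* first sequence number of the hole */
	int rxmit;	/* next sequence number to retransmit in the hole */
	int end;	/* first sequence number after the hole */
};

struct sackhint {
	int sack_bytes_rexmit;	/* bytes retransmitted from holes still tracked */
};

/* Account for retransmitting len bytes out of a hole. */
void
hole_retransmit(struct sackhint *h, struct hole *cur, int len)
{
	cur->rxmit += len;
	h->sack_bytes_rexmit += len;
}

/* Drop an acknowledged hole and return the counter after the update. */
int
hole_remove(struct sackhint *h, const struct hole *cur)
{
	h->sack_bytes_rexmit -= cur->rxmit - cur->start;
	return (h->sack_bytes_rexmit);
}

/*
 * Two back-to-back recovery windows. reset_on_entry models the
 * assignment removed by this revision.
 */
int
run_scenario(int reset_on_entry)
{
	struct sackhint hint = { 0 };
	struct hole old = { 1000, 1000, 2000 };

	/* First recovery window: the whole hole is retransmitted. */
	hole_retransmit(&hint, &old, 1000);

	/* A new recovery window starts while the old hole is still tracked. */
	if (reset_on_entry)
		hint.sack_bytes_rexmit = 0;

	/* The old hole is finally acknowledged and removed. */
	return (hole_remove(&hint, &old));
}
```

In this model, `run_scenario(1)` leaves the counter at -1000, which is the kind of state a non-negativity assertion would catch, while `run_scenario(0)` leaves it at 0.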

While I still need to look at the side effects in detail, I expect the worst effect would be for PRR to be off in the transmission timing of a few (new) data segments, which may happen - from the instances known so far - within 3 segments / ACKs.
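To illustrate why a slightly different pipe estimate would only shift PRR's send timing by a segment or so, here is a minimal sketch of the per-ACK send quota from RFC 6937 (the slow-start reduction bound variant). The function name `prr_sndcnt()`, its parameter names, and the final clamp to zero are illustrative assumptions rather than the FreeBSD implementation, and the sketch assumes that sack_bytes_rexmit feeds into the pipe estimate.

```c
/* Integer ceil(a / b) for positive b. */
long
div_ceil(long a, long b)
{
	return ((a + b - 1) / b);
}

/*
 * Bytes PRR allows to be sent in response to one ACK, following the
 * RFC 6937 pseudocode. All quantities are in bytes.
 */
long
prr_sndcnt(long prr_delivered, long prr_out, long ssthresh,
    long recover_fs, long pipe, long delivered, long mss)
{
	long limit, sndcnt;

	if (pipe > ssthresh) {
		/* Proportional rate reduction phase. */
		sndcnt = div_ceil(prr_delivered * ssthresh, recover_fs) -
		    prr_out;
	} else {
		/* Slow-start reduction bound (PRR-SSRB). */
		limit = prr_delivered - prr_out;
		if (limit < delivered)
			limit = delivered;
		limit += mss;
		sndcnt = ssthresh - pipe;
		if (sndcnt > limit)
			sndcnt = limit;
	}
	/* Clamp added for this sketch; a negative quota means "send nothing". */
	return (sndcnt > 0 ? sndcnt : 0);
}
```

With ssthresh = 10000, RecoverFS = 20000, MSS = 1000, prr_delivered = 4000, prr_out = 1000 and 1000 bytes newly delivered, a pipe estimate of 12000 yields a quota of 1000 bytes, while a pipe estimate of 8000 yields 2000 bytes, a difference of one segment on that ACK.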

This seems to fix the panics that we were seeing. I used to see a panic almost instantly after reboot, as soon as the server started taking traffic. At this point, it's been up for 30 minutes with no issues. Note that I'm testing the original patch.

rrs accepted this revision. Mar 5 2021, 4:30 PM

This revision is now accepted and ready to land. Mar 5 2021, 4:30 PM

I just re-tested with the latest patch (the one-line removal of tp->sackhint.sack_bytes_rexmit = 0) and it also seems to be working fine.

Closed by commit rG4a8f3aad37dd: tcp: remove incorrect reset of SACK variable in PRR (authored by rscheff). Mar 5 2021, 5:19 PM

This revision was automatically updated to reflect the committed changes.

rscheff added a commit: rG4a8f3aad37dd: tcp: remove incorrect reset of SACK variable in PRR.

rscheff added a commit: rGd90bba73a2e4: tcp: remove incorrect reset of SACK variable in PRR. Mar 8 2021, 11:22 AM

rscheff added a commit: rGbb60a68985c8: tcp: remove incorrect reset of SACK variable in PRR. Mar 8 2021, 2:18 PM