Fix sporadic panic (PR 253848) when using PRR, which can happen
when t_dupack again equals to dupthesh while the old window of
loss recovery is not completely acknowledged.
Details
In the process of creating a packetdrill script to excercise the
neccessary loss recovery pattern, to run into this issue.
Diff Detail
- Repository
- rG FreeBSD src repository
- Lint
Lint Not Applicable - Unit
Tests Not Applicable
Event Timeline
sys/netinet/tcp_sack.c | ||
---|---|---|
835 ↗ | (On Diff #85203) | Is this change part of the fix or just an unrelated cleanup (in which case it should be committed separately). |
I'm fine with the change now. Just would like to know if this fixes the issue observed by rrs@.
yes, after this particular assignment was commented out (effectively) yesterday evening, to collect some detailed logs when this may occur, this change is the functional equivalend.
The initialization of sack_bytes_rexmit here is not really necessary - and when there are to consective loss recovery windows, old sack holes may carry over into the new recovery window (which is when the KASSERT would trigger).
While I still need to look at the side effects in detail, I expect the worst effect would be for PRR to be off in the transmission timing of a few (new) data segments, which may happen - from the instances known so far - within 3 segments / ACKs.
This seems to fix the panics that we were seeing. I used to see a panic almost instantly after reboot, as soon as the server started taking traffic. At this point, its been up for 30 minutes with no issues. Note that I'm testing the original patch.
I just re-tested with the latest patch (the one line removal of tp->sackhint.sack_bytes_rexmit = 0) and it seems to also be working fine.