
LACP w/ short timeout erroneously declares link-flapping
Closed, Public

Authored by rpokala on Apr 26 2022, 6:20 PM.

Details

Summary

Panasas was seeing a higher-than-expected number of link-flap events. After joint debugging with the switch vendor, we determined there were problems on both sides. Either one alone might cause an occasional event, but together they caused many.

On the switch side, an internal queuing issue was causing LACP PDUs -- which should be sent every second, in short-timeout mode -- to sometimes be sent slightly later than they should have been. In some cases, two successive PDUs were late, but we never saw three late PDUs in a row.

On the FreeBSD side, we saw a link-flap event every time there were two late PDUs, even though the spec says that it takes *three* seconds of downtime to trigger that event. It turns out that if a PDU was received shortly before the periodic timer code ran, the timeout counter would be decremented less than a full second after the PDU arrived. Two delayed PDUs would then allow two additional decrements, so the counter reached zero less than three seconds after the most recent on-time PDU.

The solution is to record the time each PDU arrives, and to decrement the timer only if at least a full second has elapsed since then.
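
To illustrate the idea, here is a minimal, self-contained sketch of the two tick behaviours. It is not the code in this diff: the names `port_timer`, `pdu_received`, `tick_naive` and `tick_guarded` are hypothetical, and times are plain milliseconds rather than kernel time values.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SHORT_TIMEOUT_TICKS 3     /* short timeout: three 1-second periods */
#define TICK_MS             1000  /* the timer code runs once per second */

/* Hypothetical per-port state; not the real lagg(4)/LACP structures. */
struct port_timer {
	int     current_while;    /* ticks left before the partner times out */
	int64_t last_pdu_ms;      /* arrival time of the most recent PDU */
};

/* A PDU arrived: restart the countdown and remember when it arrived. */
static void
pdu_received(struct port_timer *pt, int64_t now_ms)
{
	pt->current_while = SHORT_TIMEOUT_TICKS;
	pt->last_pdu_ms = now_ms;
}

/* Old behaviour: decrement on every tick. Returns true on expiry. */
static bool
tick_naive(struct port_timer *pt)
{
	assert(pt->current_while >= 0);
	if (pt->current_while > 0 && --pt->current_while == 0)
		return (true);
	return (false);
}

/*
 * Guarded behaviour: skip the decrement if less than one full tick
 * period has elapsed since the last PDU arrived.
 */
static bool
tick_guarded(struct port_timer *pt, int64_t now_ms)
{
	if (now_ms - pt->last_pdu_ms < TICK_MS)
		return (false);
	return (tick_naive(pt));
}
```

With a PDU arriving at 900 ms and ticks at 1000, 2000 and 3000 ms, `tick_naive` expires on the third tick, only 2100 ms after the PDU. `tick_guarded` skips the first tick, so it cannot expire before the tick at 4000 ms, which is 3100 ms after the PDU.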

Test Plan

Tested against the buggy switch firmware, and also against debug firmware that can delay PDUs on demand. One or two dropped or delayed PDUs did not result in a link-flap event, but three dropped or delayed PDUs did.

Diff Detail

Lint: Lint Skipped
Unit: Tests Skipped