Paths

Table of Contentst

tcp: Don't try to disconnect a socket multiple times.
ClosedPublic
Actions

Authored by jhb on Oct 24 2022, 9:53 PM.

Details

Reviewers

glebius
rrs
np

Group Reviewers

transport

Commits

rGcda6bdbaa15e: tcp: Don't try to disconnect a socket multiple times.

Summary

When the checks for INP_TIMEWAIT were removed, tcp_usr_close() and
tcp_usr_disconnect() were no longer prevented from calling
tcp_disconnect() on a socket that was already disconnected. This
triggered a panic in cxgbe(4) for TOE where the tcp_disconnect() on an
already-disconnected socket invoked tcp_output() on a socket that was
already in time-wait.

Sponsored by: Chelsio Communications, Inc.

Diff Detail

Repository

rG FreeBSD src repository

Lint

Lint Not Applicable

Unit

Tests Not Applicable

Event Timeline

jhb created this revision.Oct 24 2022, 9:53 PM

Herald added a reviewer: transport. · View Herald TranscriptOct 24 2022, 9:53 PM

Herald added subscribers: melifaro, imp. · View Herald Transcript

jhb requested review of this revision.Oct 24 2022, 9:53 PM

Harbormaster completed remote builds in B47980: Diff 112166.Oct 24 2022, 9:53 PM

np accepted this revision.Oct 24 2022, 9:56 PM

I was able to trigger the panic in TOE reliably using openssl s_server. In this case what happened is that openssl called shutdown(2) explicitly before close(2). The call to shutdown(2) called tcp_disconnect() from tcp_usr_disconnect() and the other side responded with the FIN/ACK before the original user thread called close(2). When close(2) called tcp_usr_close() it then the socket was already in time wait. Possibly in tcp_usr_close() this could require SS_ISCONNECTED or SS_ISCONNECTING to guard the call to tcp_disconnect() instead of checking t_state?

I worry there are various other subtle bugs like this due to removing INP_TIMEWAIT checks. This one was rather non-obvious and away from the place that crashed (t4_push_frames called from tcp_output from tcp_disconnect from tcp_usr_close).

jhb added inline comments.Oct 24 2022, 10:03 PM

sys/netinet/tcp_usrreq.c
1256	I suppose actually if we were to check SS_* flags here we might not even the need the INP_DROPPED check as dropped sockets should always be marked SS_DISCONNECTED? That is, we could maybe replace this line with: if ((so->so_state & (SS_ISCONNECTING \| SS_ISCONNECTED)) != 0) { and would then not need the TCPS_TIME_WAIT check either.

I would prefer to understand and eliminate the scenario when we are calling disconnect on an already disconnected socket, instead of reintroducing this bandaid. Is there any way for me to reproduce it without Chelsio hardware?

glebius added inline comments.Oct 25 2022, 6:14 PM

sys/netinet/tcp_usrreq.c
1256	Ideally such check would require socket lock.

In D37112#842972, @glebius wrote:

I would prefer to understand and eliminate the scenario when we are calling disconnect on an already disconnected socket, instead of reintroducing this bandaid. Is there any way for me to reproduce it without Chelsio hardware?

As I said, just call shutdown(2) before close(2). This doesn't panic currently, but if you added an assertion in tcp_disconnect() that the state wasn't already TIME_WAIT it would panic today if you used 'openssl s_server'. The panic is only in the TOE code, the software stacks are probably just running unexpected code (like calling tcp_output() on a socket in TIME_WAIT).

jhb added inline comments.Oct 27 2022, 10:01 PM

sys/netinet/tcp_usrreq.c
1256	If we always hold the INP lock when calling soisdisconnecting() or soisdisconnected() then it's fine to just hold the INP lock to read the flags.

May I ask for time until Wednesday to think more about this? If it critical for further work on TOE it is fine to just commit as is.

@jhb can you please share the actual panic message and trace with TOE? Also the suggested change is actually two different changes preventing call tp tcp_usr_disconnect() in TIME-WAIT. Is there two different panic paths?

Looking at the historic TCP stack (before compressed TIME-WAIT) was added, I don't see any protection like suggested in this change. And we see that the normal non-TOE TCP stacks have no problem with this.

In D37112#845101, @glebius wrote:

@jhb can you please share the actual panic message and trace with TOE? Also the suggested change is actually two different changes preventing call tp tcp_usr_disconnect() in TIME-WAIT. Is there two different panic paths?

Looking at the historic TCP stack (before compressed TIME-WAIT) was added, I don't see any protection like suggested in this change. And we see that the normal non-TOE TCP stacks have no problem with this.

I can reproduce it if desired. The panic message will be boring. It's a page fault in t4_push_frames() due to toep->inp being NULL, so the panic message is "page fault" with an address of 0. toep->inp gets cleared to NULL in final_cpl_received() in the driver which occurs when the connection is fully closed on the card and the card sends back a message to note that it has closed the connection after receiving the FIN/ACK.

AFAIK, TOE did not exist in the tree prior to compressed TIME-WAIT. If the right approach is to make the TOE code explicitly check for TIME-WAIT, so be it, but it seems rather broken to be disconnecting an already-disconnected socket. The panic I got was in the tcp_usr_close() path because that is the one I saw (the panic was from the close() after shutdown()). I patched tcp_usr_disconnect() for completeness. However, I could probably write a test that called shutdown(2) twice in a row that would also do a double disconnect. Looking at tcp_disconenct() today, if you set SO_LINGER and then called shutdown(2) before close(2), it would call tcp_drop() twice on the same tp which would probably corrupt the hpts links due to duplicate tcp_hpts_remove() in tcp_close() as well as moving the connection from TIME_WAIT back to TCPS_CLOSED and other shenanigans. I will see if I can reproduce that issue (shouldn't be hard, just connect(), set SO_LINGER, shutdown(), then close() then wait for some unrelated panic in the future due to corrupting data structures).

May I ask for more time? I just found out that the only reason the TIME_WAIT persists correctly with the current code is that callout_drain_async() doesn't really behave as it is documented. Even when it is able to stop callout immediately, it will leave it for a drain at the scheduled time. So, when I was doing the 9c3507f91987b7c8ce3b895a5ee46a1de6bc42f2 change, I was puzzled why I can safely remove all this TIME_WAIT handling from here, as it seems right, but is never executing. Now I realized why it happened so. I'm working towards standard callout KPI for the TCP timers and this will very likely almost revert 9c3507f91987b7c8ce3b895a5ee46a1de6bc42f2 back to original Robert's code. It could be that the TOE problem will go away or it could manifest itself in a different manner.

I'm also putting that on tomorrows TCP talk agenda with tuexen, rscheff and others.

In D37112#845794, @glebius wrote:

May I ask for more time? I just found out that the only reason the TIME_WAIT persists correctly with the current code is that callout_drain_async() doesn't really behave as it is documented. Even when it is able to stop callout immediately, it will leave it for a drain at the scheduled time. So, when I was doing the 9c3507f91987b7c8ce3b895a5ee46a1de6bc42f2 change, I was puzzled why I can safely remove all this TIME_WAIT handling from here, as it seems right, but is never executing. Now I realized why it happened so. I'm working towards standard callout KPI for the TCP timers and this will very likely almost revert 9c3507f91987b7c8ce3b895a5ee46a1de6bc42f2 back to original Robert's code. It could be that the TOE problem will go away or it could manifest itself in a different manner.

Ok, sounds good to me.

FYI, I was writing KTLS tests today and happened to need a small test that did a shutdown(s, SHUT_WR) before calling close() and this reliably reproduces this panic on TOE.

Sample panic:

Kernel page fault with the following non-sleepable locks held:
exclusive rw tcpinp (tcpinp) r = 0 (0xfffff801164a6d50) locked @ /mnt/jhb/work/git/send_tags/sys/netinet/tcp_usrreq.c:1320
stack backtrace:
#0 0xffffffff80c52b45 at witness_debugger+0x65
#1 0xffffffff80c53caa at witness_warn+0x3fa
#2 0xffffffff810c8a96 at trap_pfault+0x86
#3 0xffffffff8109b5a8 at calltrap+0x8
#4 0xffffffff831de5d9 at t4_tod_output+0x69
#5 0xffffffff80ded1ba at tcp_disconnect+0xda
#6 0xffffffff80de96bb at tcp_usr_close+0x6b
#7 0xffffffff80c8e3e9 at soclose+0xe9
#8 0xffffffff80b7b7fb at _fdrop+0x1b
#9 0xffffffff80b7f04b at closef+0x1db
#10 0xffffffff80b7e6b3 at fdescfree+0x433
#11 0xffffffff80b90e6f at exit1+0x49f
#12 0xffffffff80b909cd at sys_exit+0xd
#13 0xffffffff810c93be at amd64_syscall+0x12e
#14 0xffffffff8109bebb at fast_syscall_common+0xf8


Fatal trap 12: page fault while in kernel mode
cpuid = 2; apic id = 02
fault virtual address   = 0x60
fault code              = supervisor read data, page not present
instruction pointer     = 0x20:0xffffffff831dca02
stack pointer           = 0x28:0xfffffe011869da40
frame pointer           = 0x28:0xfffffe011869daf0
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 4870 (ktls_test)
rdi: fffff800261e7000 rsi: fffff80116ee1000 rdx:                0
rcx:              4e6  r8: fffff802cf5b8740  r9: fffff802cf5b8740
rax: fffff802cf5b8740 rbx: fffff80116ee1000 rbp: fffffe011869daf0
r10:                0 r11:            10000 r12: fffff80116ee1000
r13:                0 r14:                0 r15: fffff801164a6d40
trap number             = 12
panic: page fault
cpuid = 2
time = 1668208720
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe011869d800
vpanic() at vpanic+0x151/frame 0xfffffe011869d850
panic() at panic+0x43/frame 0xfffffe011869d8b0
trap_fatal() at trap_fatal+0x409/frame 0xfffffe011869d910
trap_pfault() at trap_pfault+0xab/frame 0xfffffe011869d970
calltrap() at calltrap+0x8/frame 0xfffffe011869d970
--- trap 0xc, rip = 0xffffffff831dca02, rsp = 0xfffffe011869da40, rbp = 0xfffffe011869daf0 ---
t4_push_frames() at t4_push_frames+0x22/frame 0xfffffe011869daf0
t4_tod_output() at t4_tod_output+0x69/frame 0xfffffe011869db20
tcp_disconnect() at tcp_disconnect+0xda/frame 0xfffffe011869db50
tcp_usr_close() at tcp_usr_close+0x6b/frame 0xfffffe011869dba0
soclose() at soclose+0xe9/frame 0xfffffe011869dc00
_fdrop() at _fdrop+0x1b/frame 0xfffffe011869dc20
closef() at closef+0x1db/frame 0xfffffe011869dcb0
fdescfree() at fdescfree+0x433/frame 0xfffffe011869dd80
exit1() at exit1+0x49f/frame 0xfffffe011869ddf0
sys_exit() at sys_exit+0xd/frame 0xfffffe011869de00
amd64_syscall() at amd64_syscall+0x12e/frame 0xfffffe011869df30
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe011869df30
--- syscall (1, FreeBSD ELF64, exit), rip = 0x175d2d053ada, rsp = 0x175d27f41f58, rbp = 0x175d27f41f70 ---
KDB: enter: panic
[ thread pid 4870 tid 100725 ]
Stopped at      kdb_enter+0x32: movq    $0,0x12c20c3(%rip)
db> show ktr
696 (0xfffff8006d2a4740:cpu0): final_cpl_received: tid 10664, toep 0xfffff80116ee1000 (0x70a7), inp 0xfffff801164a6d40 (0x2000)
695 (0xfffff8006d2a4740:cpu0): do_peer_close: tid 10664 (FIN_WAIT_2), toep_flags 0x70a7, ddp_flags 0x0, inp 0xfffff801164a6d40
694 (0xfffff8006d2a4740:cpu0): do_rx_tls_cmp: tid 10664 rx_credits 0 rcv_wnd 65525
693 (0xfffff8006d2a4740:cpu0): do_rx_tls_cmp: tid 10664 PDU len 11 len 16 seq 671503071, rcv_nxt 671503071
692 (0xfffff8006d2a4740:cpu0): do_tls_data: tid 10664 len 11 seq 671503071
691 (0xfffff8006d2a4740:cpu0): do_close_con_rpl: tid 10664 (FIN_WAIT_1), toep_flags 0x70a7
690 (0xfffff802cf5b8740:cpu2): t4_close_conn: tid 10664

The important bit here is that events 690 and 691 are when shutdown(2) is called. The event at 695 is the peer responding with FIN/ACK. We are trying to send out data via tcp_output() after that which is what triggers the panic.

The source code of the test in question (a snippet) is pretty much what I suggested earlier for a simple test case:

static void
test_ktls_receive_bad_size(const atf_tc_t *tc, struct tls_enable *en,
    uint64_t seqno, size_t len)
{
	struct tls_record_layer *hdr;
	char *outbuf;
	size_t outbuf_len;
	ssize_t rv;
	int sockets[2];

	outbuf_len = sizeof(*hdr) + len;
	outbuf = calloc(1, outbuf_len);

	ATF_REQUIRE_MSG(open_sockets(tc, sockets), "failed to create sockets");

	ATF_REQUIRE(setsockopt(sockets[0], IPPROTO_TCP, TCP_RXTLS_ENABLE, en,
	    sizeof(*en)) == 0);
	check_tls_mode(tc, sockets[0], TCP_RXTLS_MODE);

	fd_set_blocking(sockets[0]);
	fd_set_blocking(sockets[1]);

	hdr = (void *)outbuf;
	hdr->tls_vmajor = en->tls_vmajor;
	if (en->tls_vminor == TLS_MINOR_VER_THREE)
		hdr->tls_vminor = TLS_MINOR_VER_TWO;
	else
		hdr->tls_vminor = en->tls_vminor;
	hdr->tls_type = TLS_RLTYPE_APP;
	hdr->tls_length = htons(len);

	rv = write(sockets[1], outbuf, outbuf_len);
	ATF_REQUIRE(rv == (ssize_t)outbuf_len);

	ATF_REQUIRE(shutdown(sockets[1], SHUT_WR) == 0);

	ktls_receive_tls_error(sockets[0], EMSGSIZE);
}

rrs accepted this revision.Nov 22 2022, 2:10 PM

This revision is now accepted and ready to land.Nov 22 2022, 2:10 PM

@glebius ping

Closed by commit rGcda6bdbaa15e: tcp: Don't try to disconnect a socket multiple times. (authored by jhb). · Explain WhyFeb 17 2023, 5:14 PM

This revision was automatically updated to reflect the committed changes.

jhb added a commit: rGcda6bdbaa15e: tcp: Don't try to disconnect a socket multiple times..