
mlx5e: Immediately initialize TLS send tags
Closed, Public

Authored by gallatin on Tue, Oct 22, 9:58 PM.
Tags
None

Details

Summary

Under massive connection thrashing (for example, a web server restarting), we see long periods where the web server blocks while enabling kTLS on a connection when NIC kTLS offload is enabled.

It turns out the driver uses a single-threaded Linux work queue to serialize the commands that must be sent to the NIC to allocate and free TLS resources. When freeing sessions, this work is handled asynchronously. However, when allocating sessions, the work is handled synchronously, and the driver waits for the work to complete before returning. Under massive connection thrashing, the work queue is first filled by closing TLS sessions. Then, when new sessions arrive, the web server enables kTLS and blocks while the tens or hundreds of thousands of queued session closes are processed by the NIC.
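The head-of-line blocking described above can be illustrated with a toy userland model. This is only a sketch, not driver code: `struct toy_wq`, `toy_wq_submit_async()`, and `toy_wq_submit_sync()` are hypothetical names, and the model assumes every queued command costs one time unit.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy model (not driver code): a single-threaded FIFO work queue in which
 * every command takes one time unit for the NIC to process.
 */
struct toy_wq {
	size_t pending;	/* commands already queued, e.g. session frees */
};

/* Asynchronous submit: enqueue and return at once; the caller waits 0 units. */
static size_t
toy_wq_submit_async(struct toy_wq *wq)
{
	assert(wq != NULL);
	wq->pending++;
	return (0);
}

/*
 * Synchronous submit: the caller sleeps until its own command completes,
 * which in a FIFO means after everything that was queued ahead of it.
 */
static size_t
toy_wq_submit_sync(struct toy_wq *wq)
{
	size_t waited;

	assert(wq != NULL);
	waited = wq->pending + 1;	/* backlog plus our own command */
	wq->pending = 0;		/* the queue drained while we slept */
	return (waited);
}
```

In this model, a synchronous allocation submitted after 100,000 asynchronous frees waits for 100,001 commands, which is the stall the summary describes.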

Rather than using the work queue to open a TLS session on the NIC, switch to doing the open directly. This allows us to cut in front of all the sessions that are waiting to close, and minimizes the amount of time the web server blocks. The risk is that the NIC may be out of resources because it has not yet processed all of those session frees. So if we fail to open a session directly, we fall back to using the work queue.
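The "try directly, then fall back" control flow might look roughly like the following userland sketch. It is not the committed change: `struct toy_nic` and the `toy_*` functions are hypothetical stand-ins, and the NIC is modelled as a simple counter of free TLS contexts that is only replenished once the queued frees have been processed.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy NIC model (hypothetical, not the mlx5 API). */
struct toy_nic {
	size_t free_slots;	/* TLS contexts the NIC can hand out right now */
	size_t queued_frees;	/* session frees still waiting in the work queue */
};

/* Direct path: issue the open immediately, bypassing the work queue. */
static int
toy_open_direct(struct toy_nic *nic)
{
	if (nic->free_slots == 0)
		return (ENOMEM);	/* queued frees not yet processed */
	nic->free_slots--;
	return (0);
}

/* Slow path: queue the open behind all pending frees and wait for it. */
static int
toy_open_via_wq(struct toy_nic *nic)
{
	/* Draining the FIFO processes every queued free first. */
	nic->free_slots += nic->queued_frees;
	nic->queued_frees = 0;
	return (toy_open_direct(nic));
}

/* Try the direct open; only on failure fall back to the work queue. */
static int
toy_tls_open(struct toy_nic *nic, bool *used_fallback)
{
	int error;

	assert(nic != NULL && used_fallback != NULL);
	*used_fallback = false;
	error = toy_open_direct(nic);
	if (error != 0) {
		*used_fallback = true;
		error = toy_open_via_wq(nic);
	}
	return (error);
}
```

In the common case the caller never touches the queue, so it pays the cost of waiting behind pending frees only when the direct open actually fails.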

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

sys/dev/mlx5/mlx5_en/mlx5_en_hw_tls.c
266

What happens to the KTLS state if an error occurs here? Is it stuck in limbo in some sense?

272

Fix style issue pointed out by Mark

gallatin added inline comments.
sys/dev/mlx5/mlx5_en/mlx5_en_hw_tls.c
266

I believe that mlx5e_sq_tls_xmit() will return an error, preventing anything from being sent. Then the connection will eventually die and the tag will be released.

I don't think this restructuring changes anything WRT error handling. It just makes the lack of error handling / retrying more obvious now.

This revision is now accepted and ready to land. Wed, Oct 23, 4:16 PM
kib added inline comments.
sys/dev/mlx5/mlx5_en/mlx5_en_hw_tls.c
338

Could you also change this to M_WAITOK, in the same or follow-up change?

This revision was automatically updated to reflect the committed changes.