Page MenuHomeFreeBSD

ipsec_accel: kernel infrastructure
ClosedPublic

Authored by kib on Mar 5 2024, 7:27 AM.
Tags
None
Referenced Files
F102872305: D44219.diff
Mon, Nov 18, 6:38 AM
F102848509: D44219.id138319.diff
Sun, Nov 17, 11:13 PM
F102848495: D44219.id135624.diff
Sun, Nov 17, 11:13 PM
F102848493: D44219.id135712.diff
Sun, Nov 17, 11:13 PM
F102848490: D44219.id135675.diff
Sun, Nov 17, 11:13 PM
F102848487: D44219.id135711.diff
Sun, Nov 17, 11:13 PM
F102848477: D44219.id140327.diff
Sun, Nov 17, 11:12 PM
Unknown Object (File)
Fri, Nov 15, 3:16 AM

Details

Summary
Inline IPSEC offload moves almost whole IPSEC processing from the
CPU/MCU and possibly crypto accelerator, to the network card.

The transmitted packet content is not touched by CPU during TX
operations, kernel only does the required policy and security
association lookups to find out that given flow is offloaded, and then
packet is transmitted as plain text to the card. For driver convenience,
a metadata is attached to the packet identifying SA which must process
the packet. Card does encryption of the payload, padding, calculates
authentication, and does the reformat according to the policy.

Similarly, on receive, card does the decapsulation, decryption, and
authentification.  Kernel receives the identifier of SA that was
used to process the packet, together with the plain-text packet.

Overall, payload octets are only read or written by card DMA engine,
removing a lot of memory subsystem overhead, and saving CPU time because
IPSEC algos calculations are avoided.

If driver declares support for inline IPSEC offload (with the
IFCAP2_IPSEC_ACCEL capability set and registering method table struct
if_ipsec_accel_methods), kernel offers the SPD and SAD to driver.

Driver decides which policies and SAs can be offloaded based on
hardware capacity, and acks/nacks each SA for given interface to
kernel.  Kernel needs to keep this information to make a decision to
skip software processing on TX, and to assume processing already done
on RX.  This shadow SPD/SAD database of offloads is rooted from
policies (struct secpolicy accel_ifps, struct ifp_handle_sp) and SAs
(struct secasvar accel_ipfs, struct ifp_handle_sav).

Some extensions to the PF_KEY socket allow to limit interfaces for
which given SP/SA could be offloaded (proposed for offload).  Also,
additional statistics extensions allow to observe allocation/octet/use
counters for specific SA.

Since SPs and SAs are typically instantiated in non-sleepable context,
while offloading them into card is expected to require costly async
manipulations of the card state, calls to the driver for offload and
termination are executed in the threaded taskqueue.  It also solves
the issue of allocating resources needed for the offload database.
Neither ipf_handle_sp nor ipf_handle_sav do not add reference to the
owning SP/SA, the offload must be terminated before last reference is
dropped.  ipsec_accel only adds transient references to ensure safe
pointer ownership by taskqueue.

Maintaining the SA counters for hardware-accelerated packets is the
duty of the driver.  The helper ipsec_accel_drv_sa_lifetime_update()
is provided to hide accel infrastructure from drivers which would use
expected callout to query hardware periodically for updates.

Sponsored by:   NVIDIA networking

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

kib requested review of this revision.Mar 5 2024, 7:27 AM

Since you are touching the TCP code of the base stack, what is with the RACK stack? Doesn't it need the same modifications?

Since you are touching the TCP code of the base stack, what is with the RACK stack? Doesn't it need the same modifications?

I never looked. This jumbo patch is of course split in the repository, and the part that touches TCP stack is tiny, with the only purpose to allow the TSO indicator to pass down from driver up to tcp_output() and then back to driver over ipsec_output(). Perhaps something similar can be done for RACK.

I will extract several 'interesting' patches which tweak code around, so that it is easier to discuss.

Update to integrate more bug fixes.

rscheff added a subscriber: rscheff.

i've read through the TCP related changes, which look good to me. But that's only a minute part of this change. Others should comment on the remaining aspects of this major diff.

sys/netinet/tcp_output.c
913 ↗(On Diff #135624)

Maybe update the assert text depending on the HW IPSEC offload state?

This revision is now accepted and ready to land.Mar 12 2024, 7:18 AM
In D44219#1008810, @kib wrote:

I never looked. This jumbo patch is of course split in the repository,

where can I find the individual bits if there's a split version? It would probably help review a lot if I could go through in smaller junks a time.

In D44219#1010688, @bz wrote:
In D44219#1008810, @kib wrote:

I never looked. This jumbo patch is of course split in the repository,

where can I find the individual bits if there's a split version? It would probably help review a lot if I could go through in smaller junks a time.

D44221 D44222 D44223 D44224 D44225 D44314 D44316

The rest is the monolithic netipsec/ipsec_accel.{h,c} together with hooks to call into it from the strategic places.

sys/netinet/tcp_var.h
43 ↗(On Diff #135624)

Maybe just hide struct ifcap under _KERNEL?

sys/netipsec/ipsec_accel.c
63 ↗(On Diff #135624)

What is the pros on disabling 100% of a file with preprocessor instead of just not compiling it at all with help of make glue?

kib marked 2 inline comments as done.Mar 12 2024, 9:55 PM
kib added inline comments.
sys/netinet/tcp_output.c
913 ↗(On Diff #135624)

But what would be a condition and corresponding error message? It is always ipoptlen - ipsec_optlen != 0, which means that there are options.

sys/netinet/tcp_var.h
43 ↗(On Diff #135624)

I am not sure, and do not see point in investigating if it is used by userspace (header is definitely used).

sys/netipsec/ipsec_accel.c
63 ↗(On Diff #135624)

The big issue is that we have ipsec.ko and it is very ugly to define conditional source for a module Makefile depending on the kernel config.

Also, the file contained some stubs for !IPSEC_ACCEL initially, then it happen to clean.

kib marked 2 inline comments as done.

Rebase; use t_flags2 since the last bit in t_flags was consumed.

This revision now requires review to proceed.Mar 12 2024, 9:56 PM
sys/netinet/tcp_var.h
43 ↗(On Diff #135624)

Well, the header shall not be used. There are some known violators, they also usually define _WANT_INPCB or _WANT_TCPCB. We are aiming towards eliminating its use rather than providing accommodations for its use. So I'd rather remove these lines and I'm pretty sure that buildworld will pass. And that would be enough.

sys/netinet/tcp_var.h
43 ↗(On Diff #135624)

Lets not mix the changes. Moving a struct out of userspace compilation environment is not related to the subject. We can do it later, but I do not want to do it in this scope.

sys/netinet/tcp_var.h
43 ↗(On Diff #135624)

I agree - let's don't mix the changes. But lets make them in order that would not add <stdbool.h> just to be deleted on the next step. Let's do this first:

https://reviews.freebsd.org/D44340

Then your bigger change.

Latest version after the rounds of bugfixes

Update to latest from the dev branch

I went over all of this and related revisions recently. The high level comments are here and the rest will go next to the code they apply to.

  • Terminology: accel vs. offload. FreeBSD seems to use "offload" for hardware assisted capabilities in the ifnet. We should try to be consistent, at least in the user visible parts, as much as possible. For example.
# man ifconfig | grep -ci offload
25
# man ifconfig | grep -ci accel
0
# grep -i offload GENERIC
GENERIC:options 	TCP_OFFLOAD		# TCP offload
GENERIC:options 	KERN_TLS		# TLS transmit & receive offload
# grep -i accel GENERIC
#
  • I think pseudo ifnets (vlan, lagg) can also support this capability but I'm assuming that's outside the scope of this revision as there are no changes to if_vlan/if_lagg here.
  • The new ifnet capability is not documented in ifconfig.8. If this cap can be enabled/disabled on the fly by the user then we'll need some way to do both of these:
    1. Allow existing SA/SP to be claimed by an ipsec_accel capable driver when it first starts or when the cap is first enabled.
    2. Allow removal of live SA/SP from the hw to force processing back into sw. The ifdown handlers in ipsec_accel.c seem to do something similar.
  • The proposed data path uses mbuf tags which are allocated per-packet. If this causes any performance hit for pps workloads then maybe consider a send tag based approach where the tag is allocated once per ifnet per SA/SP. Or maybe piggy back on the existing PACKET_TAG_IPSEC_IN/OUT_DONE tags in some way?
  • I'm sure this was going to soak in for a while before any planned MFC. Can we consider this interface subject to change in main until we have multiple drivers using it? I am experimenting with some if_ callbacks that provide combined SA+SP information and they may be useful for other drivers that do not want to process SA and SP separately.
sys/netipsec/ipsec.h
111

secpolicy is a core netipsec data structure and it would be great to keep its size constant regardless of KERNCONF options.

sys/netipsec/ipsec_accel.c
94 ↗(On Diff #140327)

I don't see IFP_HS_INFLUX getting set anywhere in the code. What is it for? In case it is required and is for sav requests in flux then you might want to rename it to IFP_HS_IN_FLUX ("influx" means something different than "in flux").

266 ↗(On Diff #140327)

Shouldn't we should call ipsec_accel_handle_sav for all error cases? There doesn't seem to be any need to distinguish between EOPNOTSUPP and other errors.

299–304 ↗(On Diff #140327)

This doesn't look right. Any SA will always only be used in one direction, and the direction isn't known here. We could add an "unknown" direction and use that, with the driver free to interpret that as both directions if it wants. But it is incorrect for the infrastructure to explicitly announce that the SA will be used in both directions.

On a somewhat-related note, this if_sa_newkey call is too early to be actionable for some hardware (not mlx5). Specifically, there are devices that require direction, mode (tunnel/transport), and the IP addresses involved, and not all of that information is available here. With the proposed design those drivers will have to wait for the SPs to be added and then internally figure out the information that is not explicitly provided.

364 ↗(On Diff #140327)

IFP_HS_REJECTED is not handled. This routine unconditionally adds the ifp to accel_ifps.

409 ↗(On Diff #140327)

The IFP_HS_INFLUX flag is used here but is not set anywhere. If something outside this file can set these flags then the IFP_HS_ flags need to be in a header and not in this file.

628–629 ↗(On Diff #140327)

ipsec_accel_remember_sp added the ifp to sp->accel_ifps and the global list but if this call fails it isn't removed from there.

sys/netipsec/ipsec_output.c
876

This seems unrelated to IPsec offload infrastructure.

sys/netipsec/key.c
3286

uma_zfree_pcpu can deal with NULL itself.

5524

NULL check unnecessary.

sys/netipsec/keydb.h
191

secasvar is a core netipsec data structure and it would be great to keep its size constant no matter what's in the KERNCONF.

sys/netipsec/udpencap.c
64 ↗(On Diff #140327)

This file seems to have no functional changes.

kib marked 12 inline comments as done.Jul 1 2024, 12:44 PM
kib added inline comments.
sys/netipsec/ipsec_accel.c
94 ↗(On Diff #140327)

This is another remnant of the mode where SA were installed on first packet. Since parallel threads could handle the packet with same SA, parallel installation could occur. First winning thread installed placeholder ifp_handle_sav structure, that was eventually filled with real data, while other threads msleep-ed until the flag is cleared.

266 ↗(On Diff #140327)

I do record that driver does not support this SA offload, so EOPNOTSUPP is handled accordingly.

299–304 ↗(On Diff #140327)

Formally, SA can be created that can be used in any direction, by specifying corresponding networks. In the weirdest case, src == dst == 0/0 would do it.

At the moment where a new key with the MATURE state appearing in kernel, I cannot know which would be a direction for SA use, so I announce both, and mlx5 indeed installs SA into both input and output steering flows.

Are you suggesting to add something like IFP_HS_UKNOWN_DIR? If yes, I prefer to postpone this until the update for driver KPI is worked out.

364 ↗(On Diff #140327)

This is on purpose. I need to record the state of the rejected SA.

Initially, with the SA offload on the first packet mode, that was critical, to avoid scheduling the task on each packet. Now it is useful.

628–629 ↗(On Diff #140327)

It is added with the _REJECTED flag, as intended.

sys/netipsec/ipsec_output.c
876

Well, it is related. Infra refs the SA, but it cannot ref the secasindex that is back-pointed by SA, and that points to secasindex. So if an SA is removed from the tree, but the structure is still kept around by the offload infra reference, we fall into this case of DEAD SA passed to the process_done().

kib marked 6 inline comments as done.

Handle Navdeep' notes:

  • renamed files and user-visible names from ipsec_accel to ipsec_offload. Functon names are left as is, it is too much to rename at this stage.
  • removed _INFLUX
  • removed some NULL checks before free
  • eliminated stray diff in udpencap.c
sys/netipsec/ipsec_accel.c
299–304 ↗(On Diff #140327)

If the direction isn't known then the kernel should announce that it is not known, or simply not provide any direction with the SA. It can't actively provide incorrect information to the drivers by announcing that the SA's direction is both IN and OUT -- it is not, the SA will be used either in IN direction or OUT but not both. Some drivers have direction specific handling and need accurate information from the kernel. Others like mlx5 are always free to ignore the direction and setup the SA the way they prefer.

Also, my reading of specs is that IPsec SAs are unidirectional, period. src = dst = 0 does not seem like a practical use case anyway and certainly not one I'd expect hardware offload to support.

RFC 4301 has this in section 4.1:

An SA is a simplex "connection" that affords security services to the traffic carried by it.
...
To secure typical, bi-directional communication between two IPsec-enabled systems, a pair of SAs (one in each direction) is required.  IKE explicitly creates SA pairs in recognition of this common usage requirement.

The "glossary" in the RFC defines SA:
Security Association (SA)

A simplex (uni-directional) logical connection, created for security purposes.  All traffic traversing an SA is provided the same security processing.  In IPsec, an SA is an Internet-layer  abstraction implemented through the use of AH or ESP.  State data associated with an SA is represented in the SA Database (SAD).
kib marked an inline comment as done.Jul 1 2024, 7:33 PM
kib added inline comments.
sys/netipsec/ipsec_accel.c
299–304 ↗(On Diff #140327)

For less perverted example, assume that two hosts are on 192.168.1/24, want to encap all traffic, and created SA with src == dst == 192.168.1/24.

kib marked an inline comment as done.

Remove direction from new sa driver method.

sys/netipsec/ipsec_accel.c
299–304 ↗(On Diff #140327)

Hm, I'd say if you have such a task "two hosts want encap all traffic between all addresses in 192.168.1/24 network", then you create a Security Policy (SP) that will match all addresses from this network.
And then how it will work later depends from the host configuration:

  1. Your host works as gateway, then you can use Tunnel mode SA, it will encapsulate all matched traffic into IP-IP and encrypt it. And you have two SAs - one for each direction IN and OUT and IP addresses of these SAs will be addresses of outer IP header.
  2. Your host has many IP addresses from this network prefix configured locally. And each IP address has two SA.

So, when you are going to encrypt some packet, you need to select proper SA using src,dst,proto,spi. Then remote host can find proper SA when a packet arrives and it will be able to decrypt it.

545 ↗(On Diff #138319)

I think there were drv_spi of uint16_t, uint32_t, uint64_t types and now u_int :)
Protocols use u32, I think driver probably can use u64. Maybe it would be better to use u32 in general, even if you limit its value?

619 ↗(On Diff #138319)

It seems this printf was moved from the following function.

sys/netipsec/ipsec_output.c
622

It would be good to note that with IPSEC_ACCEL we have only one pass through ipsec_run_hhooks() in if_enc(4). Although it should be obvious...

kib marked 4 inline comments as done.Jul 2 2024, 1:16 PM
kib added inline comments.
sys/netipsec/ipsec_accel.c
545 ↗(On Diff #138319)

It is uint64_t in the ifp_handle_sav out of necessity, because tree.h requires the key to be 64bit. Using u_int there allows to avoid the casts in several places (at least no need to shut up the compilers warnings).

I made it consistent in the infra by using u_int except the place(s) where it must be uint64_t for tree.h.

sys/netipsec/ipsec_output.c
622

There is no documentation for IPSEC_OFFLOAD at all. Eventually it should be written, and your point is noted.

kib marked 2 inline comments as done.

Use u_int for drv_spi

This revision was not accepted when it landed; it landed in state Needs Review.Jul 12 2024, 11:26 AM
This revision was automatically updated to reflect the committed changes.