Inline IPSEC offload moves almost whole IPSEC processing from the CPU/MCU and possibly crypto accelerator, to the network card. The transmitted packet content is not touched by CPU during TX operations, kernel only does the required policy and security association lookups to find out that given flow is offloaded, and then packet is transmitted as plain text to the card. For driver convenience, a metadata is attached to the packet identifying SA which must process the packet. Card does encryption of the payload, padding, calculates authentication, and does the reformat according to the policy. Similarly, on receive, card does the decapsulation, decryption, and authentification. Kernel receives the identifier of SA that was used to process the packet, together with the plain-text packet. Overall, payload octets are only read or written by card DMA engine, removing a lot of memory subsystem overhead, and saving CPU time because IPSEC algos calculations are avoided. If driver declares support for inline IPSEC offload (with the IFCAP2_IPSEC_ACCEL capability set and registering method table struct if_ipsec_accel_methods), kernel offers the SPD and SAD to driver. Driver decides which policies and SAs can be offloaded based on hardware capacity, and acks/nacks each SA for given interface to kernel. Kernel needs to keep this information to make a decision to skip software processing on TX, and to assume processing already done on RX. This shadow SPD/SAD database of offloads is rooted from policies (struct secpolicy accel_ifps, struct ifp_handle_sp) and SAs (struct secasvar accel_ipfs, struct ifp_handle_sav). Some extensions to the PF_KEY socket allow to limit interfaces for which given SP/SA could be offloaded (proposed for offload). Also, additional statistics extensions allow to observe allocation/octet/use counters for specific SA. Since SPs and SAs are typically instantiated in non-sleepable context, while offloading them into card is expected to require costly async manipulations of the card state, calls to the driver for offload and termination are executed in the threaded taskqueue. It also solves the issue of allocating resources needed for the offload database. Neither ipf_handle_sp nor ipf_handle_sav do not add reference to the owning SP/SA, the offload must be terminated before last reference is dropped. ipsec_accel only adds transient references to ensure safe pointer ownership by taskqueue. Maintaining the SA counters for hardware-accelerated packets is the duty of the driver. The helper ipsec_accel_drv_sa_lifetime_update() is provided to hide accel infrastructure from drivers which would use expected callout to query hardware periodically for updates. Sponsored by: NVIDIA networking
Details
- Reviewers
rscheff - Group Reviewers
transport - Commits
- rGef2a572bf6bd: ipsec_offload: kernel infrastructure
Diff Detail
- Repository
- rG FreeBSD src repository
- Lint
Lint Skipped - Unit
Tests Skipped
Event Timeline
Since you are touching the TCP code of the base stack, what is with the RACK stack? Doesn't it need the same modifications?
I never looked. This jumbo patch is of course split in the repository, and the part that touches TCP stack is tiny, with the only purpose to allow the TSO indicator to pass down from driver up to tcp_output() and then back to driver over ipsec_output(). Perhaps something similar can be done for RACK.
I will extract several 'interesting' patches which tweak code around, so that it is easier to discuss.
i've read through the TCP related changes, which look good to me. But that's only a minute part of this change. Others should comment on the remaining aspects of this major diff.
sys/netinet/tcp_output.c | ||
---|---|---|
913 | Maybe update the assert text depending on the HW IPSEC offload state? |
I never looked. This jumbo patch is of course split in the repository,
where can I find the individual bits if there's a split version? It would probably help review a lot if I could go through in smaller junks a time.
sys/netinet/tcp_output.c | ||
---|---|---|
913 | But what would be a condition and corresponding error message? It is always ipoptlen - ipsec_optlen != 0, which means that there are options. | |
sys/netinet/tcp_var.h | ||
43 | I am not sure, and do not see point in investigating if it is used by userspace (header is definitely used). | |
sys/netipsec/ipsec_accel.c | ||
63 | The big issue is that we have ipsec.ko and it is very ugly to define conditional source for a module Makefile depending on the kernel config. Also, the file contained some stubs for !IPSEC_ACCEL initially, then it happen to clean. |
sys/netinet/tcp_var.h | ||
---|---|---|
43 | Well, the header shall not be used. There are some known violators, they also usually define _WANT_INPCB or _WANT_TCPCB. We are aiming towards eliminating its use rather than providing accommodations for its use. So I'd rather remove these lines and I'm pretty sure that buildworld will pass. And that would be enough. |
sys/netinet/tcp_var.h | ||
---|---|---|
43 | Lets not mix the changes. Moving a struct out of userspace compilation environment is not related to the subject. We can do it later, but I do not want to do it in this scope. |
sys/netinet/tcp_var.h | ||
---|---|---|
43 | I agree - let's don't mix the changes. But lets make them in order that would not add <stdbool.h> just to be deleted on the next step. Let's do this first: https://reviews.freebsd.org/D44340 Then your bigger change. |
I went over all of this and related revisions recently. The high level comments are here and the rest will go next to the code they apply to.
- Terminology: accel vs. offload. FreeBSD seems to use "offload" for hardware assisted capabilities in the ifnet. We should try to be consistent, at least in the user visible parts, as much as possible. For example.
# man ifconfig | grep -ci offload 25 # man ifconfig | grep -ci accel 0 # grep -i offload GENERIC GENERIC:options TCP_OFFLOAD # TCP offload GENERIC:options KERN_TLS # TLS transmit & receive offload # grep -i accel GENERIC #
- I think pseudo ifnets (vlan, lagg) can also support this capability but I'm assuming that's outside the scope of this revision as there are no changes to if_vlan/if_lagg here.
- The new ifnet capability is not documented in ifconfig.8. If this cap can be enabled/disabled on the fly by the user then we'll need some way to do both of these:
- Allow existing SA/SP to be claimed by an ipsec_accel capable driver when it first starts or when the cap is first enabled.
- Allow removal of live SA/SP from the hw to force processing back into sw. The ifdown handlers in ipsec_accel.c seem to do something similar.
- The proposed data path uses mbuf tags which are allocated per-packet. If this causes any performance hit for pps workloads then maybe consider a send tag based approach where the tag is allocated once per ifnet per SA/SP. Or maybe piggy back on the existing PACKET_TAG_IPSEC_IN/OUT_DONE tags in some way?
- I'm sure this was going to soak in for a while before any planned MFC. Can we consider this interface subject to change in main until we have multiple drivers using it? I am experimenting with some if_ callbacks that provide combined SA+SP information and they may be useful for other drivers that do not want to process SA and SP separately.
sys/netipsec/ipsec.h | ||
---|---|---|
111 | secpolicy is a core netipsec data structure and it would be great to keep its size constant regardless of KERNCONF options. | |
sys/netipsec/ipsec_accel.c | ||
95 | I don't see IFP_HS_INFLUX getting set anywhere in the code. What is it for? In case it is required and is for sav requests in flux then you might want to rename it to IFP_HS_IN_FLUX ("influx" means something different than "in flux"). | |
267 | Shouldn't we should call ipsec_accel_handle_sav for all error cases? There doesn't seem to be any need to distinguish between EOPNOTSUPP and other errors. | |
300–305 | This doesn't look right. Any SA will always only be used in one direction, and the direction isn't known here. We could add an "unknown" direction and use that, with the driver free to interpret that as both directions if it wants. But it is incorrect for the infrastructure to explicitly announce that the SA will be used in both directions. On a somewhat-related note, this if_sa_newkey call is too early to be actionable for some hardware (not mlx5). Specifically, there are devices that require direction, mode (tunnel/transport), and the IP addresses involved, and not all of that information is available here. With the proposed design those drivers will have to wait for the SPs to be added and then internally figure out the information that is not explicitly provided. | |
365 | IFP_HS_REJECTED is not handled. This routine unconditionally adds the ifp to accel_ifps. | |
410 | The IFP_HS_INFLUX flag is used here but is not set anywhere. If something outside this file can set these flags then the IFP_HS_ flags need to be in a header and not in this file. | |
629–630 | ipsec_accel_remember_sp added the ifp to sp->accel_ifps and the global list but if this call fails it isn't removed from there. | |
sys/netipsec/ipsec_output.c | ||
884 | This seems unrelated to IPsec offload infrastructure. | |
sys/netipsec/key.c | ||
3286 | uma_zfree_pcpu can deal with NULL itself. | |
5525 | NULL check unnecessary. | |
sys/netipsec/keydb.h | ||
191 | secasvar is a core netipsec data structure and it would be great to keep its size constant no matter what's in the KERNCONF. | |
sys/netipsec/udpencap.c | ||
64 | This file seems to have no functional changes. |
sys/netipsec/ipsec_accel.c | ||
---|---|---|
95 | This is another remnant of the mode where SA were installed on first packet. Since parallel threads could handle the packet with same SA, parallel installation could occur. First winning thread installed placeholder ifp_handle_sav structure, that was eventually filled with real data, while other threads msleep-ed until the flag is cleared. | |
267 | I do record that driver does not support this SA offload, so EOPNOTSUPP is handled accordingly. | |
300–305 | Formally, SA can be created that can be used in any direction, by specifying corresponding networks. In the weirdest case, src == dst == 0/0 would do it. At the moment where a new key with the MATURE state appearing in kernel, I cannot know which would be a direction for SA use, so I announce both, and mlx5 indeed installs SA into both input and output steering flows. Are you suggesting to add something like IFP_HS_UKNOWN_DIR? If yes, I prefer to postpone this until the update for driver KPI is worked out. | |
365 | This is on purpose. I need to record the state of the rejected SA. Initially, with the SA offload on the first packet mode, that was critical, to avoid scheduling the task on each packet. Now it is useful. | |
629–630 | It is added with the _REJECTED flag, as intended. | |
sys/netipsec/ipsec_output.c | ||
884 | Well, it is related. Infra refs the SA, but it cannot ref the secasindex that is back-pointed by SA, and that points to secasindex. So if an SA is removed from the tree, but the structure is still kept around by the offload infra reference, we fall into this case of DEAD SA passed to the process_done(). |
Handle Navdeep' notes:
- renamed files and user-visible names from ipsec_accel to ipsec_offload. Functon names are left as is, it is too much to rename at this stage.
- removed _INFLUX
- removed some NULL checks before free
- eliminated stray diff in udpencap.c
sys/netipsec/ipsec_accel.c | ||
---|---|---|
300–305 | If the direction isn't known then the kernel should announce that it is not known, or simply not provide any direction with the SA. It can't actively provide incorrect information to the drivers by announcing that the SA's direction is both IN and OUT -- it is not, the SA will be used either in IN direction or OUT but not both. Some drivers have direction specific handling and need accurate information from the kernel. Others like mlx5 are always free to ignore the direction and setup the SA the way they prefer. Also, my reading of specs is that IPsec SAs are unidirectional, period. src = dst = 0 does not seem like a practical use case anyway and certainly not one I'd expect hardware offload to support. RFC 4301 has this in section 4.1: An SA is a simplex "connection" that affords security services to the traffic carried by it. ... To secure typical, bi-directional communication between two IPsec-enabled systems, a pair of SAs (one in each direction) is required. IKE explicitly creates SA pairs in recognition of this common usage requirement. The "glossary" in the RFC defines SA: A simplex (uni-directional) logical connection, created for security purposes. All traffic traversing an SA is provided the same security processing. In IPsec, an SA is an Internet-layer abstraction implemented through the use of AH or ESP. State data associated with an SA is represented in the SA Database (SAD). |
sys/netipsec/ipsec_accel.c | ||
---|---|---|
300–305 | For less perverted example, assume that two hosts are on 192.168.1/24, want to encap all traffic, and created SA with src == dst == 192.168.1/24. |
sys/netipsec/ipsec_accel.c | ||
---|---|---|
300–305 | Hm, I'd say if you have such a task "two hosts want encap all traffic between all addresses in 192.168.1/24 network", then you create a Security Policy (SP) that will match all addresses from this network.
So, when you are going to encrypt some packet, you need to select proper SA using src,dst,proto,spi. Then remote host can find proper SA when a packet arrives and it will be able to decrypt it. | |
546 | I think there were drv_spi of uint16_t, uint32_t, uint64_t types and now u_int :) | |
620 | It seems this printf was moved from the following function. | |
sys/netipsec/ipsec_output.c | ||
629 | It would be good to note that with IPSEC_ACCEL we have only one pass through ipsec_run_hhooks() in if_enc(4). Although it should be obvious... |
sys/netipsec/ipsec_accel.c | ||
---|---|---|
546 | It is uint64_t in the ifp_handle_sav out of necessity, because tree.h requires the key to be 64bit. Using u_int there allows to avoid the casts in several places (at least no need to shut up the compilers warnings). I made it consistent in the infra by using u_int except the place(s) where it must be uint64_t for tree.h. | |
sys/netipsec/ipsec_output.c | ||
629 | There is no documentation for IPSEC_OFFLOAD at all. Eventually it should be written, and your point is noted. |