Currently armv8crypto copies the scheme used in aesni(9), where payload
data and output buffers are allocated on the fly if the crypto buffer is
not virtually contiguous. This scheme is simple but incurs a lot of
overhead. For an encryption request with a separate output buffer we
have to:
- allocate a temporary buffer to hold the payload
- copy input data into the buffer
- copy the encrypted payload to the output buffer
- zero the temporary buffer before freeing it
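
As a rough userland illustration of the steps above (this is not the
driver code: struct seg, toy_encrypt() and encrypt_bounce() are
hypothetical names, and the XOR "cipher" is only a stand-in for AES):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for one contiguous piece of a crypto buffer. */
struct seg {
	uint8_t	*base;
	size_t	 len;
};

/* Placeholder transform (XOR with a constant). NOT real cryptography. */
static void
toy_encrypt(uint8_t *buf, size_t len)
{
	for (size_t i = 0; i < len; i++)
		buf[i] ^= 0x5a;
}

/*
 * Bounce-buffer scheme: every payload byte is copied twice and zeroed
 * once in addition to being encrypted.  Assumes the input and output
 * segment lists each describe exactly "total" bytes.
 */
static int
encrypt_bounce(const struct seg *in, size_t nin, const struct seg *out,
    size_t nout, size_t total)
{
	uint8_t *tmp;
	size_t off;

	assert(total > 0);
	tmp = malloc(total);			/* 1. allocate */
	if (tmp == NULL)
		return (-1);
	off = 0;
	for (size_t i = 0; i < nin; i++) {	/* 2. copy input in */
		assert(in[i].len <= total - off);
		memcpy(tmp + off, in[i].base, in[i].len);
		off += in[i].len;
	}
	assert(off == total);
	toy_encrypt(tmp, total);
	off = 0;
	for (size_t i = 0; i < nout; i++) {	/* 3. copy result out */
		assert(out[i].len <= total - off);
		memcpy(out[i].base, tmp + off, out[i].len);
		off += out[i].len;
	}
	assert(off == total);
	/* 4. zero before freeing; real code would use explicit_bzero(). */
	memset(tmp, 0, total);
	free(tmp);
	return (0);
}
```

Only the toy_encrypt() call does useful work here; the rest is
allocation and copying proportional to the payload size.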

We have a handy crypto buffer cursor abstraction now, so reimplement the
armv8crypto routines using that instead of temporary buffers. This
introduces some extra complexity, but not a lot. The driver still
allocates an AAD buffer for AES-GCM if necessary.
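
A minimal userland sketch of the cursor idea, under stated
assumptions: struct cursor and its helpers below are hypothetical and
only loosely modeled on the cursor routines in crypto(9), the XOR
transform stands in for AES, and the length is assumed to be a
multiple of the block size. It walks input and output segment by
segment and only bounces a single block through the stack when that
block straddles a segment boundary, which is roughly where the extra
complexity comes from:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define	BLK	16	/* AES block size in bytes */

struct seg { uint8_t *base; size_t len; };

/* Hypothetical cursor: a position within a list of segments. */
struct cursor {
	const struct seg *segs;
	size_t nsegs;
	size_t idx;	/* current segment */
	size_t off;	/* offset within the current segment */
};

/* Return the current position and the contiguous bytes available. */
static uint8_t *
cursor_segment(struct cursor *c, size_t *lenp)
{
	while (c->idx < c->nsegs && c->off == c->segs[c->idx].len) {
		c->idx++;
		c->off = 0;
	}
	if (c->idx == c->nsegs) {
		*lenp = 0;
		return (NULL);
	}
	*lenp = c->segs[c->idx].len - c->off;
	return (c->segs[c->idx].base + c->off);
}

static void
cursor_advance(struct cursor *c, size_t n)
{
	while (n > 0) {
		size_t avail, step;

		(void)cursor_segment(c, &avail);
		assert(avail > 0);	/* caller must not run off the end */
		step = n < avail ? n : avail;
		c->off += step;
		n -= step;
	}
}

/* Copy n bytes between a flat buffer and the cursor, advancing it. */
static void
cursor_copy(struct cursor *c, uint8_t *buf, size_t n, int to_cursor)
{
	while (n > 0) {
		size_t avail, step;
		uint8_t *p = cursor_segment(c, &avail);

		assert(p != NULL);
		step = n < avail ? n : avail;
		if (to_cursor)
			memcpy(p, buf, step);
		else
			memcpy(buf, p, step);
		c->off += step;
		buf += step;
		n -= step;
	}
}

/* Placeholder transform (XOR with a constant). NOT real cryptography. */
static void
toy_encrypt(const uint8_t *in, uint8_t *out, size_t len)
{
	for (size_t i = 0; i < len; i++)
		out[i] = in[i] ^ 0x5a;
}

static void
encrypt_cursors(struct cursor *in, struct cursor *out, size_t total)
{
	uint8_t block[BLK];

	assert(total % BLK == 0);
	while (total > 0) {
		size_t ilen, olen, n;
		uint8_t *ip = cursor_segment(in, &ilen);
		uint8_t *op = cursor_segment(out, &olen);

		n = ilen < olen ? ilen : olen;
		if (n > total)
			n = total;
		n -= n % BLK;		/* whole blocks only */
		if (n > 0) {
			/* Fast path: segment to segment, no copies. */
			toy_encrypt(ip, op, n);
			cursor_advance(in, n);
			cursor_advance(out, n);
		} else {
			/* Block straddles a segment boundary. */
			n = BLK;
			cursor_copy(in, block, BLK, 0);
			toy_encrypt(block, block, BLK);
			cursor_copy(out, block, BLK, 1);
		}
		total -= n;
	}
	/* Real code would use explicit_bzero() here. */
	memset(block, 0, sizeof(block));
}
```

In this sketch nothing is allocated, and at most one block per
segment boundary is copied rather than the whole payload.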

Some profiling of a sendfile+KTLS workload on an Altra indicates that we
spend almost as much CPU time copying and zeroing as we do encrypting.
I am doing some profiling of IPsec on an ESPRESSObin now to see if we
get any improvements or regressions with smaller payloads.