Hey! Are you a protocol enthusiast? A security & privacy engineer or network engineer looking for new insights?
Did you notice that no contiguous zero-copy interface exists on the receive side of modern encrypted transport protocols?
So, let’s nerd out on this a bit.
Hold on. What’s an encrypted transport protocol? We use this term for protocols that transport encrypted upper-layer bytes mixed with encrypted protocol-specific control information, which is not meant to be passed up the stack. Some of these protocols are implemented within kernels (e.g., WireGuard), but most are transport libraries (e.g., QUIC, Tor, OpenVPN, etc.) that give applications an interface to send and receive secure bytes.
There are many QUIC implementations, many VPN designs and implementations, and two Tor implementations, all written by talented and experienced engineers. But not a single one of them provides a contiguous zero-copy receive interface. By contiguous zero-copy, we mean that a layer of abstraction in the networking stack shares a contiguous array of bytes with the upper layer, and that moving bytes from one layer to the other does not involve any copy. A well-known example is io_uring in the Linux kernel, designed to replace classical socket syscalls for reading/writing on a shared contiguous buffer. The result of this engineering is higher efficiency when moving bytes up or down the stack (i.e., lower CPU utilization), at the price of a slightly more complex API flow to manipulate the shared memory space.
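To make the distinction concrete, here is a minimal Python sketch of sharing versus copying between two layers. It is our own illustration and not taken from any of the stacks mentioned above; the function names `pass_up_zero_copy` and `pass_up_with_copy` are hypothetical.

```python
def pass_up_zero_copy(lower_buf: bytearray, start: int) -> memoryview:
    """Share the payload with the upper layer: no bytes are duplicated."""
    return memoryview(lower_buf)[start:]


def pass_up_with_copy(lower_buf: bytearray, start: int) -> bytes:
    """Hand the upper layer its own private copy of the payload."""
    return bytes(lower_buf[start:])


lower_buf = bytearray(b"HDRpayload")  # 3 bytes of header, then the payload
shared = pass_up_zero_copy(lower_buf, 3)
copied = pass_up_with_copy(lower_buf, 3)

# A write to the lower layer's buffer is visible through the shared view only,
# which shows that the view and the buffer are the same memory.
lower_buf[3] = ord("P")
```

After the last line, `shared` reflects the change while `copied` still holds the old bytes. That shared memory is exactly what makes the API flow a little more delicate.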
It turns out that, for modern encrypted transport protocols, given the current interface to symmetric encryption, a similar engineering breakthrough seems impossible to achieve. But we found a way! The goal of this blogpost is to explain why we don’t have contiguous zero-copy in deployed encrypted transports, and what we need to adapt in our current approach to protocol design to make contiguous zero-copy interfaces an option for implementers of encrypted transports. We apply our findings to the QUIC protocol, so let’s have a bit of context before going into the details.
Understanding Context
Encryption was initially designed as an optional pass over the payload of existing transport protocols, such as TCP or UDP. The resulting protocols are usually backward-compatible, and do not mix control and data. More recently, pushing further for privacy, protocols such as QUIC have made encryption mandatory, including for control information.
Cryptography libraries
Writing efficient and secure cryptography code is a daunting task. A number of mistakes have been made in the past, resulting in catastrophic failures for protocol implementations. In the paper accompanying the NaCl cryptography library, Bernstein et al. argue that those problems can be kept under control by offering an atomic interface to cryptographic capabilities that combines different security properties, such as ciphertext integrity and plaintext confidentiality. This prevents non-cryptographers from combining those building blocks insecurely in the hope of squeezing a bit more efficiency out of them. Today, an atomic interface is what most cryptography libraries provide.
As a result, implementers of encrypted transports essentially link one of those libraries and use one of its atomic interfaces to encrypt/decrypt their packet layout, usually in-place for cache efficiency.
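To show what "atomic" means in practice, here is a minimal Python sketch of the shape of such an interface. The names `seal` and `open_` are hypothetical, and the construction (a SHA-256 counter keystream plus an HMAC-SHA-256 tag, both under the same key) is a toy we assume for illustration only: it is not a real library API and must not be used to protect anything. What matters is the contract: one call that either returns the whole plaintext or fails, with no access to unauthenticated bytes.

```python
import hashlib
import hmac

TAG_LEN = 32  # HMAC-SHA-256 tag size in bytes


def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Toy keystream: SHA-256 over key || nonce || counter (illustration only)."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


def seal(key: bytes, nonce: bytes, plaintext: bytes) -> bytes:
    """Encrypt and authenticate in a single step; returns ciphertext || tag."""
    ks = _keystream(key, nonce, len(plaintext))
    ciphertext = bytes(p ^ k for p, k in zip(plaintext, ks))
    tag = hmac.new(key, nonce + ciphertext, hashlib.sha256).digest()
    return ciphertext + tag


def open_(key: bytes, nonce: bytes, sealed: bytes) -> bytes:
    """Verify first, then decrypt. Raises ValueError before releasing plaintext."""
    if len(sealed) < TAG_LEN:
        raise ValueError("input too short")
    ciphertext, tag = sealed[:-TAG_LEN], sealed[-TAG_LEN:]
    expected = hmac.new(key, nonce + ciphertext, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("authentication failed")
    ks = _keystream(key, nonce, len(ciphertext))
    return bytes(c ^ k for c, k in zip(ciphertext, ks))
```

A caller cannot decrypt "just the header" or peek at a length field before the tag has been checked over the whole input. That is the constraint the rest of this post has to work within.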
HTTP/3 and QUIC’s efficiency issues
HTTP/3, defined as a semantic mapping above QUIC’s stream abstraction, seems slow to establish itself as the rightful replacement of HTTP/2, despite its many design enhancements. Reports of (much) lower efficiency are legion, both in the academic world and in industrial integrations of QUIC implementations. Why’s that? Well, there are many factors. Some of them, which we previously explained in a blogpost on APNIC, are essentially caused by UDP; they initially motivated our TCPLS design, and also seem to motivate ongoing work at the IETF on a reduced QUIC-like semantic above TCP, called QUIC on Streams.
However, we believe one of the efficiency issues in QUIC can be solved with a slight redesign of the QUIC information layout and, as a bonus, the fix applies equally to any modern encrypted transport protocol. This problem is central to the protocol design, and fixing it would add to the benefits of other efforts, such as UDP I/O optimizations in QUIC’s case.
Protocol Induced Copies
One problem of modern encrypted transport protocols is that their information layout creates fragments on the receive side, forcing implementations to reassemble the data through copies before they can pass contiguous data to the upper layer. Copies cost CPU time, and may contribute to cache eviction, increasing the effort to use, process or move the data. So let’s see why it happens today:
Data fragmentation leading to copies on the receive side of encrypted transports
Assume you get some fresh (encrypted) bytes from the network. These need to be decrypted, and then reassembled into a contiguous buffer. What most QUIC implementations do is the following: decrypt in place, and then keep track of all data fragments of all streams. When some stream data is in order, the application can proceed to read the stream, which usually involves the QUIC code copying the various fragments into a contiguous array provided by the reader (for a given Stream ID). This copy is fundamentally caused by the data fragments, which themselves exist because control and data information are interleaved and encrypted together. This happens in all encrypted transports today: QUIC, TLS 1.3, Tor, VPNs and whatnot.
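The receive path described above can be sketched in a few lines of Python. This is a simplified model under our own assumptions, not the code of any real QUIC stack: the frame header `FRAME_HDR` (stream ID, offset, length) is a hypothetical fixed-size layout, decryption is assumed to have already happened in place, and a single `Stream` class stands in for the per-stream bookkeeping.

```python
import struct

# Hypothetical fixed-size frame header: stream id (u8), offset (u32), length (u16).
FRAME_HDR = struct.Struct("!BIH")


def parse_frames(packet: memoryview):
    """Walk a decrypted packet left to right: header first, then its data."""
    pos = 0
    while pos < len(packet):
        stream_id, offset, length = FRAME_HDR.unpack_from(packet, pos)
        pos += FRAME_HDR.size
        yield stream_id, offset, packet[pos:pos + length]  # a view, no copy yet
        pos += length


class Stream:
    def __init__(self):
        self.fragments = {}   # stream offset -> view into some packet buffer
        self.read_offset = 0  # next in-order byte the reader expects

    def on_data(self, offset: int, view: memoryview) -> None:
        self.fragments[offset] = view

    def read(self, out: bytearray) -> int:
        """Copy in-order fragments into the reader's contiguous buffer."""
        written = 0
        while self.read_offset in self.fragments and written < len(out):
            frag = self.fragments.pop(self.read_offset)
            n = min(len(frag), len(out) - written)
            out[written:written + n] = frag[:n]  # the protocol-induced copy
            if n < len(frag):  # keep what did not fit for the next read
                self.fragments[self.read_offset + n] = frag[n:]
            self.read_offset += n
            written += n
        return written
```

The data sits in scattered packet buffers, separated by headers, so the only way to give the reader one contiguous array is the slice assignment inside `read`.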
Reverso
We want to avoid fragments. But we have constraints too:
- Protocol goals and features should be unaltered.
- We follow established security practices. This rules out, for example, any approach that tries to optimize out copies by manipulating encrypted bytes with a non-atomic interface.
- We minimize changes to the implementations of existing protocols.
Fragmentation can be solved by adapting any existing protocol using two protocol design principles. Note that in some cases involving simpler protocols than QUIC, only the second principle is required.
Principle 1. The order of the fields within all encrypted control chunks
is reversed. That is, if a chunk defines, say, the following three
elements: type (u8), foo (u16), bar (u64), then this chunk should now be
specified as bar (u64), foo (u16), type (u8).
Principle 2. The order and number of chunks containing data within a
single encryption matter. Regarding the order, the chunk of data itself
must be the first element within the encryption, followed by its control
chunk, which we may call the data footer. Next, we may have as many
control chunks as needed (up to the packet boundary).
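The two principles can be illustrated with a small Python sketch built on the type/foo/bar example above. The helper names are hypothetical, and we assume network byte order inside each field: only the order of the fields changes, not the bytes within a field.

```python
import struct

# Principle 1: same fields, reversed order. The type ends up in the LAST byte.
FORWARD = struct.Struct("!BHQ")   # type (u8), foo (u16), bar (u64)
REVERSED = struct.Struct("!QHB")  # bar (u64), foo (u16), type (u8)


def encode_forward(chunk_type: int, foo: int, bar: int) -> bytes:
    return FORWARD.pack(chunk_type, foo, bar)


def encode_reversed(chunk_type: int, foo: int, bar: int) -> bytes:
    return REVERSED.pack(bar, foo, chunk_type)


def decode_reversed_ending_at(buf: bytes, end: int):
    """Decode a reversed chunk whose last byte is buf[end - 1].

    Returns (chunk_type, foo, bar, start), where start is the index at which
    the chunk begins, i.e. where the previous element ends.
    """
    start = end - REVERSED.size
    bar, foo, chunk_type = REVERSED.unpack_from(buf, start)
    return chunk_type, foo, bar, start


def build_payload(data: bytes, data_footer: bytes, control_chunks=()) -> bytes:
    """Principle 2: data first, then its footer, then other control chunks."""
    return data + data_footer + b"".join(control_chunks)


# Here the example chunk plays the role of the data footer.
footer = encode_reversed(0x08, 2, 3)
payload = build_payload(b"stream bytes", footer)
```

With this layout, a reader that knows where the payload ends finds the chunk type in the last byte, decodes the chunk, and learns where the element before it ends.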
Following those two principles allows implementers to exploit the inherent copy happening during decryption to reassemble the packets. That is, instead of decrypting in-place, we decrypt into the application buffer at the position right after the previously decrypted data, overwriting the control information of the previously decrypted packet.
Exploiting the inherent copy in symmetric decryption to perform message reassembly
This works if, instead of processing information from left to right as usual, we process it from right to left. In practice, the size of the encryption is known, so we can jump to the packet boundary and start processing control information backward. No more copies needed!
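Here is a minimal Python sketch of that receive path, under loudly stated assumptions: the chunk formats `FOOTER` and `ACK` and their type values are hypothetical, `toy_crypt_into` is an XOR stand-in for out-of-place authenticated decryption (it provides no security), and the receiver only handles a single stream whose packets arrive in order. Real QUIC needs more machinery, which the paper covers.

```python
import struct

# Hypothetical chunk formats, written with reversed field order (type last).
FOOTER = struct.Struct("!IHB")  # stream offset (u32), data length (u16), type (u8)
ACK = struct.Struct("!IB")      # largest acked (u32), type (u8)
TYPE_DATA_FOOTER = 0x01
TYPE_ACK = 0x02


def toy_crypt_into(src: bytes, dst: memoryview) -> None:
    """Stand-in for out-of-place AEAD decryption: XOR is NOT real encryption."""
    for i, b in enumerate(src):
        dst[i] = b ^ 0x5A


def build_packet(offset: int, data: bytes, acks=()) -> bytes:
    """Sender side: data first, then its footer, then other control chunks."""
    plain = data + FOOTER.pack(offset, len(data), TYPE_DATA_FOOTER)
    for acked in acks:
        plain += ACK.pack(acked, TYPE_ACK)
    out = bytearray(len(plain))
    toy_crypt_into(plain, memoryview(out))
    return bytes(out)


class ZeroCopyReceiver:
    def __init__(self, capacity: int):
        self.buf = bytearray(capacity)  # the application's contiguous buffer
        self.filled = 0                 # contiguous stream bytes available

    def on_packet(self, encrypted: bytes) -> list:
        start, end = self.filled, self.filled + len(encrypted)
        if end > len(self.buf):
            raise BufferError("application buffer too small")
        # Decrypt right after the previous data: this overwrites the control
        # information left behind by the previous packet.
        toy_crypt_into(encrypted, memoryview(self.buf)[start:end])
        acks, pos = [], end
        while True:  # walk the control chunks from right to left
            if pos <= start:
                raise ValueError("no data footer found")
            ctype = self.buf[pos - 1]
            if ctype == TYPE_ACK and pos - ACK.size >= start:
                pos -= ACK.size
                acks.append(ACK.unpack_from(self.buf, pos)[0])
            elif ctype == TYPE_DATA_FOOTER and pos - FOOTER.size >= start:
                pos -= FOOTER.size
                offset, length, _ = FOOTER.unpack_from(self.buf, pos)
                break
            else:
                raise ValueError("malformed control chunk")
        if offset != start or pos - length != start:
            raise ValueError("sketch only handles in-order, single-stream data")
        self.filled = start + length  # data is already in place: no copy
        return acks

    def readable(self) -> memoryview:
        """Contiguous view of everything received so far."""
        return memoryview(self.buf)[:self.filled]
```

Note that the buffer needs some slack after the data to hold the control chunks of the most recent packet. Only the first `filled` bytes are ever exposed to the reader.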
Reverso in QUIC
QUIC is a complex protocol, and interested readers will find the many details of applying Reverso to QUIC in the research paper. In summary, much of the complexity of applying this idea to QUIC comes from the fact that packets can be decrypted in any order, and from the fact that QUIC is a multistream protocol able to multiplex stream data within the same packet. We don’t break those capabilities: we can still use multiplexing, but some of the multiplexed data won’t benefit from contiguous zero-copy. As far as our experience can tell, multiplexing provides little to no benefit for a lot of added complexity. Reverso provides further incentives to avoid it.
HTTP/3 with QUIC VReverso Microbenchmark
We wrote a new QUIC extension, coined VReverso, that provides a contiguous zero-copy interface by applying the insights above. We also wrote HTTP/3 client and server processing logic that makes use of QUIC VReverso’s zero-copy. The following microbenchmark captures the cost of packet processing for a 2.5 MiB page split into 80 streams, and compares it across different processors and compilation options.
Microbenchmark capturing HTTP/3 processing speed for QUIC v1 and QUIC VReverso
This experiment gives you the exact cost of memory copies for QUIC v1 compared to QUIC VReverso while using HTTP/3 with the Cloudflare quiche implementation (about 38% in the receive code path). This is close to the cost of encryption, and can be fully eliminated. Note that this cost varies from one implementation to another, depending on the choice of architecture and API. quiche, by going forward with an elegant and easy-to-use API on a system-independent architecture for QUIC v1, pays the price of two copies for all data. Different choices lead to different numbers of copies in existing implementations, with a bare minimum of 1 data copy. However, with VReverso, we can keep the elegant API and portability benefits of a system-independent architecture and have no copies! How cute!
In conclusion, we observe that encrypted protocols cannot benefit from contiguous zero-copy because of data fragmentation, which results from the combination of the current protocol layout approach (i.e., header-then-data, and reading from left to right) and the current atomic interface to symmetric (authenticated) encryption. We suggest a change to the protocol layout, following two key principles, to resolve the problem. This can be incrementally deployed in an existing protocol such as QUIC, since supporting the new wire format requires only light modifications to an existing implementation (a few hundred lines at most), and the protocol supports version negotiation. Making use of the new format to implement a contiguous zero-copy interface would then be an option for each QUIC stack. The implementation in quiceh, a fork of Cloudflare’s quiche QUIC stack, does not add complexity to the initial API either, and a similar result should be expected for any other QUIC project.
More details, insights, and experiments are available in the paper for any interested reader.
Bonus: Protocol and the notion of Privacy
Privacy is certainly a complex notion that expresses itself very differently depending on the context. In protocol design, we may define Privacy as the ability of endpoints following the protocol to express themselves without arbitrary interference from the medium transporting the protocol information. Therefore, if the design of some protocol successfully leads to its ability to express itself (e.g., through extensions) without arbitrary interference, we may say that the protocol design has Privacy.
That may sound strange, but bear with me. In the case of QUIC, Privacy is gained from a set of properties and features: 1) the Frame extensibility allowing up to 2^62 different frame types, 2) unpredictable randomization of the control information obtained from modern cryptography, and 3) confidentiality obtained from cryptography.
Together, these properties remove incentives for the medium to interfere with the protocol, and guarantee endpoints some level of extensibility. So indeed, the QUIC design has some level of Privacy. But careful, this does not mean that the protocol is Privacy-preserving from the user’s perspective. This is a different context! Nonetheless, from the protocol standpoint, being able to resist arbitrary interference is a basic requirement to eventually support building Privacy-preserving systems for users. This is a breakthrough for a general-purpose transport protocol, and definitely one of the great aspects of QUIC. This is one of the reasons we use QUIC to demonstrate our research, and the reason why we can spin up clients and servers on the Internet, and experiment with our new QUIC version and protocol format without any issue.
The notion of Privacy is fundamental for a healthy and functioning society, as we may read in the Universal Declaration of Human Rights (art. 12). Is it surprising to find it equally fundamental for a functioning society made of machines (from a protocol standpoint)? The only difference, in my opinion, is that machines don’t have a right to Privacy. For now... Until a real breakthrough happens in Machine Learning research, but let’s keep nightmares for the night.
Bonus2: this blogpost is accessible with VReverso!
The TCP/IP webserver is configured to send an HTTP Alt-Svc header indicating that this page may be fetched on port 4433 using HTTP/3. If you refresh on a modern browser, it should negotiate a QUIC connection and use HTTP/3 on QUIC v1.
If you want to try out HTTP/3 with VReverso, you may clone and compile quiceh:
git clone --recursive https://github.com/frochet/quiceh.git
cd quiceh
cargo build --release
./target/release/quiceh-client --wire-version 00791097 -- https://reverso.info.unamur.be:4433
The current wire code for this QUIC protocol is 0x00791097. This command will negotiate a QUIC connection using VReverso, then send a GET request and eventually write the GET response to stdout.