In the early days of QUIC, many people pointed out that the UDP stack has had far far less optimization put into it than the TCP stack. Sure enough, some of the issues identified here arise because the UDP stack isn't doing things that it could do but that nobody has been motivated to make it do, such as UDP generic receive offload. Papers like this are very likely to lead to optimizations both obvious and subtle.
What is UDP offload going to do? UDP barely does anything but queue and copy.
Linux scheduling from packet-received to thread-has-control is not real-time, and if the CPUs are busy it may be rather slow. That's probably part of the bottleneck.
The embarrassing thing is that QUIC, even in Google's own benchmarks, only improved performance by about 10%. The added complexity probably isn't worth the trouble. However, it gave Google control of more of the stack, which may have been the real motivation.
Last I looked (several months ago), Linux's UDP stack did not seem well tuned in its memory management accounting.
For background: the userspace mental model of receiving network data is almost completely backwards compared to how general-purpose kernel network receive actually works. User code thinks it allocates a buffer (per-socket, or perhaps a fancier io_uring scheme), then receives packets into that buffer, then processes them.
The kernel is the other way around. The kernel allocates buffers and feeds pointers to those buffers to the NIC. The NIC receives packets and DMAs them into the buffers, then tells the kernel. But the NIC and the kernel have absolutely no concept of which socket those buffers belong to until after the packets have been DMAed into them, so the kernel cannot possibly map received packets to the actual recipient's memory. Instead, after identifying who owns a received packet, the kernel retroactively charges the recipient for the memory. This happens on a per-packet basis; it involves per-socket and cgroup accounting, and there is no support for having a socket "pre-allocate" this memory in advance of receiving a packet. So the accounting is gnarly, involves atomic operations, and seems quite unlikely to win any speed awards. On a very cursory inspection, the TCP code seemed better tuned, and it possibly also won by generally handling more bytes per operation.
Keep in mind that the kernel can't copy data to application memory synchronously -- the application memory might be paged out when a packet shows up. So instead the whole charging dance above happens immediately when a packet is received, and the data is copied later on.
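To make the shape of that per-packet charging concrete, here is a purely illustrative sketch — hypothetical names and structure, not the actual kernel code — of what "retroactively charge the recipient" amounts to: a couple of atomic read-modify-writes against per-socket and per-cgroup counters, done only after the packet has already been DMAed and demuxed, with the drop decision coming after the work has already been spent:

    /* Purely illustrative: a simplified stand-in for per-packet receive
     * accounting. Names and structure are hypothetical, not kernel code. */
    #include <stdatomic.h>
    #include <stdbool.h>

    struct cgroup_mem { atomic_long usage; long limit; };

    struct sock_rx {
        atomic_long        rmem_alloc;   /* bytes charged to this socket */
        long               rcvbuf;       /* per-socket receive budget    */
        struct cgroup_mem *memcg;        /* cgroup the owner belongs to  */
    };

    /* Called only AFTER the packet has been DMAed and demuxed to a socket:
     * the charge is retroactive, at least one atomic RMW per packet.     */
    static bool charge_rx(struct sock_rx *sk, long bytes)
    {
        if (atomic_fetch_add(&sk->rmem_alloc, bytes) + bytes > sk->rcvbuf)
            goto uncharge_sock;                   /* over socket budget   */

        if (atomic_fetch_add(&sk->memcg->usage, bytes) + bytes
                > sk->memcg->limit)
            goto uncharge_all;                    /* over cgroup budget   */

        return true;                              /* keep the packet      */

    uncharge_all:
        atomic_fetch_sub(&sk->memcg->usage, bytes);
    uncharge_sock:
        atomic_fetch_sub(&sk->rmem_alloc, bytes);
        return false;   /* drop: the buffer was already filled, the work
                           already done */
    }

A couple of contended atomics per packet, paid after the fact, is exactly the kind of cost that TCP tends to amortize over bigger chunks and that a small-datagram UDP/QUIC workload pays on every single packet.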
For quite a long time, I've thought it would be nifty if there was a NIC that kept received data in its own RAM and then allowed it to be efficiently DMAed to application memory when the application was ready for it. In essence, a lot of the accounting and memory management logic could move out of the kernel into the NIC. I'm not aware of anyone doing this.
> For quite a long time, I've thought it would be nifty if there was a NIC that kept received data in its own RAM and then allowed it to be efficiently DMAed to application memory when the application was ready for it.
I wonder if we could do a more advanced version of receive-packet steering that sufficiently identifies packets as definitely for a given process and DMAs them directly to that process's pre-provided buffers for later notification? In particular, can we offload enough information to a smart NIC that it can identify where something should be DMAed to?
Most advanced NICs support flow steering, which makes the NIC write to different buffers depending on the target port.
In practice, though, you only have a limited number of these buffers, and it causes complications if multiple processes need to consume the same multicast.
Multicast may well be shitcanned to an expensive slow path, given that multicast is rarely used for high bandwidth scenarios, especially when multiple processes need to receive the same packet.
With multiple processes listening for the data? I think that's a market niche.
In terms of billions of devices, multicast is mostly used for zero-config service discovery. I am not saying there isn't a market for high-bandwidth multicast; I am saying that for the vast majority of software deployments, multicast performance is not an issue. The deployments where it is an issue can specialize. And, as the sibling comment mentions, people who need breakneck speeds have already proven that they can create a market for themselves.
I don’t think the result would be compatible with the socket or io_uring API, but maybe io_uring could be extended a bit. Basically the kernel would opportunistically program a “flow director” or similar rule to send packets to special rx queue, and that queue would point to (pinned) application memory. Getting this to be compatible with iptables/nftables would be a mess or maybe entirely impossible.
I’ve never seen the accelerated steering stuff work well in practice, sadly. The code is messy, the diagnostics are basically nonexistent, and it’s not clear to me that many drivers support it well.
Of course you're going to get horrible latency because of speed-of-light limitations, so the definition of "work" may be weak, but data should still get through.
GPUDirect relies on the PeerDirect extensions for RDMA and is thus an extension to the RDMA verbs, not a separate and independent thing that works without RDMA.
You can read/write GPU buffers with gpudev in DPDK, yes. It also uses some of the infrastructure that powers GPUDirect (namely the page pinning and address translation). Because that memory is addressable, you can use it for DPDK buffer steering, have the NIC DMA to/from the GPU, and then have a GPU kernel coordinate with your DPDK application. This will be pretty fast on a good lossless datacentre network but probably pretty awful over the Internet. In the DC, though, it will naturally be beaten by real GPUDirect on RDMA, as you don't need the DPDK coordinator and all tx/rx can be driven by the GPU kernel instead.
This isn't GPUDirect though, that is an actual product.
This is GPUDirect. GPUDirect is the technology that enables any third-party device to talk to a GPU (like a NIC).
> but probably pretty awful over the Internet. In the DC though it will be beaten by real GPUDirect on RDMA naturally
It's being used in many places successfully over the internet. RDMA is fine, but it completely breaks the abstraction of services. In many places you do not want to know who is sending, or which address to send to or receive from.
Why don't we eliminate the initial step of an app reserving a buffer, keep each packet in its own buffer, and, once the socket it belongs to is identified, hand a pointer to (and ownership of) that buffer to the app? If buffers are of a fixed (max) size, you could still allow the NIC to fill a bunch of them in one go.
Presuming that this is a server that has One (public) Job, couldn't you:
1. dedicate a NIC to the application;
2. and have the userland app open a packet socket against the NIC, to drink from its firehose through a receive ring mmap'd from the kernel's own buffers;
...all without involving the kernel TCP/IP (or in this case, UDP/IP) stack, and any of the accounting logic squirreled away in there?
(You can also throw in a BPF filter here, to drop everything except UDP packets with the expected ip:port — but if you're already doing more packet validation at the app level, you may as well just take the whole firehose of packets and validate that they're targeted at the app at the same time that they're validated for their L7 structure.)
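For the curious, a rough sketch of that setup — an AF_PACKET socket with a shared-memory RX ring, bound to one NIC, with everything else left to the normal kernel paths. The interface name and ring sizes are placeholders, and error handling is omitted:

    /* Sketch: drink one NIC's firehose from userspace via an AF_PACKET
     * socket with a TPACKET_V3 RX ring. "eth1" is a placeholder name.   */
    #include <arpa/inet.h>
    #include <linux/if_ether.h>
    #include <linux/if_packet.h>
    #include <net/if.h>
    #include <sys/mman.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));

        int ver = TPACKET_V3;
        setsockopt(fd, SOL_PACKET, PACKET_VERSION, &ver, sizeof(ver));

        /* Ask for a ring of 64 blocks x 1 MiB that the kernel fills with
         * received frames; we read them with no per-packet syscalls.     */
        struct tpacket_req3 req = {
            .tp_block_size     = 1 << 20,
            .tp_block_nr       = 64,
            .tp_frame_size     = 2048,
            .tp_frame_nr       = (1 << 20) / 2048 * 64,
            .tp_retire_blk_tov = 60,  /* ms before a partial block retires */
        };
        setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &req, sizeof(req));

        void *ring = mmap(NULL, (size_t)req.tp_block_size * req.tp_block_nr,
                          PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

        /* Bind to one interface so we only see that NIC's traffic.       */
        struct sockaddr_ll ll = {
            .sll_family   = AF_PACKET,
            .sll_protocol = htons(ETH_P_ALL),
            .sll_ifindex  = if_nametoindex("eth1"),   /* placeholder NIC  */
        };
        bind(fd, (struct sockaddr *)&ll, sizeof(ll));

        /* A classic BPF filter (SO_ATTACH_FILTER) could be attached here
         * to drop everything except UDP to the expected port before it
         * ever reaches the ring. From here: poll(fd), walk the blocks in
         * `ring`, validate/parse in userspace, and hand each block back
         * to the kernel by clearing its status word.                     */
        (void)ring;
        close(fd);
        return 0;
    }

This avoids per-packet syscalls, but the kernel still copies each frame out of its own buffers into the ring — it's batching plus a shorter path, not a true kernel bypass.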
I think DPDK does something like this. The NIC is programmed to aim the packets in question at a specific hardware receive queue, and that queue is entirely owned by a userspace program.
A lot of high end NICs support moderately complex receive queue selection rules.
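Very roughly, that receive path looks like this in DPDK (port number, pool sizes, and queue depth are arbitrary, and a real application needs rather more setup):

    /* Sketch: a userspace process that owns NIC port 0's queues outright
     * and polls them directly. Values are arbitrary; setup is minimal.  */
    #include <rte_eal.h>
    #include <rte_ethdev.h>
    #include <rte_mbuf.h>

    #define BURST 32

    int main(int argc, char **argv)
    {
        rte_eal_init(argc, argv);

        struct rte_mempool *pool = rte_pktmbuf_pool_create(
            "rx_pool", 8192, 256, 0, RTE_MBUF_DEFAULT_BUF_SIZE,
            rte_socket_id());

        /* One RX queue and one TX queue with default device config.     */
        struct rte_eth_conf conf = {0};
        uint16_t port = 0;
        rte_eth_dev_configure(port, 1, 1, &conf);
        rte_eth_rx_queue_setup(port, 0, 1024,
                               rte_eth_dev_socket_id(port), NULL, pool);
        rte_eth_tx_queue_setup(port, 0, 1024,
                               rte_eth_dev_socket_id(port), NULL);
        rte_eth_dev_start(port);

        /* Poll the queue from userspace: no interrupts, no syscalls, no
         * kernel accounting anywhere on this path.                      */
        for (;;) {
            struct rte_mbuf *bufs[BURST];
            uint16_t n = rte_eth_rx_burst(port, 0, bufs, BURST);
            for (uint16_t i = 0; i < n; i++) {
                /* ... parse / steer / process the packet here ...       */
                rte_pktmbuf_free(bufs[i]);
            }
        }
        return 0;
    }

All the buffer management happens in the application's own memory pool, so none of the socket/cgroup accounting described elsewhere in this thread applies to these packets.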
I mean, under the scheme I outlined, the kernel is still going to do that by default. It's not that the NIC's driver is overridden or anything; the kernel would still be reading the receive buffer from this NIC and triggering per-packet handling — and thus triggering default kernel response handling where applicable (and so responding correctly to e.g. ICMP and ARP messages.)
The only thing that's different here is that there are no active TCP or UDP listening sockets bound to the NIC — so when the kernel is scanning the receive buffer to decide what to do with packets, and it sees a TCP or UDP packet, it's going to look at its connection-state table for that protocol+interface, realize it's empty, and drop the packet for lack of a consumer, rather than applying any further logic to it. (It'll bump the "dropped packets" counter, I suppose, but that's it.)
But, since there is a packet socket open against the NIC, then before it does anything with the packet, it's going to copy every packet it receives into that packet socket's (userspace-shared) receive-buffer mmap region.
- 64 packets per syscall, which is enough data to amortize the syscall overhead - a single packet is not.
- UDP offload optionally lets you defer checksum computation, often offloading it to hardware.
- UDP offload lets you skip/reuse route lookups for subsequent packets in a bundle.
What UDP offload is no good for though, is large scale servers - the current APIs only work when the incoming packet chains neatly organize into batches per peer socket. If you have many thousands of active sockets you’ll stop having full bundles and the overhead starts sneaking back in. As I said in another thread, we really need a replacement for the BSD APIs here, they just don’t scale for modern hardware constraints and software needs - much too expensive per packet.
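For concreteness, here is a minimal sketch of what the batching looks like from userspace: recvmmsg() for the "many packets per syscall" part, plus the UDP_GRO socket option so the kernel may hand back coalesced buffers where the NIC/driver support it. Port and buffer sizes are arbitrary:

    /* Sketch: batched UDP receive. recvmmsg() drains up to 64 datagrams
     * per syscall; UDP_GRO lets the kernel coalesce consecutive datagrams
     * of one flow into a single larger buffer. Port 4433 is arbitrary.  */
    #define _GNU_SOURCE
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <netinet/udp.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    #ifndef UDP_GRO
    #define UDP_GRO 104            /* from linux/udp.h on older headers  */
    #endif

    #define BATCH 64
    #define BUFSZ 65536            /* room for a coalesced GRO payload   */

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);

        int on = 1;
        setsockopt(fd, IPPROTO_UDP, UDP_GRO, &on, sizeof(on));

        struct sockaddr_in addr = {
            .sin_family = AF_INET,
            .sin_port   = htons(4433),
        };
        bind(fd, (struct sockaddr *)&addr, sizeof(addr));

        static char bufs[BATCH][BUFSZ];
        struct iovec iov[BATCH];
        struct mmsghdr msgs[BATCH];
        memset(msgs, 0, sizeof(msgs));
        for (int i = 0; i < BATCH; i++) {
            iov[i].iov_base            = bufs[i];
            iov[i].iov_len             = BUFSZ;
            msgs[i].msg_hdr.msg_iov    = &iov[i];
            msgs[i].msg_hdr.msg_iovlen = 1;
        }

        for (;;) {
            /* One syscall, up to 64 datagrams (or GRO bundles). Real code
             * would also read the UDP_GRO cmsg to learn the segment size
             * of each coalesced buffer.                                 */
            int n = recvmmsg(fd, msgs, BATCH, 0, NULL);
            for (int i = 0; i < n; i++)
                printf("datagram %d: %u bytes\n", i, msgs[i].msg_len);
        }
    }

The oversized buffers are what make GRO pay off: a coalesced bundle arrives as one large "datagram" rather than dozens of MTU-sized ones.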
In my head the main benefit of QUIC was always multipath, aka the ability to switch interfaces on demand without losing the connection. There's MPTCP but who knows how viable it is.
I always thought the main benefit of QUIC was to encrypt the important part of the transport header, so endpoints control their own destiny, not some middle device.
If I had a dollar for every firewall vendor who thought dropping TCP retransmissions or TCP Reset was a good idea...
It requires explicit backend support, and Apple supports it for many of their services, but I've never seen another public API that does. Anyone have any examples?
Last I looked into this (many years ago), ELBs/GLBs didn't support it on AWS/GCP respectively. That prevented us from considering it further at the time (mobile app -> AWS-hosted EC2 instances behind an ELB).
Not sure if that's changed, but at the time it wasn't worth having to consider rolling our own LBs.
To answer your original question, no, I haven't (knowingly) seen it on any public APIs.
Among other things, GRO (generic receive offload) means you can get more data off of the network card in fewer operations.
Linux has receive packet steering, which can help with getting packets from the network card to the right CPU and the right userspace thread without moving from one CPU's cache to another.
You mean Receive Flow Steering, and RFS can only control RPS, so to do it in hardware you actually mean Accelerated RFS (which requires a pretty fancy NIC these days).
Even ignoring the hardware requirement, unfortunately it's not that simple. I find results vary wildly on whether you should put the process and the softirq on the same CPU core (sharing L1 and L2) or just on the same CPU socket (sharing L3, but not constantly blowing out L1/L2).
Eric Dumazet said years ago at a Netdev.conf that L1 cache sizes have really not kept up with reality. That matches my experience.
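For concreteness, the placement being compared there is roughly this (the IRQ number, CPU numbers, and device are hypothetical; find the real IRQ for your RX queue in /proc/interrupts, and note irqbalance may later move it):

    /* Sketch: pin an RX queue's IRQ (and hence its softirq work) to one
     * core and the consuming thread to either the same core or another
     * core on the same socket, to compare L1/L2 sharing vs L3-only.     */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    static int pin_thread_to_cpu(int cpu)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }

    /* Write the IRQ's smp_affinity_list to move the interrupt (and the
     * softirq work it triggers) onto the chosen core.                   */
    static int pin_irq_to_cpu(int irq, int cpu)
    {
        char path[64];
        snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity_list", irq);
        FILE *f = fopen(path, "w");
        if (!f)
            return -1;
        fprintf(f, "%d\n", cpu);
        return fclose(f);
    }

    int main(void)
    {
        pin_irq_to_cpu(63, 2);      /* hypothetical IRQ 63 -> core 2      */
        pin_thread_to_cpu(2);       /* variant A: same core (share L1/L2) */
        /* pin_thread_to_cpu(4); */ /* variant B: same socket, different
                                       core (share only L3)               */
        /* ... run the receive loop and measure both variants ...         */
        return 0;
    }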
QUIC doing so much in userspace adds another class of application with a so-far-uncommon design pattern.
I don't think it's possible to say whether any QUIC application benefits from RFS or not.
Handling ACK packets in kernelspace would be one thing — helping, for example, with RTT estimation. With a userspace stack, ACKs are handled in the application and are subject to the scheduler, so they suffer a lot on a loaded system.
There are no ACKs inherent in the UDP protocol, so "UDP offload" is not where the savings are.
There are ACKs in the QUIC protocol and they are carried by UDP datagrams which need to make their way up to user land to be processed, and this is the crux of the issue.
What is needed is for QUIC offload to be invented/supported by hardware, so that most of the high-frequency/tiny-packet processing happens there, just as it does today for TCP offload. TCP large-send and large-receive offload is what is responsible for all the CPU savings: the application deals in 64 KB or larger sends and receives, and the segmentation and receive coalescing all happen in hardware before an interrupt is even generated to involve the kernel, let alone userland.
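The closest thing that exists today is on the send side: UDP GSO via the UDP_SEGMENT socket option, where the application hands the kernel one large buffer plus a segment size and segmentation happens late in the stack or, on NICs that support it, in hardware. A hedged sketch — destination address, port, and sizes are arbitrary:

    /* Sketch: UDP GSO send. One sendto() of ~48 KB is cut into 1200-byte
     * datagrams late in the stack (or on a capable NIC), the send-side
     * cousin of the large-send offload described above. The destination
     * 192.0.2.1:4433 and the sizes are arbitrary examples.              */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <netinet/udp.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    #ifndef UDP_SEGMENT
    #define UDP_SEGMENT 103        /* from linux/udp.h on older headers  */
    #endif

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);

        /* Every send on this socket gets segmented into 1200-byte
         * datagrams by the kernel or the NIC.                           */
        int gso_size = 1200;
        setsockopt(fd, IPPROTO_UDP, UDP_SEGMENT, &gso_size, sizeof(gso_size));

        struct sockaddr_in dst = { .sin_family = AF_INET,
                                   .sin_port   = htons(4433) };
        inet_pton(AF_INET, "192.0.2.1", &dst.sin_addr);

        static char payload[48 * 1024];      /* one big application write */
        memset(payload, 0, sizeof(payload));
        sendto(fd, payload, sizeof(payload), 0,
               (struct sockaddr *)&dst, sizeof(dst));

        close(fd);
        return 0;
    }

The protocol-state part — ACK processing, loss detection — still has no hardware home, which is the gap being described.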
Bulk throughput isn't on par with TCP+TLS, mainly because NICs with dedicated hardware for QUIC offload aren't commercially available (yet). Latency is undoubtedly better — the 1-RTT QUIC handshake substantially reduces time-to-first-byte compared to TCP+TLS.
I think one of the original drivers was the ability to quickly tweak parameters, after Linux rejected what I think was userspace adjustment of window sizing to be more aggressive than the default.
The Linux maintainers didn't want to be responsible for congestion collapse, but UDP lets you spray packets from userspace, so Google went with that.
The solution isn't more UDP offload optimization: there aren't any semantics in UDP that are expensive, other than the sheer quantity and frequency of datagrams to be processed in the context of the QUIC protocol that uses UDP as its transport. QUIC's state machine needs to see every UDP datagram carrying QUIC protocol messages in order to move forward. Just as was done for TCP offload more than twenty years ago, portions of QUIC state need to move to, and be maintained in, hardware to keep the host from having to see so many high-frequency, tiny state-update messages.
Unless I'm missing something here, pretty much any Intel NIC released in the past decade should support TCP offload. I imagine the same is true for Broadcom and other vendors as well, but I don't have anything handy to check.
> Which end-user network cards that I can buy can do TCP offloading?
Intel's I210 controllers support offloading:
> Other performance-enhancing features include IPv4 and IPv6 checksum offload, TCP/UDP checksum offload, extended Tx descriptors for more offload capabilities, up to 256 KB TCP segmentation (TSO v2), header splitting, 40 KB packet buffer size, and 9.5 KB Jumbo Frame support.
Practically every on-board network adapter I've had for over a decade has had TCP offload support. Even the network adapter on my cheap $300 Walmart laptop has hardware TCP offload support.
The whole reason QUIC even exists in user space is because its developers were trying to hack a quick speed-up to HTTP rather than actually do the work to improve the underlying networking fundamentals. In this case the practicalities seem to have caught them out.
If you want to build a better TCP, do it. But hacking one in on top of UDP was a cheat that didn’t pay off. Well, assuming performance was even the actual goal.
It already exists; it's called SCTP. It doesn't work over the Internet because there's too much crufty hardware in the middle that will drop it instead of routing it. Also, Microsoft refused to implement it in Windows and banned raw sockets, so it's impossible to get support for it on that platform without custom drivers that practically nobody will install.
I don't know how familiar the developers of QUIC were with SCTP in particular but they were definitely aware of the problems that prevented a better TCP from existing. The only practical solution is to build something on top of UDP, but if even that option proves unworkable, then the only other possibility left is to fragment the Internet.
I like (some aspects of) SCTP too but it's not a solution to this problem.
If you've followed Dave Taht's bufferbloat stuff, the reason he lost faith in TCP is because middle devices have access to the TCP header and can interfere with it.
If SCTP got popular, then middle devices would ruin SCTP in the same way.
QUIC is the bufferbloat preferred solution because the header is encrypted. It's not possible for a middle device to interfere with QUIC. Endpoints, and only endpoints, control their own traffic.
They couldn't have built it on anything but UDP because the world is now filled with poorly designed firewall/NAT middleboxes which will not route things other than TCP, UDP and optimistically ICMP.
Counterpoint: it is paying off, it's just taking a while. This paper wasn't "QUIC is bad"; it was "OSes need more optimization for QUIC to be as fast as HTTPS over TCP".
I think this is slightly wrong. The goal was to be faster without requiring OS/middlebox support. Optimizing the OSes that need high performance is much easier, since that's a much smaller set of OSes (basically just Linux/Mac/Windows).
Yeah they probably wanted a protocol that would actually work on the wild internet with real firewalls and routers and whatnot. The only option if you want that is building on top of UDP or TCP and you obviously can't use TCP.
Your first point is correct - papers ideally lead to innovation and tangible software improvements.
I think a kernel implementation of QUIC is the next logical step. A context switch to decrypt a packet header and send control traffic is just dumb. That's the kernel's job.
Userspace network stacks have never been a good idea. QUIC is no different.
(edit: Xin Long already has started a kernel implementation, see elsewhere on this page)