TCP Segmentation Offload in ESXi explained

TCP Segmentation Offload (TSO), also known as Large Segment Offload (LSO), is comparable to a TCP/IP Offload Engine (TOE) but geared toward virtual environments, whereas TOE is the actual NIC vendor hardware enhancement. But what does it do?

When an ESXi host or a VM needs to transmit a large data packet to the network, the packet must be broken down into smaller segments that can pass through all the physical switches and possible routers along the way to the packet’s destination. TSO allows a TCP/IP stack to emit larger frames, even up to 64 KB, when the Maximum Transmission Unit (MTU) of the interface is configured for smaller frames. The NIC then divides the large frame into MTU-sized frames and prepends an adjusted copy of the initial TCP/IP headers to each. This process is referred to as segmentation.
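To make the segmentation step concrete, here is a minimal Python sketch of what a TSO-capable NIC does with one large send. It is an illustration only, not how any driver or NIC firmware is implemented. The names Segment and segment are hypothetical, the 1460-byte MSS assumes a 1500-byte MTU with IPv4 and TCP headers without options, and the only header field modeled is the TCP sequence number.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    seq: int        # TCP sequence number of the first payload byte
    payload: bytes  # at most `mss` bytes of data


def segment(payload: bytes, mss: int, initial_seq: int = 0) -> list[Segment]:
    """Split one large send into MSS-sized segments (hypothetical helper)."""
    if mss <= 0:
        raise ValueError("mss must be positive")
    segments = []
    for offset in range(0, len(payload), mss):
        chunk = payload[offset:offset + mss]
        # Each segment carries a copy of the original header in which the
        # sequence number is advanced by the bytes already sent. Sequence
        # numbers are 32-bit, so they wrap around.
        segments.append(Segment(seq=(initial_seq + offset) % 2**32, payload=chunk))
    return segments


# Assumption: a 64 KB send and an MSS of 1460 bytes (1500 MTU - 20 IPv4 - 20 TCP).
big_send = bytes(64 * 1024)
segs = segment(big_send, mss=1460)
print(len(segs), len(segs[-1].payload))  # prints "45 1296"
```

Without TSO, the host CPU would build all 45 of these segments itself. With TSO, it hands the single 64 KB buffer to the NIC and the loop above runs in hardware.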

When the NIC supports TSO, it handles the segmentation instead of the host OS itself. The advantage is that the CPU can present up to 64 KB of data to the NIC in a single transmit request, so the host CPU burns fewer cycles segmenting the network packet. To fully benefit from the performance enhancement, you must enable TSO along the complete data path on an ESXi host. If TSO is supported on the NIC, it is enabled by default.

The same goes for TSO in the VMkernel layer and for the VMXNET3 VM adapter, but not necessarily for the TSO configuration within the guest OS. To verify that your pNIC supports TSO and that it is enabled on your ESXi host, use the following command: esxcli network nic tso get. The output will look similar to the following screenshot, where TSO is enabled for all available pNICs (vmnics).



Virtual Networking: Poll-mode vs Interrupt

The VMkernel relies on the physical device, the pNIC in this case, to generate interrupts to process network I/O. This traditional style of I/O processing incurs additional delays along the entire data path, from the pNIC all the way up into the guest OS. Interrupt-based I/O processing saves CPU because multiple I/Os are combined into one interrupt. In poll mode, the driver and the application running in the guest OS constantly spin, waiting for I/O to become available. This way, an application can process the I/O almost instantly instead of waiting for an interrupt to occur. That allows for lower latency and a higher Packets Per Second (PPS) rate.
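The difference between the two receive styles can be sketched in a few lines of Python. This is only an analogy under stated assumptions, not VMkernel or driver code. A queue.Queue stands in for a device receive queue, a blocking get stands in for sleeping until an interrupt arrives, and the function names are hypothetical.

```python
import queue


def interrupt_style_receive(rx: queue.Queue, timeout: float):
    """Sleep until an item arrives; the blocking wait stands in for an
    interrupt wake-up. Returns None if nothing arrives within `timeout`."""
    try:
        return rx.get(timeout=timeout)
    except queue.Empty:
        return None


def poll_mode_receive(rx: queue.Queue, max_spins: int):
    """Spin on the queue without ever sleeping.
    Returns (item, spins_used); item is None if the queue stayed empty."""
    for spins in range(1, max_spins + 1):
        try:
            return rx.get_nowait(), spins
        except queue.Empty:
            continue  # nothing yet: keep burning CPU and check again
    return None, max_spins


rx = queue.Queue()
rx.put(b"packet-1")
print(poll_mode_receive(rx, max_spins=1_000))  # (b'packet-1', 1)
print(poll_mode_receive(rx, max_spins=1_000))  # (None, 1000)
```

The second call shows the cost of polling: with an empty queue, the loop uses every spin it is allowed and has nothing to show for it. The blocking variant would have slept for that time instead.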

An interesting fact is that the world is moving towards poll-mode drivers. A clear example of this is the NVMe driver stack.

The main drawback is that the poll-mode approach consumes much more CPU time because of the constant polling for I/O and the immediate processing. Basically, the vCPUs used for polling consume all the CPU time you give them. Therefore, it is primarily useful when the workloads running on your VMs are extremely latency sensitive. It is a perfect fit for data plane telecom applications like a Packet Gateway (PGW) node as part of an Evolved Packet Core (EPC) in an NFV environment, or for other real-time, latency-sensitive workloads.

With the poll-mode approach, you need a poll-mode driver in your application that polls a specific device queue for I/O. From a networking perspective, Intel’s Data Plane Development Kit (DPDK) delivers just that. You could say that the DPDK framework is a set of libraries and drivers that allow for fast network packet processing.

Data Plane Development Kit (DPDK) greatly boosts packet processing performance and throughput, allowing more time for data plane applications. DPDK can improve packet processing performance by up to ten times. DPDK software running on current generation Intel® Xeon® Processor E5-2658 v4, achieves 233 Gbps (347 Mpps) of LLC forwarding at 64-byte packet sizes. Source:

DPDK in a VM

Using a VM with a VMXNET3 network adapter, you already have the default paravirtual network connectivity in place. The following diagram shows the default logical paravirtual device connectivity.

