vMotion Deep Dive Series

In my latest blog posts we discussed the vSphere vMotion process in detail. Those blog posts are now accompanied by light board videos (including subtitles/closed captions!). Be sure to check this vMotion Deep Dive Series:

Part 1: Introduction to the vMotion process

This video will explain the basic concept of a workload live migration using vSphere vMotion. We’ll discuss in detail what processes are involved from a vCenter Server and ESXi host perspective. What happens when you initiate a vMotion? Watch the video to find out!

Part 2: vMotion Memory Copy – Under the Hood

In a previous video, we explained the vMotion process. This video will give detailed information on how vMotion keeps track of memory pages, and how memory is copied from the source to the destination ESXi host.

Part 3: vMotion Stream Architecture

This video will share details on the vMotion stream architecture. It will show what can be done to tune vMotion bandwidth utilization for high bandwidth networks (25GbE and up).

Part 4: Troubleshooting vMotion

In this video, we will explain how to troubleshoot vMotion. Which log files are used by the various processes, and what should you look for in those log files? Learn by watching this video.


My VMworld 2019

VMworld 2019 will mark my first edition as a VMware employee. Thus, I will be working more and have less time to attend sessions myself. However, there are always things to look forward to! As many know, it’s not only about the break-out sessions. I love visiting the Solution Expo and the bloggers area and, most of all, meeting old and new friends!

Come meet me at the Meet-the-Expert tables, at the TAM customer day or during one of my break-out sessions. I will be presenting the following sessions:

Make sure to reserve your seat as soon as possible! All 3 sessions are different from each other, but in each one I get to co-present with an awesome peer. The session about vMotion will be together with one of the lead engineers on vMotion. We’ll be discussing vMotion on a deep level and talk about how to tune vMotion to saturate NICs up to 100GbE. The session will include lots of hidden gems on vMotion!

The talk about the latest server technologies is something dear to me, talking about hardware accelerations together with one of the product managers! We will go into how workloads can consume all the latest and greatest in hardware innovations.

Last but not least, I will get to co-present with Johan van Amersfoort in a session that is all about a real-life use case in the medical field. The demos in this session will be really interesting: we will show you how cancer cells can be detected at an early stage, effectively helping the medical staff to start treatments.

To Conclude

Lots of other cool sessions are going on during the VMworld week. Lots of them will be recorded, but nothing beats the live interaction! Make sure you reserve your seats or join the waiting lists or queues, as there is a good chance you will still make it into your favorite sessions. I would encourage you to see sessions live but to leave enough time open to mingle with your peers and to be open to meeting new people!

See you at VMworld!! Can’t wait.


The vMotion Process Under the Hood

The VMware vSphere vMotion feature is one of the most important capabilities in today’s virtual infrastructures. Since its inception in 2002 and its release in 2003, it has allowed us to migrate the active state of a virtual machine from one physical ESXi host to another. Today, the ability to seamlessly migrate virtual machines is an integral part of nearly every virtualization deployment. The portability of workloads is the foundation for a true hybrid cloud experience, as it allows us to move them between on-premises and public clouds using VMware Hybrid Cloud Extension (HCX). vSphere vMotion still is and always will be one of the most momentous game-changers in the IT industry.

A lot has been developed on the vMotion internals over the years to support new technologies.

This blog post will focus on compute-only migrations, which is the standard vMotion that migrates the active compute state from a source to a destination ESXi host. We also have the possibility to perform a Storage vMotion that, when combined with compute vMotion, is considered to be an XvMotion. Other flavors are Long Distance vMotion and Cross vCenter vMotion, both of which are primarily vCenter Server operations on top of the ESXi side of the vMotion process.

Reading this article will give you a better understanding of the ‘magic’ that is happening under the hood when you initiate a virtual machine migration.

vMotion Process

When a virtual machine migration is started, the vCenter Server instance will execute a so-called long-running migration task to process the migration. The first step is to perform a compatibility check. Is it possible to run the virtual machine on the destination host? Think about possible constraints that could prevent a live-migration. Next is to tell the source and destination ESXi hosts what is happening. A migration specification is created that contains the following information:

  • The virtual machine that is being live-migrated
  • Configuration of that virtual machine (virtual hardware, VM options, etc.)
  • Source ESXi host
  • Destination ESXi host
  • vMotion network details
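To make the list above a bit more tangible, here is a minimal Python sketch of what such a migration specification could look like as a data structure. The class name, field names and sample values are all hypothetical and only mirror the five bullet points; the real specification used by vCenter Server is not shown in this post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MigrationSpec:
    """Hypothetical container mirroring the five items in the list above."""
    vm_name: str            # the virtual machine that is being live-migrated
    vm_config: dict         # virtual hardware, VM options, etc.
    source_host: str        # source ESXi host
    destination_host: str   # destination ESXi host
    vmotion_network: dict   # vMotion network details


# Sample values are made up for illustration only.
spec = MigrationSpec(
    vm_name="app-vm-01",
    vm_config={"num_vcpus": 4, "memory_gb": 24},
    source_host="esxi-01.example.local",
    destination_host="esxi-02.example.local",
    vmotion_network={"source_vmk": "vmk1", "destination_vmk": "vmk1"},
)
```

The same object would be handed to both hosts, which is the point of the specification: source and destination work from identical information.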

The migration specification is shared with the source and destination ESXi hosts by the vCenter Server instance, making sure that all necessary information is exchanged to start the migration process. The vCenter Server communicates with ESXi hosts using the Virtual Provisioning X Daemon (VPXD), which calls the Virtual Provisioning X Agent (VPXA) that is running on the ESXi hosts. VPXA listens to messages from VPXD. It receives the migration spec and passes that on to the VMX process via hostd. The Host Daemon (hostd) maintains host-specific information and access for management, including virtual machine telemetry like the VM state. When a migration is started, hostd puts the virtual machine in an intermediate state so that the virtual machine’s configuration cannot be changed during the migration.

The Virtual Machine Monitor (VMM) process is in charge of managing the virtual machine memory and transfers virtual machine storage and network I/O requests to the VMkernel. All other I/O requests, those that are not critical to performance, are forwarded by VMM to VMX. The Virtual Machine Extension (VMX) process runs in the VMkernel and is responsible for handling I/O to devices that are not critical to performance. Note that VMM is only used at the source ESXi host during the migration, because that is where the active memory of the virtual machine resides.

After this is done, the VMkernel migrate module on the source ESXi will open sockets on the vMotion enabled network to set up communication with the destination ESXi host.

Prepare Phase to Pre-Copy Phase

By now, all processes and communication paths are ready for the live migration to start. The prepare phase is all about making sure that the destination ESXi host pre-allocates the compute resources for the to-be migrated virtual machine. Also, the virtual machine is created on the destination ESXi host, but it is masked. All the information about the virtual machine configuration is already known, as that is included in the migration spec.

With the prepare phase done, the process moves to the pre-copy phase where the memory is transferred from the source to the destination ESXi host. There is a need to trace all the virtual machine memory pages on the source ESXi host. By doing that, the vMotion process knows what memory pages are overwritten during migration, referred to as dirty pages, as it needs to re-send these memory pages to the destination host.
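The idea of tracing pages to find out which ones need to be re-sent can be illustrated with a small toy model. This is only a sketch: the class and method names are hypothetical, and a Python set stands in for whatever bookkeeping VMM really uses.

```python
class PageTracer:
    """Toy model of page tracing: remembers which pages the guest wrote to."""

    def __init__(self, num_pages):
        self.num_pages = num_pages
        self.dirty = set()

    def guest_write(self, page_number):
        """Record a guest write; the page now differs from the copy already sent."""
        if not 0 <= page_number < self.num_pages:
            raise ValueError("page number out of range")
        self.dirty.add(page_number)

    def collect_dirty(self):
        """Return the pages dirtied since the last collection and reset the trace."""
        dirtied = sorted(self.dirty)
        self.dirty.clear()
        return dirtied


tracer = PageTracer(num_pages=8)
for page in (2, 5, 2):          # page 2 is written twice but only re-sent once
    tracer.guest_write(page)
to_resend = tracer.collect_dirty()
```

Writing to the same page more than once within an iteration still results in a single re-send, which is why tracking pages rather than individual writes keeps the bookkeeping small.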

Page Tracing

During the pre-copy phase, the vCPUs in use by the virtual machine are briefly stunned to install the page tracers. The VMkernel migration module now asks VMM to start the page tracing, as VMM owns the memory page table state of the virtual machine. The following diagram shows what happens when the guest OS is writing data to memory during a vMotion:

Iterative Memory Pre-Copy

Page tracing is a continuous cycle. It will work towards memory pre-copy convergence by using multiple iterations. The first iteration (pre-copy phase -1) copies the virtual machine memory. The following iterations (pre-copy phase 0 to n) work on copying the dirty memory pages. To give you an example, this is what the iterations could look like as we live-migrate a virtual machine with 24GB of memory:

Phase -1:  Copy the 24GB of virtual machine memory and trace pages. As we send the memory, it dirties 8GB.
Phase 0:  Re-transmit the dirtied 8GB. In the process, the memory dirties another 3GB.
Phase 1:  Send the 3GB. While that transfer is happening, the virtual machine dirties 1GB.
Phase 2:  Send the remaining 1GB.
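As a quick worked calculation on the example above, the following Python snippet adds up how much data crosses the vMotion network in total. The numbers come straight from the four phases; the function names are just for illustration.

```python
# Memory sent in each pre-copy iteration, taken from the example above (in GB).
iterations = [("Phase -1", 24), ("Phase 0", 8), ("Phase 1", 3), ("Phase 2", 1)]


def total_transferred_gb(phases):
    """Sum of all data sent over the vMotion network."""
    return sum(size for _, size in phases)


def overhead_ratio(phases):
    """Total data sent divided by the VM memory size (the first phase)."""
    return total_transferred_gb(phases) / phases[0][1]


total = total_transferred_gb(iterations)   # 24 + 8 + 3 + 1 = 36 GB
ratio = overhead_ratio(iterations)         # 36 / 24 = 1.5
```

So in this example, migrating a 24GB virtual machine means sending 36GB in total, because the dirtied pages have to be transmitted again.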

As the memory pages are copied from the source to the destination ESXi host, we need to determine when all memory is copied to its destination. VMM asks the VMkernel if the pre-copy process can be terminated after each iteration. This is only possible when all memory changes (dirty pages) are copied to the destination host. Part of the iterative memory pre-copy algorithm is to match all destination memory pages to their source pages. Starting at page zero all the way to the maximum or last memory page number, all memory pages are sequentially checked to see if the destination pages are in sync with the source pages.

To determine if we can terminate pre-copy, we need to verify whether we can complete the last memory page copy in a window of < 500ms. We can calculate this using information in the migration tax:

  • The migration transmit rate; at what speed (GbE) are we copying memory data between the hosts?
  • The dirty page rate (GB/s); how many memory pages are being over-written by the guest OS?
  • How many pages do we have left to transmit to the destination host?

If the remaining pages cannot be copied within that window, the next iteration happens. If they can, the VMkernel migrate module will terminate the pre-copy process.
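The check described above can be sketched in a few lines of Python. The exact calculation used by the VMkernel is not given in this post, so the formula below (remaining data divided by transmit rate, compared to the 500ms window) is an assumption, as are the function names and the example rate.

```python
SWITCHOVER_WINDOW_S = 0.5  # the "< 500ms" window mentioned in the text


def can_terminate_precopy(remaining_gb, transmit_rate_gb_s,
                          window_s=SWITCHOVER_WINDOW_S):
    """Assumed check: can the remaining pages be sent within the window?"""
    if transmit_rate_gb_s <= 0:
        raise ValueError("transmit rate must be positive")
    return remaining_gb / transmit_rate_gb_s < window_s


def is_converging(dirty_rate_gb_s, transmit_rate_gb_s):
    """Pre-copy can only shrink the remaining set if we send faster than pages dirty."""
    return dirty_rate_gb_s < transmit_rate_gb_s


# Illustrative rate: 10 Gbit/s is 1.25 GB/s, ignoring protocol overhead.
rate = 1.25
one_gb_left = can_terminate_precopy(1.0, rate)    # 0.8 s needed: another iteration
half_gb_left = can_terminate_precopy(0.5, rate)   # 0.4 s needed: pre-copy can stop
```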

Now what will happen if the dirty page rate is higher than the migration transmit rate? If that is the case, there is no point in doing another iteration because we can never reach memory pre-copy convergence and the migration would come to a halt. This is why we introduced Stun During Page Send (SDPS) with vSphere 5.0. Basically, SDPS is a way for the VMkernel to tell VMM to not run the scheduled instructions but to introduce a really short ‘sleep’. This may sound like an impact on workload performance, but this happens at a fine-grained level. We are talking microseconds here and it is because of these very small timeframes we can converge and the vMotion process will succeed.

SDPS is executed with each iteration if the dirty page rate > transmit rate. Subsequent iterations only copy the dirty memory pages that were modified during the previous iteration. A shorter iteration gives the guest OS less opportunity to modify or dirty its memory pages, thereby shortening the next pre-copy iteration. Although there is a form of performance cost involved, typically SDPS is not noticeable to the workload. The goal is always to leave the guest OS unaware of the migration happening.
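To see why slowing down the guest helps convergence, here is a toy simulation in Python. It is a sketch under loud assumptions: the real SDPS mechanism injects microsecond sleeps into the vCPUs, while this model simply caps the effective dirty rate at a fraction of the transmit rate (the `throttle` parameter is made up). The 500ms termination window is taken from the text.

```python
def simulate_precopy(memory_gb, transmit_gb_s, dirty_gb_s, sdps=True,
                     throttle=0.5, window_s=0.5, max_iterations=50):
    """Return the iteration in which the remaining data fits the window, or None."""
    remaining = memory_gb
    for iteration in range(1, max_iterations + 1):
        send_time = remaining / transmit_gb_s
        if send_time < window_s:
            return iteration
        effective_dirty = dirty_gb_s
        if sdps and dirty_gb_s >= transmit_gb_s:
            # Assumption: SDPS slows the guest so it dirties memory
            # at only a fraction of the transmit rate.
            effective_dirty = transmit_gb_s * throttle
        # The guest cannot dirty more memory than it has.
        remaining = min(effective_dirty * send_time, memory_gb)
    return None


# 24GB VM, 1.25 GB/s transmit rate, guest dirtying 2 GB/s (illustrative numbers).
without_sdps = simulate_precopy(24, 1.25, 2.0, sdps=False)
with_sdps = simulate_precopy(24, 1.25, 2.0, sdps=True)
```

Without the throttle the remaining data never shrinks and the simulation gives up. With it, every iteration halves the remaining data until the last copy fits in the window.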


With the memory pre-copy migration terminated by VMM, all memory pages reside on the destination ESXi host. VMM now sends a remote procedure call (RPC) to VMX that it can suspend the source virtual machine. VMX will enter the checkpoint phase where it suspends the virtual machine and sends the checkpoint data to the destination ESXi host.

In the process, the virtual machine on the destination ESXi host will be de-masked, and the state is restored using the checkpoint data. What basically happens is that the virtual machine on the destination is powered on, but the boot process is interrupted and pointed to the memory pages that are migrated from the source ESXi host. All this typically happens in 100-200ms. That is the stun time in which the virtual machine is not in a running state. The duration of the stun time depends on a variety of factors like host hardware and dynamic guest workloads.

The virtual machine is now live-migrated!

To Conclude…

Although I’ve tried to explain the vMotion process in-depth, there are far more details to share on what happens in the background. I hope you appreciate this blog post. A big thank you goes out to the VMware vMotion engineering team for providing invaluable information.

If you are attending VMworld, be sure to visit the HBI2333BU – ‘How to Get the Most Out of vSphere vMotion’ session! With workloads increasingly growing, what can be done to increase vMotion performance? In this session, you will learn even more detailed information about the vMotion process, and get best practices on how to lower migration times and debugging. Find out how to tune vMotion to get line-rate performance using 100GbE NICs.

Other resources to learn

Enhanced vMotion Compatibility (EVC) Explained
VMotion, the story and confessions


Exploring the GPU Architecture

A Graphics Processing Unit (GPU) is mostly known as the hardware device used when running applications that weigh heavy on graphics, e.g. 3D modeling software or VDI infrastructures. In the consumer market, a GPU is mostly used to accelerate gaming graphics. Today, GPGPUs (General Purpose GPUs) are the choice of hardware to accelerate computational workloads in modern High Performance Computing (HPC) landscapes.

HPC in itself is the platform serving workloads like Machine Learning (ML), Deep Learning (DL), and Artificial Intelligence (AI). Using a GPGPU is not only about ML computations that require image recognition anymore. Calculations on tabular data are also a common exercise in, e.g., the healthcare, insurance and financial industry verticals. But why do we need a GPU for all these types of workloads? This blog post will go into the GPU architecture and why GPUs are a good fit for HPC workloads running on vSphere ESXi.

Latency vs Throughput

Let’s first take a look at the main differences between a Central Processing Unit (CPU) and a GPU. A common CPU is optimized to finish a task as quickly as possible at the lowest possible latency, while keeping the ability to quickly switch between operations. Its nature is all about processing tasks in a serialized way. A GPU is all about throughput optimization, allowing it to push as many tasks as possible through its internals at once. It does so by being able to process a task in parallel. The following exemplary diagram shows the ‘core’ count of a CPU and GPU. It emphasizes that the main contrast between both is that a GPU has a lot more cores to process a task.

Differences and Similarities

However, it is not only about the number of cores. And when we speak of cores in an NVIDIA GPU, we refer to CUDA cores that consist of ALUs (Arithmetic Logic Units). Terminology may vary between vendors.

Looking at the overall architecture of a CPU and GPU, we can see a lot of similarities between the two. Both use the memory constructs of cache layers, memory controller and global memory. A high-level overview of modern CPU architectures indicates it is all about low latency memory access by using significant cache memory layers. Let’s first take a look at a diagram that shows a generic, memory-focused, modern CPU package (note: the precise layout strongly depends on vendor/model).

A single CPU package consists of cores that contain separate data and instruction layer-1 caches, supported by the layer-2 cache. The layer-3 cache, or last level cache, is shared across multiple cores. If data is not residing in the cache layers, the CPU will fetch the data from the global DDR4 memory. The number of cores per CPU can go up to 28 or 32, running at up to 2.5 GHz, or 3.8 GHz with Turbo mode, depending on make and model. Cache sizes range up to 2MB L2 cache per core.

Exploring the GPU Architecture

If we inspect the high-level architecture overview of a GPU (again, strongly dependent on make/model), it looks like the nature of a GPU is all about putting available cores to work, and it is less focused on low latency cache memory access.

A single GPU device consists of multiple Processor Clusters (PC) that contain multiple Streaming Multiprocessors (SM). Each SM accommodates a layer-1 instruction cache layer with its associated cores. Typically, one SM uses a dedicated layer-1 cache and a shared layer-2 cache before pulling data from global GDDR5 memory. Its architecture is tolerant of memory latency.

Compared to a CPU, a GPU works with fewer, and relatively small, memory cache layers. The reason is that a GPU has more transistors dedicated to computation, meaning it cares less about how long it takes to retrieve data from memory. The potential memory access ‘latency’ is masked as long as the GPU has enough computations at hand, keeping it busy.

A GPU is optimized for data parallel throughput computations.

Looking at the number of cores, it quickly shows you the possibilities for parallelism that a GPU is capable of. When examining the current NVIDIA flagship offering, the Tesla V100, one device contains 80 SMs, each containing 64 cores, making a total of 5120 cores! Tasks aren’t scheduled to individual cores, but to processor clusters and SMs. That’s how it’s able to process in parallel. Now combine this powerful hardware device with a programming framework so applications can fully utilize the computing power of a GPU.
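The latency versus throughput trade-off can be put into numbers with a small toy model in Python. The core counts come from this post (28 CPU cores, 80 x 64 GPU cores); the per-task times are hypothetical and only express that a single task takes longer on a GPU core than on a CPU core. Comparing CPU cores and CUDA cores one-to-one is a simplification.

```python
import math


def batch_time(num_tasks, lanes, time_per_task):
    """Time to finish num_tasks when `lanes` tasks run at once."""
    return math.ceil(num_tasks / lanes) * time_per_task


GPU_CORES = 80 * 64    # 80 SMs x 64 cores = 5120, as stated above
CPU_CORES = 28         # upper range mentioned earlier in this post
CPU_TASK_TIME = 1.0    # hypothetical time units per task on a CPU core
GPU_TASK_TIME = 10.0   # hypothetical: each single task is slower on a GPU core

# A handful of tasks: the low-latency CPU wins.
small = (batch_time(10, CPU_CORES, CPU_TASK_TIME),
         batch_time(10, GPU_CORES, GPU_TASK_TIME))

# A large data-parallel batch: the high-throughput GPU wins.
large = (batch_time(100_000, CPU_CORES, CPU_TASK_TIME),
         batch_time(100_000, GPU_CORES, GPU_TASK_TIME))
```

With these made-up numbers, ten tasks finish sooner on the CPU, while one hundred thousand independent tasks finish far sooner on the GPU even though each individual task is ten times slower there.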

ESXi support for GPU

VMware vSphere ESXi supports the usage of GPUs. You will be able to dedicate a GPU device to a VM using DirectPath I/O, or assign a partitioned vGPU to a VM using the co-developed NVIDIA GRID technology or using 3rd party tooling like BitFusion. To fully understand how GPUs are supported in vSphere ESXi and how to configure them, please review the following blog series:

To conclude

High Performance Computing (HPC) is the use of parallel processing for running advanced application programs efficiently, reliably and quickly.

This is exactly why GPUs are a perfect fit for HPC workloads. Workloads can greatly benefit from using GPUs, as they enable massive increases in throughput. An HPC platform using GPUs will become much more versatile, flexible and efficient when running on top of the VMware vSphere ESXi hypervisor. It allows GPU-based workloads to allocate GPU resources in a very flexible and dynamic way.

More resources to learn

Machine Learning with GPUs on vSphere

Why the Data Scientist and Data Engineer Need to Understand Virtualization in the Cloud

Running common Machine Learning Use Cases on vSphere leveraging NVIDIA GPU

Machine Learning with H2O – the Benefits of VMware


ESXi Network Troubleshooting Tools

In the previous post about the ESXi network IOChain, we explored the various constructs that belong to the network path. This blog post builds on top of that and focuses on the tools for advanced network troubleshooting and verification. Today, vSphere ESXi is packaged with an extensive toolset that helps you to check connectivity or verify bandwidth availability. Some tools are not only applicable inside your ESXi box, but are also very usable for the physical network components involved in the network paths.

Access to the ESXi shell is a necessity, as the commands are executed here. A good starting point for connectivity troubleshooting is the esxtop network view. Also, the esxcli network namespace provides a lot of information. We also have (vmk)ping and traceroute at our disposal. However, if you are required to dig deeper into a network issue, the following list of tools might help you out:

  • net-stats
  • pktcap-uw
  • nc
  • iperf


We’ll start off with one of my favorites: net-stats. This command can get you a lot of deep dive insights on what is happening under the covers of networking on an ESXi host, as it can collect port stats. The command is quite extensive as it allows for a lot of options. The net-stats -h command displays all flags. The most common one is the list option. Use net-stats -l to determine the switchport numbers and MAC addresses for all VMkernel interfaces, vmnic uplinks and vNIC ports. This information is also used as input for other tools described in this blog post.

To give some more examples, net-stats can also provide in-depth details on what worldlets (or CPU threads, listed as “sys”) are spun up for handling network IO by issuing net-stats with the following flags: net-stats -A -t vW. Output provided by these options helps in verifying if NetQueue or Receive Side Scaling (RSS) is active for vmnics by mapping the “sys” output to the worldlet name using, e.g., the vsi shell (vsish -e cat /world/<world id>/name).
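If you have a handful of world IDs to look up, a tiny helper can generate the vsish commands for you. This Python sketch only builds the command strings from the pattern shown above; the function name and the world IDs are hypothetical, and it does not parse any net-stats output.

```python
def vsish_world_name_command(world_id):
    """Build the vsish command that prints the name of the given world."""
    world_id = int(world_id)
    if world_id <= 0:
        raise ValueError("world id must be a positive integer")
    return f"vsish -e cat /world/{world_id}/name"


# Hypothetical world IDs, as you might read them from the "sys" entries.
world_ids = [2097412, 2097413]
commands = [vsish_world_name_command(w) for w in world_ids]
```

The resulting strings can be pasted into the ESXi shell one by one.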

Using different options, net-stats provides great insights on network behaviour.




Understanding the ESXi Network IOChain

In this blog post, we go into the trenches of the (Distributed) vSwitch with a focus on the vSphere ESXi network IOChain. It is important to understand the core constructs of the vSphere networking layers for, e.g., troubleshooting connectivity issues. In a second blog post on this topic, we will look closer into virtual network troubleshooting tooling.


The vSphere ESXi network IOChain is a framework that provides the capability to insert functions into the network data-path regardless of the usage of a vSphere Standard Switch (VSS) or a vSphere Distributed Switch (VDS). The IOChain is a group of functions that provides connectivity between ports and the vSwitch. A port has two IOChains based on the direction to and from the vSwitch. This means each port is associated with both an input and an output IOChain. This allows for a modular approach by only including optional elements in an IOChain as configured by the user.

Examples of optional elements in an IOChain are VLAN support, NIC teaming, and traffic shaping. Looking at the high-level components in an ESXi network IOChain, we differentiate between the port group, the vSwitch (VSS or VDS) and the uplink level.
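The modular idea of an IOChain, a list of functions per direction that only contains what the user configured, can be sketched in Python. Everything here is hypothetical: the function names, the packet dictionaries, and the byte counter that stands in for a real traffic shaper. It is meant to show the chaining concept, not how ESXi implements it.

```python
def vlan_filter(allowed_vlan):
    """Return a stage that passes only packets tagged with allowed_vlan."""
    def stage(packet):
        return packet if packet.get("vlan") == allowed_vlan else None
    return stage


def byte_counter(stats):
    """Return a stage that counts bytes; a simple stand-in for a traffic shaper."""
    def stage(packet):
        stats["bytes"] = stats.get("bytes", 0) + packet["size"]
        return packet
    return stage


def build_iochain(vlan=None, stats=None):
    """Include only the optional elements that were configured."""
    chain = []
    if vlan is not None:
        chain.append(vlan_filter(vlan))
    if stats is not None:
        chain.append(byte_counter(stats))
    return chain


def run_iochain(chain, packet):
    """Pass the packet through each stage; a stage returning None drops it."""
    for stage in chain:
        packet = stage(packet)
        if packet is None:
            return None
    return packet


class Port:
    """Each port carries one IOChain per direction."""
    def __init__(self, input_chain, output_chain):
        self.input_chain = input_chain
        self.output_chain = output_chain


stats = {}
port = Port(input_chain=build_iochain(vlan=10, stats=stats),
            output_chain=build_iochain())
accepted = run_iochain(port.input_chain, {"vlan": 10, "size": 1500})
dropped = run_iochain(port.input_chain, {"vlan": 20, "size": 64})
untouched = run_iochain(port.output_chain, {"vlan": 20, "size": 64})
```

An unconfigured chain is simply empty, so packets pass through it without any extra processing.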

Port group level

This is where an optional configured VLAN is interpreted by the VLAN filter, allowing for VLAN dot1q tags for your port group. The security settings Promiscuous mode, MAC address changes, and Forged transmits are also set at the port group level. The user can also optionally configure traffic shaping, either egress only when using a VSS or bi-directional traffic shaping when using a VDS.

vSwitch (VSS or VDS) level

Incoming packets at the vSwitch level are forwarded to their destination using the forwarding engine. The forwarding engine contains port information paired with MAC address information. Its job is to send the traffic to its proper destination. That can be either a VM residing on the same ESXi host or an external host.
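A forwarding decision based on port information paired with MAC addresses boils down to a table lookup. The following Python sketch is a simplified illustration with hypothetical names, port IDs and MAC addresses: known MAC addresses map to a local port, and everything else is sent towards the uplink.

```python
UPLINK = "uplink"


class ForwardingEngine:
    """Toy MAC-to-port table for a single virtual switch."""

    def __init__(self):
        self.mac_to_port = {}

    def register(self, mac, port_id):
        """Pair a MAC address with the switch port it lives on."""
        self.mac_to_port[mac.lower()] = port_id

    def forward(self, dst_mac):
        """Return the local port for dst_mac, or the uplink for external hosts."""
        return self.mac_to_port.get(dst_mac.lower(), UPLINK)


engine = ForwardingEngine()
engine.register("00:50:56:AA:BB:01", port_id=33554440)   # hypothetical local VM
local = engine.forward("00:50:56:aa:bb:01")              # VM on the same host
external = engine.forward("00:50:56:aa:bb:99")           # unknown MAC: external
```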

The teaming engine is responsible for balancing network packets over the uplink interfaces. The way it does so depends on the teaming configuration chosen by the user. The traffic shaper module is added to the IOChain if enabled at the port group level.

Uplink level

At this level, the traffic sent from the vSwitch to an external host finds its way to the driver module. This is where all the hardware offloading is taking place. The supported hardware offloading features depend strongly on the physical NIC in combination with a specific driver module. Hardware offloading functions typically supported in NICs are TCP Segmentation Offload (TSO), Large Receive Offload (LRO) and Checksum Offload (CSO). Network overlay protocol offloading, as with VXLAN and Geneve used in NSX-v and NSX-T respectively, is widely supported on modern NICs.

Next to hardware offloading, the buffer mechanisms come into play at the uplink level. For example, ring buffers are used when processing a burst of network packets. Finally, the bits are transmitted onto the DMA controller to be handled by the CPU and physical NIC, onwards to the Ethernet fabric.

Standard vSwitch

The following diagram puts all components together to form the IO chain for vSphere networking using a standard vSwitch:
