How to Tune vMotion for Lower Migration Times?

In an earlier blog post, The vMotion Process Under the Hood, we went over the vMotion process internals. Now that we understand how vMotion works, let's go over some of the options we have today to lower migration times even further. Out of the box, vMotion works beautifully. However, with high-bandwidth networks quickly becoming mainstream, what can we do to fully take advantage of 25, 40, or even 100GbE NICs? This blog post goes into detail on vMotion tunables and how they can help optimize the process.

Streams and Threads

To understand how we can tune vMotion performance and thereby lower live-migration times, we first need to understand the concept of vMotion streams. The streaming architecture in vMotion was first introduced in vSphere 4.1 and has been developed and improved ever since. One of the prerequisites for performing live migrations is to have a vMotion network configured. As part of enabling vMotion, you need at least one VMkernel interface that is enabled for vMotion traffic on your applicable ESXi hosts.
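
If you want to verify (or enable) this with PowerCLI, a quick sketch like the following can help; the host name and the VMkernel adapter name are placeholders for your own environment:

# List the VMkernel adapters of a host and whether vMotion is enabled on them
Get-VMHost -Name 'esxi01.lab.local' | Get-VMHostNetworkAdapter -VMKernel | Select-Object Name, IP, PortGroupName, VMotionEnabled

# Enable vMotion traffic on an existing VMkernel adapter (vmk1 as an example)
Get-VMHost -Name 'esxi01.lab.local' | Get-VMHostNetworkAdapter -VMKernel -Name 'vmk1' | Set-VMHostNetworkAdapter -VMotionEnabled:$true -Confirm:$false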

When you have a VMkernel interface that is enabled for vMotion, a single vMotion stream is instantiated. One vMotion stream contains three helpers: (more…)

Read More

My VMworld 2019

VMworld 2019 will mark my first edition as a VMware employee. That means I will be working more and have less time to attend sessions myself. However, there are always things to look forward to! As many know, it's not only about the break-out sessions. I love visiting the Solution Expo and the bloggers area and, most of all, meeting old and new friends!

Come meet me at the Meet-the-Expert tables, at the TAM customer day or during one of my break-out sessions. I will be presenting the following sessions:

(more…)

Read More

The vMotion Process Under the Hood

The VMware vSphere vMotion feature is one of the most important capabilities in today's virtual infrastructures. Since its inception in 2002 and its release in 2003, it has allowed us to migrate the active state of a virtual machine from one physical ESXi host to another. Today, the ability to seamlessly migrate virtual machines is an integral part of nearly every virtualization deployment. The portability of workloads is the foundation for a true hybrid cloud experience, enabled by moving them between on-premises and public clouds using VMware Hybrid Cloud Extension (HCX). vSphere vMotion still is, and always will be, one of the most momentous game-changers in the IT industry.

A lot of work has gone into the vMotion internals over the years to support new technologies.

(more…)

Read More

Enhanced vMotion Compatibility (EVC) Explained

–originally authored and posted by me at https://blogs.vmware.com/vsphere/2020/03/how-is-virtual-memory-translated-to-physical-memory.html–

vSphere Enhanced vMotion Compatibility (EVC) ensures that workloads can be live migrated, using vMotion, between ESXi hosts in a cluster that run different CPU generations. The general recommendation is to have EVC enabled, as it will help you in the future when you scale your clusters with new hosts that might contain new CPU models. Enabling EVC in a brownfield scenario can be challenging, which is why we stress having it enabled from the get-go. This blog post will go into detail about EVC and the per-VM EVC feature.

How does EVC work?

The way EVC allows for uniform vMotion compatibility is by enforcing a CPUID (instruction) baseline for the virtual machines running on the ESXi hosts. That means EVC will allow and expose CPU instruction sets to the virtual machines depending on the chosen and supported compatibility level. If you add a newer host to the cluster, containing newer CPU packages, EVC will potentially hide the new CPU instructions from the virtual machines. By doing so, EVC ensures that all virtual machines in the cluster run on the same CPU instructions, allowing them to be live migrated (vMotion) between the ESXi hosts.

EVC determines what instructions to mask from the guest OS by using CPUID. Basically, you can look at CPUID as being an API for the CPU. It allows EVC to learn what instruction sets the CPU is capable of executing, and what instructions need to be masked depending on the configured EVC baseline. When EVC is enabled on the cluster, all ESXi hosts in the cluster must adhere to that setting.
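
If you quickly want to see which baseline a cluster is currently enforcing, the EVCMode property that PowerCLI exposes on cluster objects gives you that in one line:

# Show the configured EVC baseline per cluster, for example 'intel-broadwell'
Get-Cluster | Select-Object Name, EVCMode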

This VMware KB article goes into more detail about all current EVC baselines and what CPU instructions they expose to the virtual machines.

Per-VM EVC

EVC is a cluster-level setting that supports virtual machine mobility within a cluster. When a virtual machine is moved to another cluster, either on-premises or in a hybrid cloud environment, it loses its EVC configuration depending on the destination environment. On top of that, it is challenging to change a cluster EVC baseline in an environment with live workloads.

By implementing per-VM EVC, the EVC mode becomes an attribute of the virtual machine rather than of the specific processor generation it happens to be booted on in the cluster. This allows for more flexibility with EVC enablement and baselines. We introduced the per-VM EVC feature in vSphere 6.7. Virtual machine hardware version 14 or later is required to enable per-VM EVC. When a virtual machine is powered off, you can change its per-VM EVC baseline.


The per-VM EVC configuration is saved in the vmx file. The vmx file acts as a dictionary of key-value pairs that holds the configuration of the virtual machine. If the virtual machine is migrated to another cluster, the per-VM EVC configuration moves along with the virtual machine itself. The vmx file will contain featMask.vm.cpuid lines like the following when per-VM EVC is enabled:

featMask.vm.cpuid.Intel = "Val:1"
featMask.vm.cpuid.FAMILY = "Val:6"
featMask.vm.cpuid.MODEL = "Val:0x4f"
featMask.vm.cpuid.STEPPING = "Val:0"
featMask.vm.cpuid.NUMLEVELS = "Val:0xd"

Customer Feedback

A recent Twitter poll showed some interesting results and feedback. It looks like 80% of our customers are in fact using EVC. However, taking a look at our telemetry data, the number of EVC-enabled clusters and virtual machines showed a slightly different picture. Still, it's good to see that a large proportion of our customer base already benefits from EVC by having it enabled on their clusters and/or virtual machines.

Going through the comments, opinions on having EVC enabled by default differ. We see a lot of customers who understand that enabling EVC in a brownfield environment is challenging, so they opt to enable EVC from the start. On the other hand, we see customers who didn't enable EVC because they have uniform clusters and don't see the value of having it enabled. It is important to understand that the EVC feature itself has zero overhead on your virtual infrastructure. However, it can save you from the burden of enabling cluster EVC later on, when you might want to scale your cluster with additional hosts that might contain newer CPU versions.

Another customer concern is the impact on performance. What about workloads that cannot use the latest and greatest CPU instructions because of the configured EVC baseline? It does depend on the workloads, but overall we don't see a significant impact on performance from new CPU instructions not being exposed to the application running inside the guest OS. VMware released a paper that goes into detail on this topic.

To enable EVC in a live environment with virtual machines powered on, you would need to power down the virtual machines in order to change the EVC configuration. This is an area where per-VM EVC helps. Check out this extensive post by Kyle Ruddy on how you can enable per-VM EVC in an automated way.
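
As a rough illustration of what that automation can look like (this is a sketch based on the public vSphere API, not Kyle's exact script; the VM name 'DB01' and the baseline key 'intel-broadwell' are just examples), the ApplyEvcModeVM method on a virtual machine accepts the feature masks of a supported EVC mode:

# The virtual machine must be powered off to change its per-VM EVC baseline
$vm = Get-VM -Name 'DB01'
# Pick one of the EVC modes known to vCenter Server and reuse its feature masks
$si = Get-View ServiceInstance
$evcMode = $si.Capability.SupportedEVCMode | Where-Object { $_.Key -eq 'intel-broadwell' }
# Apply that baseline to the virtual machine as its per-VM EVC configuration
$vm.ExtensionData.ApplyEvcModeVM($evcMode.FeatureMask, $true)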

Check EVC Configurations

To gain insight into your environment and the EVC configurations in use, you can use scripting. The following snippet allows for creating an overview that includes the virtual machines and the virtual machine EVC level next to the cluster EVC level. Because it is tabular data, it is easily exported to a CSV file by adding | Export-Csv.
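
A minimal sketch of such a snippet could look like the one below; the HardwareVersion property requires a recent PowerCLI version, and the MinRequiredEVCModeKey runtime property is used here as an approximation of the effective per-VM EVC baseline:

# Overview of VMs with their hardware version, cluster EVC baseline, and per-VM EVC baseline
Get-VM | Select-Object Name, HardwareVersion,
    @{Name = 'ClusterEVCMode'; Expression = { (Get-Cluster -VM $_).EVCMode }},
    @{Name = 'VMEVCMode'; Expression = { $_.ExtensionData.Runtime.MinRequiredEVCModeKey }}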

Running this PowerCLI command will give you insight into what EVC baselines are configured when EVC is enabled. When cluster EVC mode is enabled, the script shows the baseline in use. It also shows the VM hardware version, because per-VM EVC is only available for VM hardware version 14 and later. If a virtual machine is configured with per-VM EVC, the baseline can differ from the cluster EVC configuration. In the following exemplary output you'll notice the virtual machine 'DB01' with a per-VM EVC baseline that differs from the cluster setting.

This is a supported situation. However, if a virtual machine's per-VM EVC baseline is higher than what the ESXi hosts in the cluster support, the virtual machine will not power on because there is no host compatible with its per-VM EVC baseline.

You should always verify that your ESXi hosts support the configured EVC baselines to ensure they can accommodate the virtual machines running a per-VM EVC configuration. If you need information about your ESXi hosts and their maximum supported EVC level, you can issue the following PowerCLI command:
Get-VMHost | Select-Object Name,ProcessorType,MaxEVCMode

The output shows you the ESXi hosts, what CPU packages they are running, and the maximum supported EVC baseline.

To Conclude

As stated before, the key takeaway is that the general recommendation is to have EVC enabled. For a more granular approach and hybrid cloud support, the per-VM EVC feature is a good starting point when implementing EVC in your virtual infrastructure. Having the EVC feature enabled will allow you to benefit from workload mobility to the fullest!

Read More

Exploring the GPU Architecture

A Graphics Processing Unit (GPU) is mostly known as the hardware device used when running applications that are heavy on graphics, e.g. 3D modeling software or VDI infrastructures. In the consumer market, a GPU is mostly used to accelerate gaming graphics. Today, GPGPUs (General Purpose GPUs) are the hardware of choice to accelerate computational workloads in modern High Performance Computing (HPC) landscapes.

HPC is the platform serving workloads like Machine Learning (ML), Deep Learning (DL), and Artificial Intelligence (AI). Using a GPGPU is no longer only about ML computations such as image recognition. Calculations on tabular data are also a common exercise in, for example, the healthcare, insurance, and financial industry verticals. But why do we need a GPU for these types of workloads? This blog post will go into the GPU architecture and why it is a good fit for HPC workloads running on vSphere ESXi.

Latency vs Throughput

Let's first take a look at the main differences between a Central Processing Unit (CPU) and a GPU. A common CPU is optimized to finish a task as quickly as possible, at as low a latency as possible, while keeping the ability to quickly switch between operations. Its nature is all about processing tasks in a serialized way. A GPU is all about throughput optimization, pushing as many tasks as possible through its internals at once. It does so by processing tasks in parallel. The following exemplary diagram shows the 'core' count of a CPU and a GPU. It emphasizes that the main contrast between the two is that a GPU has many more cores with which to process a task.

Differences and Similarities

However, it is not only about the number of cores. When we speak of cores in an NVIDIA GPU, we refer to CUDA cores, which consist of ALUs (Arithmetic Logic Units). Terminology may vary between vendors. (more…)

Read More