Exploring the GPU Architecture

A Graphics Processor Unit (GPU) is mostly known for the hardware device used when running applications that weigh heavy on graphics, i.e. 3D modeling software or VDI infrastructures. In the consumer market, a GPU is mostly used to accelerate gaming graphics. Today, GPGPU’s (General Purpose GPU) are the choice of hardware to accelerate computational workloads in modern High Performance Computing (HPC) landscapes.

HPC in itself is the platform serving workloads like Machine Learning (ML), Deep Learning (DL), and Artificial Intelligence (AI). Using a GPGPU is not only about ML computations that require image recognition anymore. Calculations on tabular data is also a common exercise in i.e. healthcare, insurance and financial industry verticals. But why do we need a GPU for these types of all these workloads? This blogpost will go into the GPU architecture and why they are a good fit for HPC workloads running on vSphere ESXi.

Latency vs Throughput

Let’s first take a look at the main differences between a Central Processing Unit (CPU) and a GPU. A common CPU is optimized to be as quick as possible to finish a task at a as low as possible latency, while keeping the ability to quickly switch between operations. It’s nature is all about processing tasks in a serialized way. A GPU is all about throughput optimization, allowing to push as many as possible tasks through is internals at once. It does so by being able to parallel process a task. The following exemplary diagram shows the ‘core’ count of a CPU and GPU. It emphasizes that the main contrast between both is that a GPU has a lot more cores to process a task.

Differences and Similarities

However, it is not only about the number of cores. And when we speak of cores in a NVIDIA GPU, we refer to CUDA cores that consists of ALU’s (Arithmetic Logic Unit). Terminology may vary between vendors.

Looking at the overall architecture of a CPU and GPU, we can see a lot of similarities between the two. Both use the memory constructs of cache layers, memory controller and global memory. A high-level overview of modern CPU architectures indicates it is all about low latency memory access by using significant cache memory layers. Let’s first take a look at a diagram that shows an generic, memory focussed, modern CPU package (note: the precise lay-out strongly depends on vendor/model).

A single CPU package consists of cores that contains separate data and instruction layer-1 caches, supported by the layer-2 cache. The layer-3 cache, or last level cache, is shared across multiple cores. If data is not residing in the cache layers, it will fetch the data from the global DDR-4 memory. The numbers of cores per CPU can go up to 28 or 32 that run up to 2.5 GHz or 3.8 GHz with Turbo mode, depending on make and model. Caches sizes range up to 2MB L2 cache per core.

Exploring the GPU Architecture

If we inspect the high-level architecture overview of a GPU (again, strongly depended on make/model), it looks like the nature of a GPU is all about putting available cores to work and it’s less focussed on low latency cache memory access.

A single GPU device consists of multiple Processor Clusters (PC) that contain multiple Streaming Multiprocessors (SM). Each SM accommodates a layer-1 instruction cache layer with its associated cores. Typically, one SM uses a dedicated layer-1 cache and a shared layer-2 cache before pulling data from global GDDR-5 memory. Its architecture is tolerant of memory latency.

Compared to a CPU, a GPU works with fewer, and relatively small, memory cache layers. Reason being is that a GPU has more transistors dedicated to computation meaning it cares less how long it takes the retrieve data from memory. The potential memory access ‘latency’ is masked as long as the GPU has enough computations at hand, keeping it busy.

A GPU is optimized for data parallel throughput computations.

Looking at the numbers of cores it quickly shows you the possibilities on parallelism that is it is capable of.  When examining the current NVIDIA flagship offering, the Tesla V100, one device contains 80 SM’s, each containing 64 cores making a total of 5120 cores! Tasks aren’t scheduled to individual cores, but to processor clusters and SM’s. That’s how it’s able to process in parallel. Now combine this powerful hardware device with a programming framework so applications can fully utilize the computing power of a GPU.

ESXi support for GPU

VMware vSphere ESXi supports the usage of GPU’s. You will be able do dedicate a GPU device to a VM using DirectPath I/O, or assign a partitioned vGPU to a VM using the co-developed NVIDIA GRID technology or using 3rd party tooling like BitFusion. To fully understand how GPU’s are supported in vSphere ESXi and how to configure them, please review the following blog series:

To conclude

High Performance Computing (HPC) is the use of parallel processing for running advanced application programs efficiently, reliably and quickly.

This is exactly why GPU’s are a perfect fit for HPC workloads. Workloads can greatly benefit from using GPU’s as it enables them to have massive increases in throughput. A HPC platform using GPU’s will become much more versatile, flexible and efficient when running it on top of the VMware vSphere ESXi hypervisor. It allows for GPU-based workloads to allocate GPU resources in a very flexible and dynamic way.

More resources to learn

Machine Learning with GPUs on vSphere

Why the Data Scientist and Data Engineer Need to Understand Virtualization in the Cloud

Running common Machine Learning Use Cases on vSphere leveraging NVIDIA GPU

Machine Learning with H2O – the Benefits of VMware

Read More

ESXi Network Troubleshooting Tools

In the previous post about the ESXi network IOchain we explored the various constructs that belong to the network path. This blog post builds on top of that and focuses on the tools for advanced network troubleshooting and verification. Today, vSphere ESXi is packaged with a extensive toolset that helps you to check connectivity or verify bandwidth availability. Some tools are not only applicable for inside your ESXi box, but also very usable for the physical network components involved in the network paths.

Access to the ESXi shell is a necessity as the commands are executed here. A good starting point for connectivity troubleshooting is the esxtop network view. Also, the esxcli network commandlet provides a lot of information. We also have (vmk)ping, traceroute at our disposal. However, if you are required to dig deeper into an network issue, the following list of tools might help you out:

  • net-stats
  • pktcap-uw
  • nc
  • iperf


We’ll start of with one of my favorites; net-stats. This command can get you a lot of deep dive insights on what is happening under the covers of networking on a ESXi host as it can collect port stats and . The command is quite extensive as it allows for a lot of options. The net-stats -h command displays all flags. The most common one being the list option. Use net-stats -l to determine the switchport numbers and MAC addresses for all VMkernel interfaces, vmnic uplinks and vNIC ports. This information is also used for input for other tools described in the blog post.

To give some more examples, net-stats can also provide in-depth details on what worldlets (or CPU threads, listed as “sys”) are spun up for handling network IO by issuing net-stats with the following flags: net-stats -A -t vW. Output provided by these options help in verifying if NetQueue or Receive Side Scaling (RSS) is active for vmnic’s by mapping the “sys” output to the worldlet name using i.e. the vsi shell (vsish -e cat /world/<world id>/name).

Using different options, net-stats provides great insights on network behaviour.



Read More

Understanding the ESXi Network IOChain

In this blog post, we go into the trenches of the (Distributed) vSwitch with a focus on vSphere ESXi network IOChain. It is important to understand the core constructs of the vSphere networking layers for i.e. troubleshooting connectivity issues. In a second blog post on this topic, we will look closer into virtual network troubleshoot tooling.


The vSphere ESXi network IOChain is a framework that provides the capability to insert functions into the network data-path regardless of the usage of a vSphere Standard Switch (VSS) or a vSphere Distributed Switch (VDS). The IOChain is a group of functions that provides connectivity between ports and the vSwitch. A port has two IOChains based on the direction to and from the vSwitch. Meaning each port in a set is associated with it an input and an output IOChain. This allows for a modular approach by only including optional elements in an IOChain as configured by the user.

Examples of optional elements in an IOChain are VLAN support, NIC teaming, and traffic shaping. Looking at the high-level components in an ESXi network IOChain, we differentiate between the port group, the vSwitch (VSS or VDS) and the uplink level.

Port group level

This is where an optional configured VLAN is interpreted by the VLAN filter, allowing for VLAN dot1q tags for your port group. The security settings Promiscuous mode, MAC address changes, and Forged transmits are also set at the port group level. The user can also optionally configure traffic shaping, either egress only when using a VSS or bi-directional traffic shaping when using a VDS.

vSwitch (VSS or VDS) level

Incoming packets at the vSwitch level are forwarded to their destination using the forwarding engine. Incoming packets at the vSwitch level are forwarded to their destination using the forwarding engine. The forwarding engine contains port information paired with MAC address information. It’s job is to send the traffic to its proper destination. That can be either a VM residing on the same ESXi host or an external host.

The teaming engine is responsible for balancing network packets over the uplink interfaces. The way it does so is depended on the chosen teaming configuration by the user. The traffic shaper module is added to the IOChain if enabled in the port group level.

Uplink level

At this level, the traffic sent from the vSwitch to an external host finds its way to the driver module. This is where all the hardware offloading is taking place. The Supported hardware offloading features depends strongly on the physical NIC in combination with a specific driver module. Typically supported hardware offloading functions that in NICs are TCP Segment Offload (TSO), Large Receive Offload (LRO) or Checksum Offload (CSO). Network overlay protocol offloading like with VXLAN and Geneve, as used in NSX-v and NSX-T respectively, are widely supported on modern NICs.

Next to hardware offloading, the buffer mechanisms come into play in the Uplink level. I.e., when processing a burst of network packets, ring buffers come into play. Finally, the bits transmit onto the DMA controller to be handled by the CPU and physical NIC onwards to the Ethernet fabric.

Standard vSwitch

The following diagram puts all components together to form the IO chain for vSphere networking using a standard vSwitch: (more…)

Read More

vSphere Networking : Bandwidth Reservations

To enforce bandwidth availability, it is possible to reserve a portion of the available uplink bandwidth using Network I/O Control (NIOC). It may be necessary to configure bandwidth reservations to meet business requirements with regards to network resources availability. In the system traffic overview, under the resource allocation option in the Distributed vSwitch settings, you can configure reservations. Reservations are set per system traffic type or per VM.

Strongly depending on your IT architecture, it could make sense to reserve bandwidth for specific business critical workload, vSAN network or IP storage network backend. However, be aware that network bandwidth allocated in a reservation cannot be consumed by other network traffic types. Even when a reservation is not used to the fullest, NIOC does not redistribute the capacity to the bandwidth pool that is accessible to different network traffic types or network resource pools.

Since you cannot overcommit bandwidth reservations by default, it means you should be careful when applying reservations to ensure no bandwidth is gone to waste. Thoroughly think through the minimal amount of reservation that you are required to guarantee for network traffic types.

For NIOC to be able to guarantee bandwidth for all system traffic types, you can only reserve up to 75% of the bandwidth relative to the minimum link speed of the uplink interfaces.

When configuring a reservation, it guarantees network bandwidth for that network traffic type or VM. It is the minimum amount of bandwidth that is accessible. Unlike limits, a network resource can burst beyond the configured value for its bandwidth reservation, as it doesn’t state a maximum consumable amount of bandwidth.

You cannot exceed the value of the maximum reservation allowed. It will always keep aside 25% bandwidth per physical uplink to ensure the basic ESXi network necessities like Management traffic. As seen in the screenshot above, a 10GbE network adapter can only be configured with reservations up to 7.5 Gbit/s.

Bandwidth Reservation Example


Read More

vSphere Networking : Traffic Marking

vSphere network quality control features like the Network I/O Control (NIOC) feature is focused on the virtual networking layer within in a VMware virtual data center. But what about the physical network layer and how the two can cooperate?

In converged infrastructures or enterprise networking environments, Quality of Service (QoS) is commonly configured in the physical network layers. QoS is the ability to provide different priorities to network flows, or to guarantee a certain level of performance to a network flow by using tags. In vSphere 6.7, you have the ability to create flow-based traffic marking policies to mark network flows for QoS.

Quality of Service

vSphere 6.7 supports Class of Service (CoS) and Differentiated Services Code Point (DSCP). Both are QoS mechanisms used to differentiate traffic types to allow for policing network traffic flows.

As related to network technology, CoS is a 3-bit field that is present in an Ethernet frame header when 802.1Q VLAN tagging is present. The field specifies a priority value between 0 and 7, more commonly known as CS0 through CS7, that can be used by quality of service (QoS) disciplines to differentiate and shape/police network traffic. Source: https://en.wikipedia.org/wiki/Class_of_service

One of the main differentiators is that CoS operates at data link layer in an Ethernet based network (layer-2). DSCP operates at the IP network layer (layer-3).

Differentiated services or DiffServ is a computer networking architecture that specifies a simple and scalable mechanism for classifying and managing network traffic and providing quality of service (QoS) on modern IP networks. DiffServ uses a 6-bit differentiated services code point (DSCP) in the 8-bit differentiated services field (DS field) in the IP header for packet classification purposes. Source: https://en.wikipedia.org/wiki/Differentiated_services

When a traffic marking policy is configured for CoS or DSCP, its value is advertised towards the physical layer to create an end-to-end QoS path.

Traffic marking policies are configurable on Distributed port groups or on the DvUplinks. To match certain traffic flows, a traffic qualifier needs to be set. This can be realized using very specific traffic flows with specific IP address and TCP/UDP ports or by using a selected traffic type. The qualifier options are extensive. (more…)

Read More

vSphere Host Resources Deep Dive: Part 3 – VMworld

In my previous post, I mentioned the part 3 of the Host Deep Dive session at VMworld 2018. The ‘3’ is because we ran the part 1 and 2 at VMworld 2016 and 2017. We had the chance to try a new way of discussing Host Resources settings by the way of creating levels of performance tuning. The feedback we received will test-driving this session at the London and Indianapolis VMUG, was really positive.

As we always stated, the out-of-the-box experience of VMware vSphere is good enough for 80-90% of common virtual infrastructures. We like to show how you can gradually increase performance and reduce latency with advanced vSphere tuning. That’s why we came up with the Pyramid of Performance Optimization. Delivering the content this way allows for better understanding on when to apply certain optimizations. We will start with the basics and work our way up to settings to squeeze out the maximum performance of vSphere ESXi hosts.

Due to session time constrains, we will focus on compute (CPU and Memory) and virtual Networking. The following pyramids contain the subjects about content we will discuss in our session.

Pyramid of Performance Optimization – Compute:

Pyramid of Performance Optimization – Networking:

We will go trough all these levels in detail. We very much look forward to VMworld and hope to see you there! Be sure to reserve your seat for this session!

The following VMworld sessions are all based on the Deep Dive series:

  • vSphere Clustering Deep Dive, Part 1: vSphere HA and DRS [VIN1249BU]
  • vSphere Clustering Deep Dive, Part 2: Quality Control with DRS and Network I/O Control [VIN1735BU]
  • vSphere Host Resources Deep Dive: Part 3 [VIN1738BU]

Read More