vSphere Fault Tolerance Performance Impact

Triggered by some feedback on the VMware subreddit, I was wondering what is holding us back from adopting the vSphere Fault Tolerance (FT) feature. Commenters on Reddit stated that although the increased availability is desirable, the performance impact of Fault Tolerance keeps them from actually using it in production environments.

Use cases for FT could be, according to the vSphere 6 documentation center:

Applications that need to be available at all times, especially those that have long-lasting client connections that users want to maintain during hardware failure.
Custom applications that have no other way of doing clustering or other forms of application resiliency.
Cases where high availability might be provided through custom clustering solutions, which are too complicated to configure and maintain.

However, the stated use cases focus only on availability and do not seem to take into account the performance impact of enabling FT. Is there a sweet spot for applications that need high resiliency, but do not require immense performance and could cope with a latency impact due to FT? It depends on the application workload. A SQL server typically generates more FT traffic than, for instance, a web server that primarily transmits data. So enabling FT will affect some workloads more than others.

Requirements

Since vSphere 6 introduced Multi-Processor Fault Tolerance (SMP-FT), the requirements for FT are a bit more flexible. The compute maximums for an FT-enabled VM are 4 vCPUs and 64GB memory. The use of eager zeroed thick disks is no longer a requirement. So thin, lazy zeroed thick, and eager zeroed thick provisioned disks are all supported in SMP-FT!

It is still required that you use a layer-2 network for FT. There is no strict requirement for bandwidth and network performance, although it is stated that 10GbE or faster is more suitable for an FT network. Keep in mind that the underlying FT network is an important factor in the performance of FT-enabled VMs.

You could opt for a 1GbE FT network. Enabling FT will then trigger a warning, but you will still be able to use the 1GbE FT network. Looking at the tests I’ve done, it wouldn’t take too many FT-enabled VMs to consume 1 Gigabit. So go with 10GbE if possible!

Configure FT

Enabling Fault Tolerance is quick and easy. Just select a datastore for the secondary VM. Remember, this no longer has to be the same datastore the primary VM uses. On the contrary, you want to select another datastore for your secondary VM, ideally on a different storage backend, to maximize availability.
After that, select a host for your secondary VM:

And you’re good to go!

Performance tests

To determine if and how performance is impacted when using Fault Tolerance, I created a test scenario using the benchmark tool DVD Store.

I used a small setup containing an MSSQL server (8GB memory, VMXNET3 NIC and VMware Paravirtual SCSI controller) and a DVD Store client server. The results of the tests are based on measurements of the CPU usage of the MSSQL server. The FT network is a layer-2 10GbE network. During testing, no other traffic traversed that switch. I ran the same DVD Store test for all the scenarios.

The DVD Store command used: ds2sqlserverdriver.exe --target=192.168.150 --run_time=15 --db_size=20GB --n_threads=25 --ramp_rate=5 --pct_newcustomers=10 --warmup_time=0 --think_time=0.085

FT Bandwidth utilization can be seen in vCenter. When you enable FT for a VM, the following ‘widget’ appears.

Results

The following results were recorded during tests.

Adding more vCPUs shows the application’s ability to use multiple threads. The DVD Store OPM (Orders per Minute) value increases noticeably as vCPUs are added, only to drop when FT is enabled. A consistent drop of ~47% is noted across the tests with FT enabled.

	FT disabled	FT enabled	Difference
OPM test 1 vCPU	12291	6418	-48%
OPM test 2 vCPU	13164	7023	-47%
OPM test 4 vCPU	14139	7458	-47%
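
The Difference column can be recomputed from the two OPM columns. A minimal Python sketch, assuming the difference is the relative change of the FT-enabled value against the FT-disabled value, rounded to whole percent (the names `results` and `pct_change` are just illustrative):

```python
# OPM (orders per minute) values copied from the results table above:
# vCPU count -> (FT disabled, FT enabled)
results = {
    1: (12291, 6418),
    2: (13164, 7023),
    4: (14139, 7458),
}


def pct_change(baseline, measured):
    """Relative change of measured versus baseline, in percent."""
    return (measured - baseline) / baseline * 100


# Round to whole percent, as in the table.
differences = {
    vcpus: round(pct_change(ft_off, ft_on))
    for vcpus, (ft_off, ft_on) in results.items()
}

for vcpus, diff in differences.items():
    print(f"{vcpus} vCPU: {diff}%")
```

The same helper can be reused for your own before and after measurements, whatever the benchmark.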

Also, the consumed FT bandwidth grows substantially when adding vCPUs to an FT-enabled VM. 1 vCPU shows an average bandwidth utilization of ~110Mbit/s, 2 vCPUs ~140Mbit/s and 4 vCPUs around 180Mbit/s. As said before, this all depends on your application characteristics.
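
To get a feel for how quickly a 1GbE FT network fills up, the averages above can be turned into a rough estimate. This is only a sketch in Python under loud assumptions: the figures are treated as Mbit/s, every FT-enabled VM behaves like the DVD Store test VM, nothing else shares the link, and peaks and protocol overhead are ignored. Real numbers will be lower.

```python
# Average FT bandwidth per VM in Mbit/s, taken from the measurements above:
# vCPU count -> average FT traffic
ft_bandwidth_mbit = {1: 110, 2: 140, 4: 180}


def max_ft_vms(link_mbit, per_vm_mbit):
    """Upper bound on FT VMs whose *average* traffic fits on the link.

    Assumption: no headroom for peaks and no other traffic on the link.
    """
    return link_mbit // per_vm_mbit


# Estimate for a 1GbE and a 10GbE FT network.
estimate = {
    link_mbit: {
        vcpus: max_ft_vms(link_mbit, bandwidth)
        for vcpus, bandwidth in ft_bandwidth_mbit.items()
    }
    for link_mbit in (1_000, 10_000)
}

for link_mbit, per_size in estimate.items():
    for vcpus, count in per_size.items():
        print(f"{link_mbit} Mbit/s link, {vcpus} vCPU VMs: at most {count}")
```

With these averages, a 1GbE link tops out at about five 4 vCPU FT VMs of this kind, which is why 10GbE is the safer choice.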

To conclude

These tests are intended to give you a rough idea of how performance is impacted when using FT. As expected, performance is impacted quite a bit. On the other hand, note that the VM is now provided with continuous availability, with no downtime or data loss in the event of a host failure!

So the question is: will the improved availability outweigh the performance penalty?