Triggered by some feedback on the VMware subreddit, I was wondering what is holding us back from adopting the vSphere Fault Tolerance (FT) feature. Comments on Reddit stated that although the increased availability is desirable, the performance impact keeps people from actually using it in production environments.
Use cases for FT could be, according to the vSphere 6 documentation center:
- Applications that need to be available at all times, especially those that have long-lasting client connections that users want to maintain during hardware failure.
- Custom applications that have no other way of doing clustering or other forms of application resiliency.
- Cases where high availability might be provided through custom clustering solutions, which are too complicated to configure and maintain.
However, the stated use cases focus only on availability and do not account for the performance impact of enabling FT. Is there a sweet spot for applications that need high resiliency but not maximum performance, and that could cope with the added latency of FT? It really depends on the application workload. A SQL server typically generates more FT traffic than, for instance, a webserver that primarily transmits. So enabling FT will affect some workloads more than others.
Since vSphere 6 introduced Multi-Processor Fault Tolerance (SMP-FT), the requirements for FT are a bit more flexible. The compute maximums for an FT enabled VM are 4 vCPUs and 64GB memory. The use of eager zeroed thick disks is no longer a requirement. So thin, lazy zeroed thick and eager zeroed thick provisioned disks are all supported in SMP-FT!
It is still required that you use a layer-2 network for FT. There is no strict requirement for bandwidth and network performance, although it is stated that a minimum of 10GbE is more suitable for an FT network. Keep in mind that the underlying network is an important factor in the performance of FT enabled VMs.
You could opt for a 1GbE FT network. Enabling FT will then trigger a warning, but you will still be able to use it. Looking at the tests I’ve done, it wouldn’t take too many FT enabled VMs to consume 1 Gigabit. So go with 10GbE if possible!
Enabling Fault Tolerance is quick and easy. Just select your datastore for the secondary VM. Remember, this does not have to be the same datastore any more. On the contrary, you want to select another datastore for your secondary VM, possibly maximizing availability when using another storage backend for it.
And you’re good to go!
To determine if and how performance is impacted when using Fault Tolerance, I created a test scenario using the benchmark tool DVDstore.
I used a small setup containing an MSSQL server (8GB mem, VMXNET3 NIC and VMware paravirtual SCSI controller) and a DVDstore client server. The results of the tests are based on measurements of the CPU usage of the MSSQL server. The FT network is a layer-2 10GbE network. During testing, no other traffic traversed that switch. I ran the same DVDstore test for all scenarios.
The DVDstore command used: ds2sqlserverdriver.exe --target=192.168.150 --run_time=15 --db_size=20GB --n_threads=25 --ramp_rate=5 --pct_newcustomers=10 --warmup_time=0 --think_time=0.085
FT bandwidth utilization can be seen in vCenter. When you enable FT for a VM, a ‘widget’ showing it appears.
The following results were recorded during tests.
Adding more vCPUs shows the application’s ability to use multiple threads. The DVDstore OPM (Orders per Minute) value increases noticeably as vCPUs are added, only to drop when FT is enabled. A consistent drop of ~47% is noted during the FT enabled VM tests.
| Test | FT disabled | FT enabled | Difference |
|---|---|---|---|
| OPM test 1 vCPU | 12291 | 6418 | -48% |
| OPM test 2 vCPU | 13164 | 7023 | -47% |
| OPM test 4 vCPU | 14139 | 7458 | -47% |
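For anyone who wants to check the difference column or add their own measurements, here is a minimal Python sketch. The OPM numbers are copied from the table above; the names `RESULTS` and `percent_change` are just mine for illustration.

```python
# OPM values copied from the results table: (FT disabled, FT enabled).
# RESULTS and percent_change are illustrative names, not part of DVDstore.
RESULTS = {
    1: (12291, 6418),
    2: (13164, 7023),
    4: (14139, 7458),
}


def percent_change(baseline: float, measured: float) -> float:
    """Relative change of `measured` versus `baseline`, in percent."""
    return (measured - baseline) / baseline * 100.0


for vcpus, (ft_off, ft_on) in RESULTS.items():
    # Round to whole percent, as in the table.
    print(f"{vcpus} vCPU: {round(percent_change(ft_off, ft_on))}%")
```

Rounded to whole percentages this gives the same -48%, -47% and -47% as listed in the table.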
Also, the consumed FT bandwidth grows substantially when adding vCPUs to an FT enabled VM. 1 vCPU shows an average bandwidth utilization of ~110 Mbit/s, 2 vCPUs ~140 Mbit/s and 4 vCPUs around 180 Mbit/s. As said before, this all depends on your application characteristics.
These tests are intended to give you a rough idea of how performance is impacted when using FT. As expected, performance is impacted quite a bit. On the other hand, note that the VM is now provided with continuous availability, with no downtime or data loss in the event of a host failure!
So the question is: will the improved availability outweigh the performance penalty?
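To put these averages next to the 1GbE versus 10GbE choice, here is a rough back-of-the-envelope sketch in Python. It assumes that every FT enabled VM behaves like my DVDstore test VM, that average bandwidth simply adds up, and that the full link speed is usable. None of that will hold exactly in practice (peaks are higher than averages), so treat the outcome as an upper bound. The names `AVG_FT_MBIT` and `vms_per_link` are made up for this example.

```python
import math

# Average FT logging bandwidth per VM in Mbit/s, taken from the tests above.
# Keyed by vCPU count; illustrative only, other workloads will differ.
AVG_FT_MBIT = {1: 110, 2: 140, 4: 180}


def vms_per_link(link_mbit: float, per_vm_mbit: float) -> int:
    """Upper bound on identical FT VMs that fit on a link.

    Assumes average bandwidth adds linearly and the whole link is usable.
    """
    return math.floor(link_mbit / per_vm_mbit)


for vcpus, mbit in AVG_FT_MBIT.items():
    on_1gbe = vms_per_link(1_000, mbit)
    on_10gbe = vms_per_link(10_000, mbit)
    print(f"{vcpus} vCPU VMs: {on_1gbe} on 1GbE, {on_10gbe} on 10GbE")
```

With these numbers a 1GbE link is already full on averages alone with fewer than ten such VMs, which is why I would go for 10GbE.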