Fault Tolerance is crucial in certain business-critical environments.
We do not want the VMs to have even a single second of downtime when the host goes down. What do we do? We enable Fault Tolerance (FT) on them.
Legacy FT provided fault tolerance only for VMs with one vCPU. So if we wanted FT protection for the vCenter VM, for example, it was not possible. Why? Because the vCenter VM requires a minimum of 2 vCPUs.
Now, FT support and the way FT works have been redesigned in vSphere 6.0.
Here, I will provide a basic understanding of the new FT features and how they work.
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Basically, FT provides zero downtime for a VM in an HA cluster.
The basic working is this: Host 1 runs a VM with FT enabled. This is the primary VM. Enabling FT creates a secondary VM on Host 2, which runs simultaneously and in sync with the VM on Host 1.
Should Host 1 go down, the VM on Host 2 becomes the primary VM, and a new secondary VM is created on another host in the cluster. The process repeats on every subsequent failure.
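The failover sequence described above can be sketched as a small state model. This is only a toy illustration, not how vSphere implements it: the `FTPair` class, the `host_failed` method, and the host names are all made up for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FTPair:
    """Toy model of one FT-protected VM; hosts are plain strings."""
    primary: str
    secondary: Optional[str]
    spare_hosts: List[str] = field(default_factory=list)

    def host_failed(self, host: str) -> None:
        """React to a host failure the way the text describes."""
        if host == self.primary:
            # The secondary takes over as the new primary.
            self.primary = self.secondary
            self.secondary = None
        elif host == self.secondary:
            # Only the protection is lost; the primary keeps running.
            self.secondary = None
        else:
            # An unrelated host failed; it can no longer serve as a spare.
            if host in self.spare_hosts:
                self.spare_hosts.remove(host)
            return
        # Re-create the secondary on another host, if one is available.
        if self.spare_hosts:
            self.secondary = self.spare_hosts.pop(0)


pair = FTPair(primary="host1", secondary="host2", spare_hosts=["host3", "host4"])
pair.host_failed("host1")
print(pair.primary, pair.secondary)  # host2 host3
pair.host_failed("host2")
print(pair.primary, pair.secondary)  # host3 host4
```

Note that once the spare hosts run out, the VM keeps running but has no secondary, so it is no longer protected.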
The old FT (Legacy FT) used a technology called 'Record and Replay' to keep the primary and the secondary VM in sync with each other. This is no longer used for new FT VMs in vSphere 6.0. However, it still exists for VMs that were upgraded to vSphere 6.
Legacy FT supports only one vCPU.
In FT 6.0, support has been added for VMs with more than one vCPU (up to 4 in vSphere 6.0). Such a VM is an SMP VM: a Symmetric Multiprocessing VM is simply a virtual machine with more than one vCPU. This usually provides improved application performance.
Some other features of the new FT: it supports up to 64 GB of memory for the VM, and it supports vMotion of both the primary and the secondary VM. The primary and secondary can be vMotioned independently without breaking the FT link.
However, we cannot take a snapshot of an FT-protected VM from the vSphere client or the command line.
The main feature of FT 6 is that it creates a second copy of the VM files on a second datastore. This protects against datastore failures as well. However, sharing of disk files between primary and secondary is not supported.
What the new FT provides that the legacy FT did not:
1. EVC-enabled clusters are supported in FT 6, whereas legacy FT cannot be enabled in an EVC-enabled cluster. This meant that for legacy FT to run, the servers needed to have the same CPU and ESXi versions.
2. Hot enable of FT is now supported.
3. Legacy FT could use only thick-provisioned eager-zeroed disks. With the new FT we can use thin disks as well.
4. We have VMDK redundancy in case of datastore failure.
5. The network requirement is 10 Gbps.
6. DRS is partially supported in the new FT.
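The limits mentioned so far can be collected into a quick pre-check. This is a hedged sketch, not a VMware tool: `VMSpec` and `ft_blockers` are hypothetical names, the 64 GB and 10 Gbps figures come from this post, and the 4 vCPU ceiling is my assumption for vSphere 6.0 that you should confirm against the official configuration maximums.

```python
from dataclasses import dataclass
from typing import List

MAX_VCPUS = 4             # assumption for vSphere 6.0; verify against VMware docs
MAX_MEMORY_GB = 64        # from the feature list in this post
MIN_FT_NETWORK_GBPS = 10  # from the feature list in this post


@dataclass
class VMSpec:
    """Hypothetical description of a VM we would like to protect with FT."""
    name: str
    vcpus: int
    memory_gb: int
    ft_network_gbps: float
    in_ha_cluster: bool


def ft_blockers(vm: VMSpec) -> List[str]:
    """Return the reasons (if any) why FT 6 could not be enabled on this VM."""
    problems = []
    if vm.vcpus > MAX_VCPUS:
        problems.append(f"{vm.vcpus} vCPUs exceeds the limit of {MAX_VCPUS}")
    if vm.memory_gb > MAX_MEMORY_GB:
        problems.append(f"{vm.memory_gb} GB memory exceeds {MAX_MEMORY_GB} GB")
    if vm.ft_network_gbps < MIN_FT_NETWORK_GBPS:
        problems.append(f"FT network must be at least {MIN_FT_NETWORK_GBPS} Gbps")
    if not vm.in_ha_cluster:
        problems.append("host is not in an HA-enabled cluster")
    return problems


vcenter = VMSpec("vcenter", vcpus=2, memory_gb=16, ft_network_gbps=10, in_ha_cluster=True)
print(ft_blockers(vcenter))  # []
```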
FT 6 creates two complete VMs, each with its own .vmx and VMDK files on different datastores. After a failover, a full copy of all the VM files must be made again so that the new secondary has an identical set of VM files. This can be time consuming.
FT Split Brain. What is it?
Consider there are two hosts: Host 1 (Primary) and Host 2 (Secondary)
The primary host runs the primary VM and the secondary host runs the secondary VM. The two are in constant communication over the network.
Generally, if the primary host goes down, the secondary host and its VM take over as primary.
Now, let's consider a scenario where the network link between the primary and the secondary itself goes down. In this case the secondary will think the primary host is down and will elect itself as the primary host. Well, here we have two primary hosts and two primary VMs, and no secondary at all.
This scenario is called the split brain. We want to avoid this situation.
So, we use something called tie-breaker files. Well, well, well, what is this now?
The primary and the secondary both have access to shared storage. On this storage we place some files that are used to determine whether the primary host is still available. The two files are shared.vmft and .ftgeneration.
The .ftgeneration file is the one that determines whether there is a split-brain scenario. Through this file, the two hosts can still coordinate even when the network link between them is down.
We can also share the primary .vmx and the secondary .vmx files.
However, sharing of the VMDK files is not recommended as it will not provide redundancy.
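To make the tie-breaker idea concrete, here is a minimal sketch of how a shared file can settle which side goes live. It is a simplified model under my own assumptions, not VMware's actual on-disk protocol: the `SharedDatastore` class, the integer `generation` counter, and `on_ft_link_lost` are all invented for illustration. The key point is that the claim is atomic, so only one host can win.

```python
import threading


class SharedDatastore:
    """Stand-in for the shared storage that holds the tie-breaker file."""

    def __init__(self) -> None:
        self._lock = threading.Lock()  # models an atomic storage operation
        self.generation = 0            # hypothetical content of .ftgeneration
        self.owner = None

    def try_claim(self, host: str, seen_generation: int) -> bool:
        """Atomically bump the generation if nobody else has done so yet."""
        with self._lock:
            if self.generation != seen_generation:
                return False  # the other host already won this round
            self.generation += 1
            self.owner = host
            return True


def on_ft_link_lost(host: str, store: SharedDatastore, seen_generation: int) -> str:
    """Decide what a host should do once it loses contact with its peer."""
    if store.try_claim(host, seen_generation):
        return "go live as primary"
    return "stand down"


store = SharedDatastore()
seen = store.generation  # both hosts last saw the same generation
print(on_ft_link_lost("host2", store, seen))  # go live as primary
print(on_ft_link_lost("host1", store, seen))  # stand down
```

Whichever host reaches the storage first wins; the other one sees that the generation has moved on and does not promote itself, so we never end up with two primaries.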
How do the VMs communicate with each other?
1. Legacy FT
Record/Replay or Virtual Lockstep.
Here the primary VM receives some input. This input is sent to the vCPU of the primary VM, and a copy of the same input is sent to the vCPU of the secondary VM. Both execute the input at the same time and generate the same result. However, this process is much harder to run on SMP VMs, i.e. VMs with more than one vCPU, as it is hard to determine which vCPU should receive which input and in what order.
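A tiny sketch shows why lockstep replay is easy with one vCPU and hard with several. This is a conceptual toy, not the real Record/Replay engine: `run`, `run_smp`, and the sample operations are made up for the example.

```python
def run(inputs):
    """Single-vCPU VM: apply the logged inputs strictly in order."""
    state = 0
    for value in inputs:
        state = state * 31 + value
    return state


log = [3, 1, 4, 1, 5]
primary = run(log)
secondary = run(list(log))   # replay a copy of the recorded log
print(primary == secondary)  # True


def run_smp(cpu0_ops, cpu1_ops, schedule):
    """Two vCPUs updating shared memory; schedule says which one runs next."""
    ops = {0: list(cpu0_ops), 1: list(cpu1_ops)}
    shared = 1
    for cpu in schedule:
        op = ops[cpu].pop(0)
        shared = op(shared)
    return shared


def double(x):
    return x * 2


def add_three(x):
    return x + 3


a = run_smp([double], [add_three], schedule=[0, 1])  # (1 * 2) + 3
b = run_smp([double], [add_three], schedule=[1, 0])  # (1 + 3) * 2
print(a, b)  # 5 8
```

With one vCPU the same log always gives the same state. With two vCPUs the same inputs give different results depending on the interleaving, so the secondary would also need the exact ordering, which is what makes lockstep impractical for SMP VMs.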
2. New FT - Fast Checkpointing
Here, instead of taking the input of the primary and sending a copy of it to the secondary, the primary takes the input, executes it, and sends the resulting state to the secondary VM over the FT network link. These updates are sent periodically between the two VMs.
When FT is first enabled, an exact copy of primary is created on the secondary host.
This process is similar to a modified version of XVMotion: it copies the memory and disk of the VM between hosts and datastores respectively.
The primary holds each outgoing network packet until it receives an acknowledgment from the secondary VM for the previously sent update. This ensures that both VMs are in sync with the latest data.
As of now we can use only one 10 Gb network for this.
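The "execute, checkpoint, then release output" idea can be sketched as follows. Again this is only a simplified model under my own assumptions, not the actual Fast Checkpointing code: the `Primary` and `Secondary` classes, the dictionary used as VM state, and the method names are all hypothetical.

```python
from collections import deque


class Secondary:
    """Receives checkpoints and acknowledges them."""

    def __init__(self):
        self.state = {}

    def apply_checkpoint(self, changes):
        self.state.update(changes)
        return True  # acknowledgment sent back to the primary


class Primary:
    """Executes work, ships changed state, and holds output until acked."""

    def __init__(self, secondary):
        self.secondary = secondary
        self.state = {}
        self.dirty = {}               # changes since the last checkpoint
        self.held_packets = deque()   # outgoing packets not yet released
        self.released = []            # packets actually sent to clients

    def execute(self, key, value, reply):
        self.state[key] = value
        self.dirty[key] = value
        self.held_packets.append(reply)  # do not send the reply yet

    def checkpoint(self):
        acked = self.secondary.apply_checkpoint(dict(self.dirty))
        if acked:
            self.dirty.clear()
            # The secondary now has this state, so the output is safe to send.
            while self.held_packets:
                self.released.append(self.held_packets.popleft())


sec = Secondary()
pri = Primary(sec)
pri.execute("balance", 100, "OK balance=100")
print(pri.released)  # []
pri.checkpoint()
print(pri.released, sec.state == pri.state)  # ['OK balance=100'] True
```

Holding the reply means a client never sees a result that the secondary does not know about, so a failover at any moment stays consistent from the outside.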
Important notes:
FT can be enabled only within an HA-enabled cluster.
DRS does not load balance FT-protected VMs. However, it will balance the non-FT VMs.
FT is young, FT is promising, and there is more to come in future releases.
As always, stay informed, keep learning.
Suhas