I’m going through the immense archive of sessions from VMworld ’09, and am amazed by session BC2961 about VMware Fault Tolerance. There are a couple of things about FT that I did not know before.

Krishna Raj Raja, a Senior MTS in VMware’s Performance Group, explains what FT is, what it does and how it works in layman’s terms, so that even I can understand it:

Technology

VMware Fault Tolerance uses Record/Replay technology first found in VMware Workstation 6. This technology has been improved upon and included as a feature in vSphere 4. FT does not use binary translation, but depends heavily on hardware virtualization techniques from Intel and AMD. VMware proposed the hardware optimizations, and AMD and Intel included the changes in their newest processors. This hardware functionality is referred to as vLockStep. Check out VMware KB article 1008027 or use VMware Site Survey to check for a compatible processor. Also, hardware MMU (AMD RVI or Intel EPT) is disabled for VMs that have VMware FT turned on.

VMware FT works by recording non-deterministic events or inputs to the VM (disk reads, network receives (or rx), keystrokes, etc.) and certain CPU events like RDTSC and interrupts. Recording these requires far less logging than recording every single CPU instruction.
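To make the idea concrete, here is a toy sketch of record/replay in Python. It is not VMware's implementation: the `ToyVM` class, the event names and the state arithmetic are all made up for illustration. It only shows the principle that a deterministic machine fed the same logged inputs ends up in the same state.

```python
import random

MOD = 1_000_003  # arbitrary modulus to keep the toy state small


class ToyVM:
    """Hypothetical deterministic machine: state changes only through events."""

    def __init__(self):
        self.state = 0

    def step(self, event):
        kind, value = event
        if kind == "net_rx":
            self.state = (self.state * 31 + value) % MOD
        elif kind == "disk_read":
            self.state = (self.state * 17 + value) % MOD
        elif kind == "rdtsc":
            self.state = (self.state ^ value) % MOD
        else:
            raise ValueError(f"unknown event kind: {kind}")


def run_primary(num_events, seed=None):
    """Run the 'primary': consume non-deterministic inputs and log each one."""
    rng = random.Random(seed)
    vm = ToyVM()
    log = []
    for _ in range(num_events):
        kind = rng.choice(["net_rx", "disk_read", "rdtsc"])
        event = (kind, rng.randrange(1 << 16))
        log.append(event)  # record the input, not the instructions executed
        vm.step(event)
    return vm.state, log


def run_secondary(log):
    """Run the 'secondary': replay the logged events in the same order."""
    vm = ToyVM()
    for event in log:
        vm.step(event)
    return vm.state


if __name__ == "__main__":
    primary_state, log = run_primary(1000)
    secondary_state = run_secondary(log)
    print("states match:", primary_state == secondary_state)
```

Note that the log holds only the inputs, which is why it stays small compared to a full instruction trace.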

Because vLockStep does not record and replay every CPU instruction, but only certain events, CPU usage isn’t identical on both hosts. This difference can cause the ‘vLockStep interval’, or execution lag, to increase. Whenever the secondary host is busy, the primary VM will have to wait for the secondary VM. Once the secondary catches up with the primary (because CPU utilization on its host goes down), the interval decreases.

High Level Architecture

[Slide from BC2961: FT high-level architecture]

When setting up FT, a VMotion is done to create the secondary VM, and record/replay is started as soon as the VMotion is done. The primary VM carries out all disk and network I/O. The secondary VM replays all recorded events, but does not do actual work on the disk or network, as the hypervisor does not carry out these I/Os. The primary VM, however, waits for the secondary to acknowledge all I/O, to make sure that the state of both VMs is identical. This will probably increase latency, as the primary has to wait for the secondary’s acknowledgement before confirming the I/O to the client.

Log Bandwidth Estimation

Average log bandwidth (Mbits/sec) = [(Avg. disk read throughput (Mbytes/sec) * 8) + Avg. network receives (Mbits/sec)] * 1.2
To reduce log bandwidth, the secondary VM can be made to read the disk contents directly. See KB article 1011965 on how to enable this.
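
As a quick worked example, the formula above can be wrapped in a small Python helper. The function and parameter names are my own, and I am treating the 1.2 factor simply as the multiplier given in the formula.

```python
def estimate_ft_log_bandwidth_mbits(disk_read_mbytes_per_sec,
                                    net_rx_mbits_per_sec,
                                    factor=1.2):
    """Estimate average FT log bandwidth in Mbits/sec.

    disk_read_mbytes_per_sec: average disk read throughput in Mbytes/sec
    net_rx_mbits_per_sec:     average network receive rate in Mbits/sec
    factor:                   the 1.2 multiplier from the formula
    """
    disk_read_mbits_per_sec = disk_read_mbytes_per_sec * 8  # bytes -> bits
    return (disk_read_mbits_per_sec + net_rx_mbits_per_sec) * factor


if __name__ == "__main__":
    # Made-up example workload: 5 Mbytes/sec of disk reads, 20 Mbits/sec of rx.
    estimate = estimate_ft_log_bandwidth_mbits(5, 20)
    print(f"Estimated log bandwidth: {estimate:.1f} Mbits/sec")
```

For that made-up workload the estimate comes out at about 72 Mbits/sec: (5 * 8 + 20) * 1.2.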

Performance Benchmarks

I highly recommend that you watch Krishna’s presentation online at VMworld.com to see all the cool performance benchmarks he has done. They give great insight into the inner workings and behaviour of Fault Tolerance.

Summary

[Slide from BC2961: Summary]

Further Reading

In an unrelated closing note, I’d like to point all my readers and fellow bloggers to the VMware vSphere Blogging Contest.