Nov 10 2009
 

Today I ran into this issue with a customer and wanted to write it up so that others can avoid it.  Fault Tolerance on vSphere is an awesome solution to maximize uptime.  There is one CPU scenario that may be a challenge, however:

This KB article (http://kb.vmware.com/kb/1008027) reads:

For VMware FT to be supported, the servers that host the virtual machines must each use a supported processor from the same category as documented below:

Intel Xeon based on 45nm Core 2 Microarchitecture Category:
3100 Series
3300 Series
5200 Series (DP)
5400 Series
7400 Series

Intel Xeon based on Core i7 Microarchitecture Category:
3500 Series
5500 Series

AMD 3rd Generation Opteron Category:
1300 and 1400 Series
2300 and 2400 Series (DP)
8300 and 8400 Series (MP)

Please note the requirement “same category.”  As an example, if you have one server with a 54xx series Intel processor and another with a 55xx series Intel processor (both have the technology for FT), you can vMotion and use DRS between them (via EVC), but you cannot run a Fault Tolerant pair across them.  The lockstep technology from Intel changed in the 35xx and 55xx CPUs and is not compatible with the lockstep in previous generations.
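
To make the “same category” rule concrete, here is a minimal Python sketch of the check.  The category table is copied from the KB list quoted above; the names FT_CATEGORIES, ft_category, and can_form_ft_pair are made up for illustration and are not part of any VMware API or tool.

```python
# Hypothetical lookup table built from the KB list quoted in this post.
# This is an illustration only, not a VMware API.
FT_CATEGORIES = {
    "Intel Xeon 45nm Core 2": {"3100", "3300", "5200", "5400", "7400"},
    "Intel Xeon Core i7": {"3500", "5500"},
    "AMD 3rd Generation Opteron": {"1300", "1400", "2300", "2400", "8300", "8400"},
}


def ft_category(series):
    """Return the FT category name for a CPU series, or None if it is not listed."""
    for category, members in FT_CATEGORIES.items():
        if series in members:
            return category
    return None


def can_form_ft_pair(series_a, series_b):
    """Two hosts can run an FT pair only if both CPUs are listed in the same category."""
    category_a = ft_category(series_a)
    category_b = ft_category(series_b)
    return category_a is not None and category_a == category_b


if __name__ == "__main__":
    print(can_form_ft_pair("5400", "5200"))  # both in the 45nm Core 2 category
    print(can_form_ft_pair("5400", "5500"))  # different categories
```

With this table, a 5400/5200 pair passes the check while the 5400/5500 pair from the example above does not, even though EVC would still let you vMotion between those two hosts.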

May 22 2009
 

I’m training all of my partner engineers this week and they always ask the toughest technical questions.  Thanks to Scott Phillips for asking me this one:

What does Fault Tolerance do to prevent a split brain if both Primary and Secondary VMs become isolated?

Fault Tolerance (FT) uses an on-disk generation number file.  When FT is enabled, the primary VM creates a file on shared storage called generation.N, where N is a counter.  The secondary VM is then started, and when it connects to the primary, the primary tells it the current generation number.  Once either the primary or the secondary detects a failure in the other half of the VM pair, it tries to rename the generation.N file to generation.N+1.  If the rename succeeds, that VM takes over as the primary (or remains the primary if it already was) and takes corrective action to rebuild a secondary and become protected again.  If the rename fails, the other VM in the pair has already renamed the file and taken over, so the current VM shuts down.
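
The rename race can be sketched in a few lines of Python.  This is only a simulation under stated assumptions: a local temporary directory stands in for the shared datastore, os.rename stands in for the rename on shared storage, and the function names try_take_over and simulate_isolation are invented for the example.  It is not how FT is actually implemented.

```python
import os
import tempfile


def try_take_over(shared_dir, generation):
    """Try to rename generation.N to generation.N+1.

    Returns True if this VM won the rename (it becomes or stays primary),
    False if the file was already renamed by the other VM (this VM shuts down).
    """
    current = os.path.join(shared_dir, f"generation.{generation}")
    successor = os.path.join(shared_dir, f"generation.{generation + 1}")
    try:
        os.rename(current, successor)
    except FileNotFoundError:
        # generation.N is gone, so the other half of the pair got there first.
        return False
    return True


def simulate_isolation():
    """Both VMs believe the other has failed and both attempt the rename."""
    with tempfile.TemporaryDirectory() as shared_dir:
        # The primary creates generation.1 when FT is enabled.
        open(os.path.join(shared_dir, "generation.1"), "w").close()
        first_attempt = try_take_over(shared_dir, 1)
        second_attempt = try_take_over(shared_dir, 1)
        return first_attempt, second_attempt


if __name__ == "__main__":
    print(simulate_isolation())
```

Whichever VM attempts the rename first wins; the second attempt finds no generation.1 file to rename and fails, which is the signal for that VM to shut down.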

There you have it: the disk subsystem prevents both VMs from becoming the primary at the same time and creating a split brain.