VMware High Availability (HA) is a great feature that allows guest virtual machines in a cluster to survive a host failure. Some quick background: a cluster is a group of hosts that work together harmoniously and operate as a single unit. A host is a physical machine running a hypervisor such as ESX.
So, what does HA do? If a host in the cluster fails, then all of the guests running on that host fail with it. HA will power up those guests on another host in the cluster, which can reduce downtime significantly, especially if it’s 2 am and your datacenter is 30 minutes from your house. You can continue to sleep and address the host failure in the morning. Sounds great, so what’s the catch?
The catch is in how HA configures itself in the cluster. The first 5 hosts added to a cluster are called primary nodes and all the other hosts are secondary nodes. A primary node synchronizes the settings and status of all hosts in the cluster with the other primary nodes. A secondary node basically reports its status to the primary nodes. Secondary nodes can be promoted to primary nodes, but only under specific circumstances, such as putting a host into maintenance mode or disconnecting a node from the cluster. HA only needs one primary node to function. I don’t see a catch here…?
The catch comes in when you use a blade center. Suppose you have Chassis A and Chassis B:
We bought two blade chassis for redundancy: redundant power, redundant switches, and cluster hosts spread across both. If one chassis fails, the other one has plenty of resources. Fully redundant! Maybe. If I were to add my first 5 hosts to my cluster from chassis A, then all of my primary nodes would be on chassis A. If chassis A fails, NO guests from the failed hosts will be powered up on chassis B. Why? All chassis B hosts are secondary nodes and HA requires at least 1 primary! It’s 2 am and now you’re half asleep driving to the datacenter despite all the redundancy.
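To make the failure scenario concrete, here is a toy model in Python. It is only a sketch of the rule described above (first 5 hosts added are primary, HA needs at least one surviving primary) and does not talk to vCenter or any VMware API. The host names, the `chassis_of` mapping, and both function names are made up for illustration.

```python
PRIMARY_LIMIT = 5  # per the text: the first 5 hosts added become primary nodes


def assign_roles(add_order):
    """Label each host 'primary' or 'secondary' by the order it was added."""
    return {
        host: "primary" if index < PRIMARY_LIMIT else "secondary"
        for index, host in enumerate(add_order)
    }


def ha_survives(add_order, chassis_of, failed_chassis):
    """True if at least one primary node sits outside the failed chassis."""
    roles = assign_roles(add_order)
    return any(
        role == "primary" and chassis_of[host] != failed_chassis
        for host, role in roles.items()
    )


# Hypothetical hosts: a1..a5 live in chassis A, b1..b5 in chassis B.
chassis_of = {f"a{i}": "A" for i in range(1, 6)}
chassis_of.update({f"b{i}": "B" for i in range(1, 6)})

# All of chassis A is added first, then all of chassis B.
sequential = [f"a{i}" for i in range(1, 6)] + [f"b{i}" for i in range(1, 6)]

print(ha_survives(sequential, chassis_of, "A"))  # False: every primary was on A
print(ha_survives(sequential, chassis_of, "B"))  # True: the primaries on A are untouched
```

Losing chassis B is harmless in this model, but losing chassis A takes out every primary at once, which is exactly the 2 am drive described above.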
To avoid this issue, when adding hosts to a cluster, alternate between chassis.
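If you script or plan your host additions, the alternating order can be worked out ahead of time. The snippet below is a minimal sketch that only computes the order in which to add hosts; the host names and the `alternate` helper are hypothetical, and actually adding the hosts to the cluster is left to whatever tool you normally use.

```python
from itertools import zip_longest


def alternate(chassis_a, chassis_b):
    """Interleave two host lists: a1, b1, a2, b2, ... (handles unequal lengths)."""
    order = []
    for a, b in zip_longest(chassis_a, chassis_b):
        if a is not None:
            order.append(a)
        if b is not None:
            order.append(b)
    return order


# Hypothetical host names for the two chassis.
chassis_a = ["a1", "a2", "a3", "a4", "a5"]
chassis_b = ["b1", "b2", "b3", "b4", "b5"]

add_order = alternate(chassis_a, chassis_b)
first_five = add_order[:5]  # these would become the primary nodes
print(first_five)  # ['a1', 'b1', 'a2', 'b2', 'a3']
```

With this order, the first 5 hosts (the primaries) are split across both chassis, so losing either chassis still leaves at least one primary node.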