Objective 8.2 – Performing basic troubleshooting for VMware FT and third-party clusters
Analysing and evaluating VM population for maintenance mode considerations
The first consideration is which virtual machines have to be evacuated off the host, and which would have to be powered off instead, i.e. those not using shared storage. The second consideration is whether there is enough spare capacity on the remaining hosts to take the virtual machines from the host you want to put into maintenance mode. You can calculate this using the DRS resource usage charts; if VMware HA is enabled then there should be at least one server’s worth of capacity available. If you’re not using the VMware cluster features then you’ll need to manage resource usage manually.
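The capacity question above can be sketched in a few lines of Python. This is only an illustration, not a VMware tool: the `Host` class, the `can_enter_maintenance` helper and the host names and figures are all hypothetical, and in practice the numbers would come from the DRS resource usage charts. It considers memory only and assumes the load can be spread freely across the remaining hosts.

```python
from dataclasses import dataclass


@dataclass
class Host:
    name: str
    capacity_gb: int  # usable memory on the host (assumed figure)
    used_gb: int      # memory currently consumed by its virtual machines


def can_enter_maintenance(hosts, target, ha_reserve_hosts=0):
    """Return True if the other hosts can absorb the target host's load.

    ha_reserve_hosts is the number of hosts' worth of capacity to keep
    free for VMware HA; the largest remaining hosts are set aside.
    """
    remaining = [h for h in hosts if h.name != target]
    load = next(h.used_gb for h in hosts if h.name == target)
    # Set aside the largest remaining hosts as HA failover capacity.
    by_size = sorted(remaining, key=lambda h: h.capacity_gb, reverse=True)
    reserved = sum(h.capacity_gb for h in by_size[:ha_reserve_hosts])
    total_capacity = sum(h.capacity_gb for h in remaining)
    total_used = sum(h.used_gb for h in remaining)
    spare = total_capacity - reserved - total_used
    return spare >= load


hosts = [Host("esx1", 64, 40), Host("esx2", 64, 30), Host("esx3", 64, 20)]
print(can_enter_maintenance(hosts, "esx1"))                      # no HA reserve
print(can_enter_maintenance(hosts, "esx1", ha_reserve_hosts=1))  # HA enabled
```

With these made-up figures the evacuation fits when no HA capacity is held back, but not once one host's worth of capacity is reserved for HA.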
Understanding manual third-party failover and failback processes
The third-party clustering solution supported by VMware is MSCS, in the following scenarios:
- single host
- two hosts (cluster across both hosts)
- virtual machine to physical machine
In the virtual machine to physical machine cluster you would set the preferred host to the physical machine and configure failback so the physical machine is used when it is available.
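The preferred host and failback behaviour described above can be modelled with a small Python sketch. It is not MSCS code and does not reflect how MSCS makes the decision internally; the `choose_active_node` function and the node names are hypothetical and exist only to show the logic.

```python
def choose_active_node(preferred, online_nodes, current, failback=True):
    """Pick the node that should own the clustered resource group."""
    if failback and preferred in online_nodes:
        return preferred  # fail back to the preferred node when it is up
    if current in online_nodes:
        return current    # stay put while the current owner is healthy
    # The current owner is down: fail over to any surviving node.
    return sorted(online_nodes)[0] if online_nodes else None


# Physical node is down, so the virtual machine takes over.
print(choose_active_node("physical", {"vm"}, current="physical"))
# Physical node returns and failback is configured.
print(choose_active_node("physical", {"physical", "vm"}, current="vm"))
```

Passing `failback=False` in the second call would leave the resource group on the virtual machine until it is moved back manually.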
Troubleshooting fault tolerance partial or unexpected failures
VMware FT guidelines
- Maximum of four FT virtual machines per host (counting both primaries and secondaries).
- If using NFS for shared storage, a dedicated NAS appliance with at least gigabit Ethernet.
- If using resource pools, the memory reservation must include the memory allocated to all of the pool’s virtual machines plus virtualisation overhead.
- No more than 16 virtual disks per FT virtual machine.
- A minimum of three hosts per VMware cluster.
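The numeric guidelines above lend themselves to a quick sanity check. The following Python sketch is an assumption-laden illustration: the `check_ft_guidelines` function and the dictionary layout are invented for this example, and a real inventory would have to be gathered from vCenter by other means.

```python
MAX_FT_VMS_PER_HOST = 4
MAX_DISKS_PER_FT_VM = 16
MIN_HOSTS_PER_CLUSTER = 3


def check_ft_guidelines(cluster):
    """Return a list of guideline violations for a cluster description.

    cluster maps a host name to a list of (vm_name, disk_count) tuples,
    one entry per FT primary or secondary placed on that host.
    """
    problems = []
    if len(cluster) < MIN_HOSTS_PER_CLUSTER:
        problems.append(
            f"cluster has {len(cluster)} hosts, need {MIN_HOSTS_PER_CLUSTER}")
    for host, ft_vms in cluster.items():
        if len(ft_vms) > MAX_FT_VMS_PER_HOST:
            problems.append(
                f"{host}: {len(ft_vms)} FT VMs, max {MAX_FT_VMS_PER_HOST}")
        for vm, disks in ft_vms:
            if disks > MAX_DISKS_PER_FT_VM:
                problems.append(
                    f"{vm} on {host}: {disks} disks, max {MAX_DISKS_PER_FT_VM}")
    return problems


cluster = {
    "esx1": [("sql01", 4), ("sql02-secondary", 2)],
    "esx2": [("sql01-secondary", 4), ("sql02", 2)],
}
for problem in check_ft_guidelines(cluster):
    print(problem)
```

The example cluster only has two hosts, so the check reports the minimum-host guideline as the single violation.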
Troubleshooting VMware FT
If you experience partial storage hardware failures, such as degraded performance or complete loss of connectivity, then an FT virtual machine will fail over to its secondary. The VMkernel logs and esxtop can help identify degraded performance.
A partial network hardware failure will cause a failover if it is the FT logging network that fails. Dedicating network links to FT logging and VMotion, with NIC teaming, is best practice.
If the FT logging network experiences congestion then this will cause virtual machines to fail over. Spreading high-activity FT virtual machines across other hosts should stop this happening.
You may find that a very busy virtual machine will not VMotion to another host whilst FT is enabled. Retry the VMotion when the virtual machine is less busy.
Heavy VMFS locking can cause FT failovers. VMFS locking is caused by VM power-ons, power-offs, snapshots and VMotions, so limit the number of these activities on volumes where FT virtual machines reside. Also be cautious of free space on the VMFS volume; if free space is low the secondary may not power on.
Other troubleshooting problems
Hardware virtualisation (AMD-V and Intel VT) must be enabled in the BIOS, and the host processors must be compatible with FT and with each other.
If the processor on the host running the secondary FT virtual machine is overcommitted, then the primary FT virtual machine may have to slow down to stay in sync; consider applying CPU reservations to stop this happening.
If a virtual machine has 15GB or more of RAM and the RAM is changing very frequently, then VMware FT (and vMotion, for that matter) may not be able to copy or synchronise the RAM quickly enough for either operation to complete successfully. Consider changing ft.maxSwitchoverSeconds=#, where # is the number of seconds.
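To see why a large, busy virtual machine can fail to synchronise, it helps to run the arithmetic. The Python sketch below uses a deliberately simple iterative pre-copy model, which is an assumption made for illustration and not a description of how VMware FT or vMotion actually transfer memory. The `estimate_precopy_seconds` function, its parameters and the figures are all hypothetical.

```python
def estimate_precopy_seconds(ram_gb, dirty_gb_per_s, link_gb_per_s,
                             stop_threshold_gb=0.1, max_rounds=50):
    """Estimate the time to copy RAM using a simple pre-copy model.

    Each round sends the memory that is still outstanding; while it is
    being sent, the guest dirties more memory that must go in the next
    round. Returns None if the copy never converges.
    """
    if dirty_gb_per_s >= link_gb_per_s:
        return None  # memory changes faster than it can be sent
    remaining = ram_gb
    elapsed = 0.0
    for _ in range(max_rounds):
        round_time = remaining / link_gb_per_s
        elapsed += round_time
        remaining = dirty_gb_per_s * round_time  # dirtied during this round
        if remaining <= stop_threshold_gb:
            return elapsed  # small enough to finish in a brief switchover
    return None


# 16GB of RAM over a link moving 0.125GB/s (roughly gigabit Ethernet).
print(estimate_precopy_seconds(16, 0.0625, 0.125))  # moderate dirty rate
print(estimate_precopy_seconds(16, 0.2, 0.125))     # dirtied faster than sent
```

In this model the copy only converges when the dirty rate is below the link throughput, which is why reducing activity in the guest or allowing a longer switover window are the two levers available.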
Expect to see a higher number of CPU cycles being consumed on hosts where secondary FT virtual machines are located, as the FT replay operation consumes more CPU cycles.