VMware vSphere HA Mode for vGPU

The following post was written and scenario tested by my colleagues and friends Keith Keogh and Andrew Breedy of the Limerick, Ireland CCC engineering team. I helped with editing and art. Enjoy.

VMWare has added High Availability (HA) for virtual machines (VM’s) with an NVIDIA GRID vGPU shared pass-through device in vSphere 6.5. This feature allows any vGPU VM to automatically restart on another available ESXi host with an identical NVIDIA GRID vGPU profile if the VM’s hosting server has failed.
To better understand how this feature would work in practice we decided to create some failure scenarios. We tested how HA for vGPU virtual machines performed in the following situations:
1) Network outage: Network cables get disconnected from the virtual machine host.
2) Graceful shutdown: A host has a controlled shut down via the ESXi console.
3) Power outage: A host has an uncontrolled shutdown, i.e. the server loses AC power.
4) Maintenance mode: A host is put into maintenance mode.

The architecture of the scenario used for testing was laid out as follows:

Viewed another way, we have 3 hosts, 2 with GPUs, configured in an HA VSAN cluster with a pool of Instant Clones configured in Horizon.

clus1

1) For this scenario, we used three Dell vSan Ready Node R730 servers configured per our C7 spec, with the ESXi 6.5 hypervisor installed, clustered and managed through a VMWare vCenter 6.5 appliance.
2) VMWare VSAN 6.5 was configured and enabled to provide shared storage.
3) Two of the three servers in the cluster had an NVIDIA M60 GPU card and the appropriate drivers installed in these hosts. The third compute host did not have a GPU card.
4) 16 vGPU virtual machines were placed on one host. The vGPU VM’s were a pool of linked clones created using VMWare Horizon 7.1. The virtual machine Windows 10 master image had the appropriate NVIDIA drivers installed and an NVIDIA GRID vGPU shared PCI pass through device added.
5) All 4GB of allocated memory was reserved on the VM’s and the M60-1Q profile assigned which has a maximum of 8 vGPU’s per physical GPU. Two vCPU’s were assigned to each VM.
6) High Availability was enabled on the vSphere cluster.

Note that we intentionally didn’t install an M60 GPU card in the third server of our cluster. This allowed us to test the mechanism that allows the vGPU VM’s to only restart on a host with the correct GPU hardware installed and not attempt to restart the VM’s on a host without a GPU. In all cases this worked flawlessly, the VM’s always restarted on the correct backup host with the M60 GPU installed.

Scenario One

To test the first scenario of a network outage we simply unplugged the network cables from the vGPU VM’s host server. The host showed as not responding in the vSphere client and the VM’s showed as disconnected within a minute of unplugging the cables. After approximately another 30 seconds all the VM’s came back online and restarted normally on the other host in the cluster that contained the M60 GPU card.

Scenario Two

A controlled shutdown down of the host through ESXi worked in a similar manner. The host showed as not responding in the vSphere client and the VM’s showed as disconnected as the ESXi host began shutting down the VSAN services as part of ESXi’s shutdown procedure. Once the host had shut down, the VM’s restarted quite quickly on the backup host. In total, the time from starting the shutdown of the host until the VM’s restarted on the backup host was approximately a minute and a half.

Scenario Three

To simulate a power outage, we simply pulled the power cord of the vGPU VM’s host. After approximately one minute the host showed as not responding and the VM’s showed as disconnected in the vSphere client. Roughly 10 seconds later the vGPU VM’s had all restarted on the back up host.

Scenario Four

Placing the host in maintenance mode worked as expected in that any powered on VMs must first be powered off on that host before completion can occur. Once powered off, the VM’s were then moved to the backup host but not powered on (the ‘Move powered-off and suspended virtual machines to other hosts in the cluster’ option was left ticked when entering maintenance mode). The vGPU-enabled VM’s were moved to the correct host each time we tested this: the host with the M60 GPU card.

In conclusion, the HA failover of vGPU virtual machines worked flawlessly in all cases we tested. In the context of speed of recovery, we made the following observations with regard to the scenarios tested. The unplanned power outage test recovered the quickest at 1 minute 10 seconds followed by the controlled shutdown, with the network outage taking slightly longer again to recover.

Failure Scenario	Approximate Recovery Time
Network Outage	1 min, 30 sec
Controlled Shutdown	1 min, 30 sec
Power Outage	1 min, 10 sec

In a real world scenario the HA failover for vGPU VM’s would of course require an unused GPU card in a backup server for it to work. In most vGPU environments you would probably want to fill the GPU to its maximum capacity of vGPU based VM’s to gain maximum value from your hardware. In order to provide a backup for a fully populated vGPU host, it would mean having a fully unused server and GPU card lying idle until circumstances required it. It is also possible to split the vGPU VM’s between two GPU hosts, placing, for example 50% of the VM’s on each host. In the event of a failure the VM’s will failover to the other host. In this scenario though, each GPU would be used to only 50% of its capacity.

An unused back-up GPU host may be ok in larger or mission critical deployments but may not be very cost effective for smaller deployments. For smaller deployments it would be recommended to leave a resource buffer on each GPU server so that if there was a host failure, there is room on the GPU enabled server to facilitate these vGPU Enabled VM’s.