We are migrating from a traditional SAN environment to vSAN and, as part of the final validation of the new vSAN cluster, are testing various failure scenarios. We have run tests simulating motherboard networking failure, network card failure, and host power loss to ensure HA works correctly.
The management and vSAN networks are isolated and run separately on both our existing SAN environment and the vSAN cluster.
These tests all work fine:
- On both clusters, if a host loses power, HA restarts guest VMs on another host.
- On the existing SAN-backed cluster, when both storage network cables are removed, HA restarts guest VMs on another host, as expected.
- On the existing SAN-backed cluster, when both management network cables are removed, HA restarts guest VMs on another host, as expected.
- On the new cluster, when both vSAN network cables are removed, HA works fine and guest VMs running on the affected host are restarted elsewhere.
However, on the new cluster, when both management network cables are removed, HA does nothing! This appears to be because HA runs over the vSAN network rather than the management network: the failure is noticed, but it does not trigger any isolation response.
We have set up isolation addresses as recommended for vSAN (which uses the vSAN network for HA communication) and configured the following advanced settings:
das.isolationAddress0 10.20.30.254
das.isolationAddress1 10.20.30.253
das.respectVmHostSoftAffinityRules true
das.respectVmVmAntiAffinityRules true
das.useDefaultIsolationAddress false
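One thing worth sanity-checking with this layout is that every isolation address sits on the vSAN network, since that is the network HA pings during an isolation check once HA traffic moves to vSAN. A minimal sketch of that check, assuming a 10.20.30.0/24 vSAN subnet (the subnet value is an illustration, substitute your real ranges):

```python
import ipaddress

# Assumed example vSAN subnet; replace with your real vSAN network range.
VSAN_SUBNET = ipaddress.ip_network("10.20.30.0/24")

# The das.isolationAddressN values configured on the cluster.
isolation_addresses = ["10.20.30.254", "10.20.30.253"]

# With HA traffic on the vSAN network, every isolation address should
# sit on (or be routable from) the vSAN subnet, not the management one.
for addr in isolation_addresses:
    assert ipaddress.ip_address(addr) in VSAN_SUBNET, f"{addr} is not on the vSAN network"
```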
Even enabling guest heartbeat monitoring doesn't cause any HA actions. These are our HA settings:
- Host failure: Restart VMs (restart VMs using VM restart priority ordering)
- Proactive HA: Disabled (Proactive HA is not enabled)
- Host Isolation: Shut down and restart VMs (VMs on isolated hosts will be shut down and restarted on available hosts)
- Datastore with Permanent Device Loss: Power off and restart VMs (datastore protection enabled; always attempt to restart VMs)
- Datastore with All Paths Down: Power off and restart VMs (datastore protection enabled; ensure resources are available before restarting VMs)
- Guest not heartbeating: Reset VMs (VM monitoring enabled; VMs will be reset)
Once the management network has failed, the VM clearly shows as 'disconnected':
Date Time: 05/02/2018, 17:09:42
Type: Information
Target: Failover-TEST-VM
Description: 05/02/2018, 17:09:42 Failover-TEST-VM on host host37.testvsan.com in TSTSITE is disconnected
The host, host37, also shows as disconnected and is recording various failure events:
Date Time: 05/02/2018, 17:09:42
Type: Error
Target: host37.testvsan.com
Description: 05/02/2018, 17:09:42 Host host37.testvsan.com in TSTSITE is not responding
Date Time: 05/02/2018, 17:09:43
Type: Error
Target: host37.testvsan.com
Description: 05/02/2018, 17:09:43 Alarm 'Host connection failure' on host37.testvsan.com triggered by event 819306 'Host host37.testvsan.com in TSTSITE is not responding'
05/02/2018, 17:19:36 Cannot synchronize host host37.testvsan.com.
Event Type Description: Failed to sync with the vCenter Agent on the host
Description: 05/02/2018, 17:23:45 Cannot synchronize host host37.testvsan.com
Event Type Description: Failed to sync with the vCenter Agent on the host
However, despite this clear issue, the HA State is showing as fine:
vSphere HA State
Status
Connected (Slave)
Description
The vSphere HA Agent on the host is connected to a vSphere HA Master Agent over the management network.
This state is the normal operating state for agents that are not the vSphere HA Master Agent.
The vSphere HA protected VMs on this host are monitored by one or more vSphere HA Master Agents, and the agents will attempt to restart the VMs after a failure.
I can see why changes were made to ensure HA and vSAN 'see' the same network partitions, but it seems that in moving from traditional SAN to vSAN we have less resilience than before, because two totally separate networks must now fail before we get an automatic HA failover.
On the non-vSAN system, the loss of management/guest networking leaves the VMs useless but quickly triggers an isolation response.
But on the vSAN cluster, the total loss of management/guest networking does not trigger any action by VMware HA; it requires manual intervention to migrate totally isolated VMs onto another host.
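To make the asymmetry concrete, here is a deliberately simplified conceptual model of the isolation decision (all names are hypothetical; this is not VMware's implementation): HA declares a host isolated only when the network carrying its own heartbeats and isolation-address pings is down, so with HA on the vSAN network a management-only failure never trips the response.

```python
# Hypothetical, simplified model of the vSphere HA isolation decision.
# Illustrative only; not VMware's actual implementation.

def isolation_response_triggered(ha_network: str, failed_networks: set) -> bool:
    """A host is declared isolated only when the network HA itself
    uses for heartbeats and isolation-address pings has failed."""
    return ha_network in failed_networks

# Traditional SAN cluster: HA heartbeats ride the management network,
# so pulling both management cables triggers the isolation response.
assert isolation_response_triggered("management", {"management"})

# vSAN cluster: HA heartbeats ride the vSAN network, so a
# management-only failure is noticed but never declared as isolation.
assert not isolation_response_triggered("vsan", {"management"})

# Only a vSAN-network failure trips the response on the new cluster.
assert isolation_response_triggered("vsan", {"vsan", "management"})
```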
Am I misunderstanding this, or have we lost the ability to cope with management/guest networking issues by moving to vSAN?