We have a small two-host vSphere HA cluster running 5.1. We have several vSwitches (management, VM Network, iSCSI), each with two pNICs assigned, and the pNICs for each vSwitch are spread across different physical cards in the host.
Admission Control is disabled and the Host Isolation response is set to "Leave Powered On".
Yesterday it seems one of the physical cards failed, which left each vSwitch without uplink redundancy.
When this happened, every VM on the affected host logged an event along the lines of:
20/04/2013 16:06:56, vSphere HA unsuccessfully failed over vmname on vsphere1.domain.co.uk in cluster vsphere cluster in companyname. vSphere HA will retry if the maximum number of attempts has not been exceeded. Reason: The operation is not allowed in the current state.
I can't determine this entirely from the logged events, but what I think happened is:
- A physical NIC card failed
- The vSwitches lost redundancy
- The management network and datastore networks presumably lost some sort of heartbeat, which caused HA to try to kick in
- The iSCSI vSwitch was still functional (MPIO failover) and the datastores were accessible, so the VMs kept running on the host. HA on the other host therefore couldn't fail them over, because they were still running and held locks on their VM files.
Does this sound like a reasonable interpretation of events?
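
To make the sequence I'm guessing at concrete, here is a toy Python model of it. It only encodes my assumptions about what happened; the class, field names and return strings are made up for illustration, and this is not how vSphere HA is actually implemented.

```python
from dataclasses import dataclass


@dataclass
class HostState:
    """Hypothetical snapshot of the affected host (names are illustrative)."""
    mgmt_heartbeat_ok: bool    # assumption: the other host still sees heartbeats
    datastore_reachable: bool  # assumption: iSCSI paths still up via MPIO
    vms_running: bool          # VMs still powered on locally


def ha_restart_outcome(host: HostState) -> str:
    """Toy model of my hypothesis only, not VMware's real HA logic."""
    if host.mgmt_heartbeat_ok:
        # Nothing looks wrong from the other host's point of view.
        return "no failover attempted"
    # Heartbeat lost: the other host tries to restart the VMs.
    if host.vms_running and host.datastore_reachable:
        # Running VMs still hold locks on their files, so the restart fails.
        return "failover attempted, failed: VM files locked"
    return "failover attempted, succeeded"


# The situation I think we were in: heartbeat lost, storage fine, VMs still up.
print(ha_restart_outcome(HostState(False, True, True)))
```

If that model is roughly right, the "unsuccessfully failed over" events are what I would expect to see with the isolation response set to "Leave Powered On".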