Hello,
We had a system with 1 ESXi 5.1 host with local disks.
We are now building in redundancy by adding an ESXi 5.5 U2 host and a vCenter 5.5 appliance.
After installing and adding everything to vCenter, we upgraded the ESXi 5.1 host to ESXi 5.5 U2. The SAN is operating correctly (vMotion is working on a separate NIC).
Now, if I try to enable High Availability, both servers will install the HA Agent, and start "Election".
All datastores (4) on the SAN are chosen for the HA heartbeat, and the isolation response is left at the default, "Leave powered on".
One server always completes this process, while the other keeps "electing" until the task reaches 100% and then fails with the error "Operation timed out".
I have seen this problem on both servers, so I think the elected "master" does not have the problem, only the "slave".
I have checked these articles and followed their steps, but none worked:
VMware KB: Reconfiguring HA (FDM) on a cluster fails with the error: Operation timed out
- The services were running
VMware KB: Configuring HA in VMware vCenter Server 5.x fails with the error: Operation Timed out
- All MTUs were set to 1500
VMware KB: Configuring VMware High Availability fails with the error: Cannot complete the configuration of the HA ag…
- The default gateway was not the same on both hosts, but I corrected this. There are no additional routes. The HA setting is "Leave powered on". After correcting this and disabling/re-enabling HA, the problem is still the same.
VMware KB: Verifying and reinstalling the correct version of the VMware vCenter Server agents
- I executed "Reinstalling the ESX host management agents and HA agents on ESXi" for the HA Agent, and I verified that it was uninstalled and reinstalled when reenabling HA.
cp /opt/vmware/uninstallers/VMware-fdm-uninstall.sh /tmp
chmod +x /tmp/VMware-fdm-uninstall.sh
/tmp/VMware-fdm-uninstall.sh
I did this for both hosts. This actually fixed the election problem, and I was even able to run an HA test successfully. But when I powered down the 2nd server after this test (to test HA in the other direction), HA did not fail over to the 1st server and everything remained down. After I clicked "Reconfigure HA", the election problem appeared again on one of the hosts.
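For anyone retracing the KB checks listed above (MTU, default gateway, HA agent state), this is roughly how they can be repeated from the ESXi shell on each host. It is only a sketch: the function name check_ha_prereqs and the peer address argument are placeholders, not something taken from the KB articles, and the commands only print information without changing anything.

```shell
#!/bin/sh
# Read-only checks for the HA prerequisites discussed above.
# check_ha_prereqs is a hypothetical helper name; its argument is the
# management IP of the OTHER host in the cluster (pass your own).

check_ha_prereqs() {
    peer="$1"

    echo "== VMkernel interfaces (look at the MTU value) =="
    esxcli network ip interface list

    echo "== VMkernel routes, including the default gateway =="
    esxcfg-route -l

    echo "== Firewall ruleset used by the HA agent (fdm) =="
    esxcli network firewall ruleset list | grep -i fdm

    echo "== Path test to the peer with don't-fragment set =="
    # 1472 bytes of payload + 8 (ICMP) + 20 (IP) = 1500-byte packet
    vmkping -d -s 1472 -c 3 "$peer"

    echo "== Last lines of the HA agent log =="
    tail -n 40 /var/log/fdm.log
}

# Only run the checks where the ESXi tools are actually available.
if command -v esxcli >/dev/null 2>&1; then
    check_ha_prereqs "${1:?usage: $0 <peer-management-ip>}"
else
    echo "esxcli not found: run this in the ESXi shell" >&2
fi
```

Running it on both hosts and comparing the output side by side should show any remaining difference in MTU or gateway; the fdm.log tail is where election messages would be expected to appear.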
These are some extracts from the event log (newest first):
-The vSphere HA availability state of this host has changed to Election info 11/29/2014 10:03:00 PM 192.27.224.138
-vSphere HA agent is healthy info 11/29/2014 10:02:56 PM 192.27.224.138
-The vSphere HA availability state of this host has changed to Master info 11/29/2014 10:02:56 PM 192.27.224.138
-The vSphere HA availability state of this host has changed to Election info 11/29/2014 10:01:26 PM 192.27.224.138
-vSphere HA agent is healthy info 11/29/2014 10:01:22 PM 192.27.224.138
-The vSphere HA availability state of this host has changed to Master info 11/29/2014 10:01:22 PM 192.27.224.138
-The vSphere HA availability state of this host has changed to Election info 11/29/2014 10:03:02 PM 192.27.224.139
-Alarm 'vSphere HA host status' on 192.27.224.139 changed from Green to Red info 11/29/2014 10:02:58 PM 192.27.224.139
-vSphere HA agent for this host has an error: vSphere HA agent cannot be correctly installed or configured warning 11/29/2014 10:02:58 PM 192.27.224.139
-The vSphere HA availability state of this host has changed to Initialization Error info 11/29/2014 10:02:58 PM 192.27.224.139
-The vSphere HA availability state of this host has changed to Election info 11/29/2014 10:00:52 PM 192.27.224.139
-Datastore DSMD3400DG2VD2 is selected for storage heartbeating monitored by the vSphere HA agent on this host info 11/29/2014 10:00:49 PM 192.27.224.139
-Datastore DSMD3400DG2VD1 is selected for storage heartbeating monitored by the vSphere HA agent on this host info 11/29/2014 10:00:49 PM 192.27.224.139
-Firewall configuration has changed. Operation 'enable' for rule set fdm succeeded. info 11/29/2014 10:00:45 PM 192.27.224.139
-The vSphere HA availability state of this host has changed to Uninitialized info 11/29/2014 10:00:40 PM Reconfigure vSphere HA host 192.27.224.139 root
-vSphere HA agent on this host is disabled info 11/29/2014 10:00:40 PM Reconfigure vSphere HA host 192.27.224.139 root
-Reconfigure vSphere HA host 192.27.224.139 Operation timed out. root HOSTSERVER01 11/29/2014 10:00:31 PM 11/29/2014 10:00:31 PM 11/29/2014 10:02:51 PM
-Configuring vSphere HA 192.27.224.139 Operation timed out. System HOSTSERVER01 11/29/2014 9:56:42 PM 11/29/2014 9:56:42 PM 11/29/2014 9:58:55 PM
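To make the pattern in these events easier to see, the state changes can be tallied per host. A minimal sketch, assuming the event lines are saved to a text file in exactly the format shown above; the file name ha_events.txt and the function name summarize_ha_states are made up for illustration.

```shell
#!/bin/sh
# Hypothetical file holding the event lines pasted from vCenter, one per line.
EVENTS="${1:-ha_events.txt}"

# Print "<host> <state> <count>" for every HA availability state change.
summarize_ha_states() {
    awk '
        /availability state of this host has changed to/ {
            # State is the text between "changed to " and " info".
            state = $0
            sub(/.*has changed to /, "", state)
            sub(/ info .*/, "", state)

            # Host is the last field that looks like an IPv4 address
            # (some lines end with a user name such as "root").
            host = ""
            for (i = 1; i <= NF; i++)
                if ($i ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/) host = $i

            count[host " " state]++
        }
        END { for (k in count) print k, count[k] }
    ' "$1" | sort
}

# Only run when the input file exists, so the script is safe to source.
if [ -r "$EVENTS" ]; then
    summarize_ha_states "$EVENTS"
fi
```

On the events above this would list Master and Election twice each for 192.27.224.138, which makes it visible that the host that wins the election also keeps dropping back to Election, while 192.27.224.139 goes through Uninitialized, Election, and Initialization Error.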
Can someone please provide me with some help here?
Or extra things I can check or provide?
I am running out of options currently.
Best Regards,
Joris
P.S. I had problems with Cold Migration when implementing the SAN. After setting up everything (vMotion, upgrading ESX), these problems were gone.
When searching for this error, I came across this article: VMware KB: VMware vCenter Server displays the error: Failed to connect to host
That cause could make sense, since the vCenter server changed and the IP addressing was changed during implementation.
However, in the vpxa.cfg files, the <hostip> and <serverip> values are correct (checked using https://<hostip>/host).
Tried this again today, no problem at all.
P.P.S. I have configured more of these systems from scratch in the past with no problem (though this is an 'upgrade').