I'm an independent consultant, working with one of my clients. We are in the process of upgrading our vSphere farm from 4.1 to 5.1 and have run into a performance problem, as follows:
The servers are Dell R710s with dual QLA8152 CNA cards in them (2 ports each). We have two Cisco Nexus 5020 switches, and each CNA card has one port plugged into each Nexus (crossing ports and Nexuses).
The farm is configured with a vCenter 4.1 Distributed Switch with 4 uplinks (one per CNA port). Long ago, we enabled jumbo frames on the switches, ports, and Nexuses.
At this point, I've upgraded vCenter to 5.1, but have not upgraded the vDS (it is still at 4.1). I've upgraded two of the hosts to ESXi 5.1 and patched them. I am running the latest firmware and drivers on everything.
My Nexuses are not running the current NX-OS release, but are on 4.2(1)NT(1a), and I noticed last week that the 'show interface ethernet x/x' command shows the MTU as 1500. That got me concerned, until I found information that this is 'just a cosmetic' issue and that the displayed value does not reflect the MTU actually in effect.
Fast forward to today: after upgrading the second server, I was running some basic vMotion tests on non-critical VMs. A single vMotion works every time, but when I attempt to vMotion 5 VMs at once, I get very inconsistent results: sometimes one fails to migrate, and sometimes all of them fail.
The vmkernel logs showed that the connection between the hosts was failing. (Note: I didn't see any notifications of physical failures on the Nexuses.)
After multiple attempts to resolve the problem, I got a different failure as documented by http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2036890
I went in and set the vMotion MTU on all servers to 1500, and the problem appears to have disappeared.
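One way to check whether jumbo frames really pass end to end between the vMotion vmkernel ports is a ping with the don't-fragment bit set and a payload sized to fill one frame exactly. The sketch below only does that arithmetic and prints the matching vmkping command. It is an illustration, not something taken from the setup above: it assumes IPv4 with no IP options and a vmkernel MTU of 9000, the helper names are made up, and 192.0.2.10 is a placeholder for the other host's vMotion address.

```python
# Sketch: largest ICMP echo payload that fits in a single frame at a given MTU,
# plus the vmkping command that tests it with fragmentation disabled.
# Assumptions: IPv4 without options; the target address is a placeholder.

IP_HEADER_BYTES = 20    # IPv4 header with no options
ICMP_HEADER_BYTES = 8   # ICMP echo request header


def max_icmp_payload(mtu: int) -> int:
    """Return the biggest ICMP payload that fits in one packet of size `mtu`."""
    return mtu - IP_HEADER_BYTES - ICMP_HEADER_BYTES


def vmkping_command(mtu: int, target_ip: str) -> str:
    """Build a vmkping command: -d sets don't-fragment, -s sets the payload size."""
    return f"vmkping -d -s {max_icmp_payload(mtu)} {target_ip}"


if __name__ == "__main__":
    # Compare a standard-MTU test with a jumbo-frame test.
    for mtu in (1500, 9000):
        print(vmkping_command(mtu, "192.0.2.10"))
```

If the 1472-byte test succeeds but the 8972-byte test fails, something in the path is probably not passing jumbo frames, whatever the interface counters display.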
So, at this point, I'm left wondering:
1) Does ESXi 5.1 behave differently than ESX 4.1 when connected to a Nexus that is reporting an MTU of 1500?
2) Should I upgrade NX-OS to the current version?
Any thoughts would be appreciated.