Over the last few days I’ve been having an issue with a clustered pair of ESXi 5.0 servers. Everything appeared to be working normally, then one day vMotion stopped working and the progress would hang at 9%. Furthermore, ESXi itself would then stop responding to vCenter requests.
Eventually the vMotion would time out and the host would come back online (or you can kick it by logging in via SSH and restarting the hostd service). Looking at the events, I could see that the host had lost contact with the iSCSI storage during the hang.
The problem was intermittent: occasionally it would work and then break again, which made it harder to diagnose.
The problem in the end was caused by a mismatch in the MTU sizes on the iSCSI uplinks.
I had created a vSwitch for iSCSI and added two port groups, then assigned one uplink per port group. Unfortunately one of the port groups was set to MTU = 1500 and the other to MTU = 9000. The same misconfiguration was present on both ESXi servers.
Once I set all the MTUs to 9000, the problem went away.
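
A check along the following lines might have caught the mismatch sooner. This is only a rough sketch in Python, not something I ran at the time: it assumes you have saved the output of `esxcli network ip interface list` from the host, and that the output consists of indented `Key: Value` lines including `Name`, `Portset` and `MTU`. The function names, the sample text, and the interface and port group names in it (vmk1, vmk2, iSCSI-A, iSCSI-B) are made up for illustration.

```python
from collections import defaultdict

# Illustrative sample only: an assumed, trimmed-down shape of the output of
# `esxcli network ip interface list`. Names and values are hypothetical.
SAMPLE = """\
vmk1
   Name: vmk1
   Portset: vSwitch1
   Portgroup: iSCSI-A
   MTU: 1500
vmk2
   Name: vmk2
   Portset: vSwitch1
   Portgroup: iSCSI-B
   MTU: 9000
"""


def parse_interfaces(text):
    """Collect 'Key: Value' lines into one dict per interface.

    A new interface record starts at every 'Name:' line; lines without
    a colon (such as the bare 'vmk1' header) are ignored.
    """
    interfaces = []
    current = None
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, _, value = line.strip().partition(":")
        key, value = key.strip(), value.strip()
        if key == "Name":
            current = {}
            interfaces.append(current)
        if current is not None:
            current[key] = value
    return interfaces


def mtu_mismatches(interfaces):
    """Return {vswitch: {vmk: mtu}} for every vSwitch whose VMkernel
    interfaces do not all share the same MTU."""
    by_switch = defaultdict(dict)
    for nic in interfaces:
        if "Name" not in nic or "MTU" not in nic:
            continue  # skip records we cannot interpret
        by_switch[nic.get("Portset", "unknown")][nic["Name"]] = int(nic["MTU"])
    return {
        switch: nics
        for switch, nics in by_switch.items()
        if len(set(nics.values())) > 1
    }


if __name__ == "__main__":
    for switch, nics in mtu_mismatches(parse_interfaces(SAMPLE)).items():
        print(f"MTU mismatch on {switch}: {nics}")
```

This only compares the VMkernel interfaces on each vSwitch. The vSwitch itself and the physical switch ports and storage array also need to agree on the MTU, so treat a clean result as a starting point rather than proof that jumbo frames work end to end.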