Hi Everyone.
I ran into a strange issue during my last installation project for a customer who wanted to migrate from their running HP installation to Dell. Here are the details of the environment:
HP BladeSystem c7000 chassis with networking and SAN switches, Gen 7-9 blade servers.
Connected to an HP 5000-series core switch pair.
10 Gbit network.
The target configuration looked like this:
A Dell FX2 chassis with FN410S network switches connected to the same HP cores via an LACP channel.
Four FC630 servers with Intel X520 network adapters.
VMware Environment:
ESXi 6.0 U3 installed from the Dell and HP ISOs.
Dedicated VMkernel ports for vMotion and management, running in different subnets and VLANs.
What happened:
A vMotion from an HP server to one of the new Dell servers finished successfully without any problems.
A vMotion from one Dell server to another, or even back to an HP server, froze at 57%. After a couple of seconds, the server disconnected from the vCenter view, the vMotion process was shown as failed, and a ping to the management interface of the ESXi host no longer got a response. The vMotion interface and the VMs were still pingable. A restart of the management network via the KVM console brought the interface back instantly, and vCenter then showed that the vMotion task had actually finished successfully and the VM was running on the new host. If you restart the management network as soon as it fails, the vMotion task shown in the vSphere Client finishes successfully. The interface goes down as soon as you try to do anything related to the management network, such as stopping or restarting a guest OS. Each time we did something like that, the interface failed.
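As a side note, the management network can also be bounced from an SSH or ESXi Shell session instead of the DCUI "Restart Management Network" option. This is a sketch only, assuming the management VMkernel port is vmk0 (check your environment first):

```shell
# List VMkernel interfaces to confirm which one carries management traffic
esxcli network ip interface list

# Disable and re-enable the management VMkernel interface
# (vmk0 is an assumption; substitute your management interface)
esxcli network ip interface set -e false -i vmk0
esxcli network ip interface set -e true -i vmk0
```

Of course, if the management interface is the one that is down, you need console (KVM/DCUI) access rather than SSH.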
Solution:
After a lot of support calls with Dell and VMware, checking HCLs, the installation environment, the network channel configuration and so on, the final statement from support was: it should not do this... :-)
The only thing we found was that the driver for the Intel network cards in the Dell servers was ixgben version 1.4.1. That version is listed in the HCL, but a version 4.5.3 is available as well, which uses the ixgbe driver (no trailing "n"). So, as a final desperate measure after two days of intensive testing and no new ideas, we installed that driver and rebooted the system. The server still kept using the 1.4.1 driver, so we uninstalled it manually: esxcli software vib remove -n ixgben
The next reboot brought the network up with the new 4.5.3 driver, and from that point on everything has been running fine: vMotion, VM tasks, no problems anymore. We can migrate in any direction and do everything we want.
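For anyone who wants to check which driver their NICs are actually using before and after the swap, this is a rough sketch of the relevant commands; vmnic0 is an assumption, use the uplink name from your own list:

```shell
# Show all physical NICs with the driver each one is bound to
esxcli network nic list

# Show driver name and version for a specific uplink
# (vmnic0 is an assumption; take the name from the output above)
esxcli network nic get -n vmnic0

# List the installed ixgbe/ixgben driver packages
esxcli software vib list | grep -i ixgbe

# Remove the native ixgben driver so ESXi falls back to ixgbe
# after the next reboot (this is the command from the text)
esxcli software vib remove -n ixgben
```

After the `vib remove` a reboot is required for the host to load the ixgbe driver.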
I could not find anything related to this problem, so I wrote this blog entry for anyone with a similar installation running into the same issue. As I said: both drivers are in the HCL, and they are included in the vendor ISOs linked on the support page, but in this setup the ixgben 1.4.1 driver does not work.
Regards
Terence