POSSIBLE DATA CORRUPTION ISSUE, MUST READ! Sorry for starting this way, but this is a big deal, and I wanted to make sure you catch it when scanning through the weekend spam. Apparently, VMware issued a support KB article on this issue a few weeks ago, but it totally flew under the radar (I have not seen a single tweet or blog post about it). In short, any data flowing through the VM network stack may get corrupted, including but not limited to file copies, database interactions with remote clients, and so on.
The scariest part is that the scope of this issue is very significant. In fact, we may well be facing the biggest data corruption issue in the history of virtualization. The issue may occur on any Windows Server 2012 VM with the default (E1000E) vNIC adapter running on ESXi 5.0 or 5.1, which is probably around 20% of all VMs in the world as a conservative estimate. The easiest workaround is to change the vNIC type to VMXNET3 or E1000 (you should be able to apply this change in bulk with a PowerCLI script), or to disable TCP Segmentation Offload in the guest operating system. Keep in mind that changing the vNIC type may result in a change of DHCP address, because the OS will see it as a new adapter, so this may affect some applications. As such, disabling TCP Segmentation Offload might sometimes be a better choice, but keep in mind that this will increase VM CPU usage.
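To get a feel for how many of your VMs are in scope, the three conditions above (guest OS, vNIC type, ESXi version) can be applied as a simple filter over an inventory export. The following is only a rough sketch in Python: the record layout, the field names (`guest_os`, `vnic_type`, `esxi_version`) and the sample VMs are all hypothetical, so adapt them to whatever your own inventory report actually produces.

```python
from typing import Dict, List

# ESXi releases named in this post as affected.
AFFECTED_ESXI = ("5.0", "5.1")


def is_in_scope(vm: Dict[str, str]) -> bool:
    """Return True if the VM matches all three conditions described above.

    The dictionary keys are hypothetical field names, not a real vSphere API.
    """
    return (
        "windows server 2012" in vm["guest_os"].lower()
        and vm["vnic_type"].upper() == "E1000E"
        and vm["esxi_version"].startswith(AFFECTED_ESXI)
    )


def affected_vms(inventory: List[Dict[str, str]]) -> List[str]:
    """Return the names of all VMs that are candidates for the workaround."""
    return [vm["name"] for vm in inventory if is_in_scope(vm)]


# Made-up sample inventory, purely for illustration.
inventory = [
    {"name": "sql01", "guest_os": "Microsoft Windows Server 2012 (64-bit)",
     "vnic_type": "E1000E", "esxi_version": "5.1.0"},
    {"name": "web01", "guest_os": "Microsoft Windows Server 2012 (64-bit)",
     "vnic_type": "VMXNET3", "esxi_version": "5.1.0"},
    {"name": "file01", "guest_os": "Microsoft Windows Server 2008 R2 (64-bit)",
     "vnic_type": "E1000", "esxi_version": "5.0.0"},
]

if __name__ == "__main__":
    for name in affected_vms(inventory):
        print(f"{name}: change vNIC type or disable TCP Segmentation Offload")
```

With the sample data, only `sql01` is flagged: `web01` already uses VMXNET3, and `file01` is neither Windows Server 2012 nor on E1000E.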
As for your backups, even if some of your backup infrastructure components are running in a Windows Server 2012 VM, you should be safe if you are using Veeam Backup & Replication 6.1 or later. This was when we added inline network traffic validation (and remediation) to work around some other data corruption issues involving faulty network equipment. I had a big story about this in a weekly digest over a year ago. Unfortunately, however, your actual production data may be corrupted, and unless you have backups going all the way back to your vSphere 5.x or Windows Server 2012 upgrade, this might be one of those cases of unrecoverable data loss...
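For those wondering what inline validation with remediation means in principle, here is a minimal Python sketch of the general idea: compute a checksum before sending, verify it on the receiving side, and resend the chunk if the two do not match. To be clear, this is not our actual implementation. The function names, the CRC32 choice, the retry limit and the fake "flaky channel" are all assumptions made up for this illustration.

```python
import zlib
from typing import Callable


def send_with_validation(chunk: bytes,
                         channel: Callable[[bytes], bytes],
                         max_retries: int = 5) -> bytes:
    """Push one chunk through `channel`, resending until the CRC32 matches.

    `channel` stands in for the network: it takes the bytes that were sent
    and returns the bytes that arrived on the other side.
    """
    expected = zlib.crc32(chunk)  # checksum computed on the sender side
    for _ in range(max_retries):
        received = channel(chunk)
        if zlib.crc32(received) == expected:  # validation on the receiver side
            return received
        # Mismatch: the data was damaged in flight, so send it again.
    raise IOError(f"chunk still corrupted after {max_retries} attempts")


def make_flaky_channel(corrupt_first_n: int) -> Callable[[bytes], bytes]:
    """Build a fake channel that flips one bit in the first N transfers."""
    state = {"calls": 0}

    def channel(data: bytes) -> bytes:
        state["calls"] += 1
        if state["calls"] <= corrupt_first_n and data:
            damaged = bytearray(data)
            damaged[0] ^= 0x01  # flip the lowest bit of the first byte
            return bytes(damaged)
        return data

    return channel


if __name__ == "__main__":
    payload = b"database page 42"
    result = send_with_validation(payload, make_flaky_channel(2))
    print("delivered intact:", result == payload)
```

The point of the sketch is that corruption introduced between sender and receiver is caught and repaired by the transfer itself. It does nothing for data that was already corrupted before it was read, which is why the production data concern above still stands.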
As per the VMware support KB article, the investigation is still ongoing, so I would not yet jump to the conclusion that this is a VMware bug. For example, we did see one mysterious data corruption issue during weeks of automated stress testing of our Windows Server 2012 support. We called it the “10 bad bits mystery” internally, and it was affecting network transfers on both physical and virtual hardware. Unfortunately, the issue was impossible to reproduce reliably, so our investigation with Microsoft went nowhere (and we have it covered with our network traffic validation anyway). But if anyone from VMware R&D or support is reading this, feel free to reach out to me to discuss the corruption pattern and the factors that help the issue surface, as this could be the same issue.