In one of my projects I ran into a serious problem with a vSphere environment. The issue occurred in the following situation:
First, the VCSA server hit a low disk space condition and suddenly crashed. After increasing the size of its VMDK files and fixing that first problem, I saw that one of the ESXi hosts belonging to the cluster was unreachable: it showed as disconnected and vCenter could not connect to it, although both systems were reachable from my client machine. Over SSH I confirmed the ESXi host itself was accessible, yet vCenter could not connect to this particular host.
All network parameters, storage zoning settings, time settings, and service configurations were identical on every host. Sadly, syslog had not been configured and we did not have access to the scratch logs for the period in which the issue occurred (I don't know why). Attempting to restart all of the host's management agents hung: the services.sh restart process got stuck and nothing really happened, and restarting vpxa and hostd individually didn't fix the issue either.
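For reference, these are the commands I mean by "restarting the management agents", run over SSH on the ESXi host. This is only a minimal sketch; the set of services can vary slightly between ESXi versions:

    # Restart all ESXi management agents (this is the step that hung in my case)
    services.sh restart

    # Or restart only the host agent and the vCenter agent
    /etc/init.d/hostd restart
    /etc/init.d/vpxa restart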
There was only one error on the Summary tab of the disconnected host, saying that vSphere HA was not configured and suggesting that the host be removed and added back to vCenter. But I couldn't reconnect it. My only guess is that the problem was related to the startup sequence of the ESXi hosts and storage systems, because the tech support team had restarted some of them after running into the problem, so HA automatically tried to migrate the VMs of the offline host to other online hosts; this is the moment I would call a "complex disaster". Stuck, I decided to disable HA and DRS in the cluster settings, but nothing changed and the problem persisted. After fixing the VCSA problem I knew that restarting that host might solve the second problem, but because of a running VM operation we couldn't do it. Migration did not work either, and we were confused.
Then I tried to shut down some non-essential VMs belonging to the disconnected host. After releasing some CPU/RAM resources, this time the management agent restart (the services.sh restart operation) completed successfully.
After that, VCSA could connect to the problematic ESXi host again and the problem was gone for good!
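For anyone in the same situation, this is roughly how those VMs can be shut down from the ESXi shell when vCenter cannot manage the host. The VM ID 42 below is just an example; use the IDs returned by the first command:

    # List all registered VMs and their IDs
    vim-cmd vmsvc/getallvms

    # Try a clean guest OS shutdown first (needs VMware Tools), e.g. for VM ID 42
    vim-cmd vmsvc/power.shutdown 42

    # If the guest does not respond, power the VM off
    vim-cmd vmsvc/power.off 42

    # Then retry the management agent restart
    services.sh restart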
Afterwards, I wrote a procedure for that company's IT department as a Virtualization Checklist:
1. Pay attention to the logs of your virtual infrastructure assets. Don't forget to keep them locally in a safe repository and also on a syslog server (a command sketch for items 1 to 3 follows this checklist).
2. Always monitor the used and free CPU/memory resources of the cluster. Never exceed their thresholds, because a single host failure may cause consecutive failures.
3. Check the status of the virtual infrastructure management services, including vCenter Server and NSX Manager, and also their disk usage. Run "df -h" in the CLI or check the status of their VMDKs in the GUI. (I explained how to do this in this post.)
4. In critical situations, or even planned maintenance, always shut down your ESXi hosts first and then the storage systems; when bringing the environment back up, start the storage first and then the hosts.
5. Finally, please DO NOT disconnect the VCSA's vNIC from its associated port group if it is part of a Distributed vSwitch. They did, and it made me suffer a lot to reconnect the VCSA. Even if you restore a new backup of the VCSA, don't remove the network connectivity of the failed VCSA until the problem is solved.
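As a starting point for items 1 to 3, these are the kinds of checks I had in mind. The syslog address is only a placeholder, and the exact output differs between an ESXi host and the VCSA shell:

    # Item 1: point each ESXi host at a central syslog server (placeholder address)
    esxcli system syslog config set --loghost='udp://syslog.example.local:514'
    esxcli system syslog reload
    esxcli network firewall ruleset set --ruleset-id=syslog --enabled=true

    # Item 2: check physical memory and running processes on a host (or use esxtop interactively)
    esxcli hardware memory get
    esxcli system process list

    # Item 3: check disk usage from the VCSA appliance shell
    df -h
    # ...and datastore/ramdisk usage on an ESXi host
    df -h
    vdf -h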
Link to my personal blog: Undercity of Virtualization: An Example of Importance of Management and Controlling Virtual Infrastructure Resources