In this series of posts we will learn how vSphere functionality has changed between the 4.x and 5.x versions. I will not point out the configuration maximums that have changed between the two versions of vSphere, as there are many posts available on the internet covering those. Instead, I will try to touch on some of the features that have been changed or enhanced in the newer version of vSphere.
This is the first part of the series, and I will focus primarily on how HA functionality has changed over the years.
Change in HA agent and its functionality
In vSphere 5.x the HA agent is named FDM (Fault Domain Manager). FDM replaced what was once known as AAM (Automated Availability Manager), the HA agent prior to vSphere 5.x.
FDM is one of the most important agents on an ESXi host. Unlike AAM, FDM is a single-process agent. However, FDM spawns a watchdog process: in the unlikely event of an agent failure, the watchdog picks this up and restarts the agent, ensuring HA functionality continues without anyone ever noticing it failed.
Also, when an ESXi host is added to an HA-enabled cluster, vCenter Server is responsible for pushing the HA agent out to the ESXi host. Prior to vSphere 5.0, these agents were pushed out in a serial fashion. With vSphere 5.0, this is done in parallel, allowing for faster deployment and configuration of multiple hosts in a cluster.
HA dependency on DNS has been removed
As of vSphere 5.0, HA is no longer dependent on DNS, as it works with IP addresses only. Also, the character limit that HA imposed on the hostname has been lifted (pre-vSphere 5.0, FQDNs were limited to 26 characters).
Note: This does not mean that ESXi hosts need to be registered with their IP addresses in vCenter; it is still a best practice to register ESXi hosts by FQDN in vCenter.
Changes in Logging Method
Prior to vSphere 5.0, the HA log files were not sent to syslog. vSphere 5.0 brings a standardized logging mechanism where a single log file holds all operational log messages; it is called fdm.log and is stored under /var/log/.
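Because fdm.log now flows through the standard ESXi syslog facility, its entries can be forwarded to a central log server simply by pointing each host's syslog there. Below is a minimal pyVmomi sketch of that idea; the vCenter address, credentials, cluster name, and syslog target are all placeholder assumptions, not values from this post:

from pyVim.connect import SmartConnect
from pyVmomi import vim
import ssl

# Placeholder vCenter, credentials and cluster name - adjust for your environment.
si = SmartConnect(host='vcenter.example.com', user='administrator@vsphere.local',
                  pwd='changeme',
                  sslContext=ssl._create_unverified_context())  # lab only: skips cert checks
content = si.RetrieveContent()
cv = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.ClusterComputeResource], True)
cluster = next(c for c in cv.view if c.name == 'Cluster01')

# Forward each host's syslog (which fdm.log feeds into) to a remote collector.
for host in cluster.host:
    host.configManager.advancedOption.UpdateOptions(changedValue=[
        vim.option.OptionValue(key='Syslog.global.logHost',
                               value='udp://syslog.example.com:514')])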
Datastore Heartbeating is introduced
Prior to vSphere 5.0, virtual machine restarts were always attempted, even if only the heartbeat network was isolated and the virtual machines were still running on the host.
This has been mitigated by the introduction of the datastore heartbeating mechanism. Datastore heartbeating adds a new level of resiliency and prevents unnecessary restart attempts from occurring.
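The datastores HA uses for heartbeating can be influenced per cluster. As a rough illustration, the pyVmomi sketch below (same placeholder vCenter, credentials, and cluster name as in the earlier sketch, plus hypothetical datastore names SharedDS01 and SharedDS02) tells HA to prefer two specific datastores while still allowing it to fall back to any feasible one:

from pyVim.connect import SmartConnect
from pyVmomi import vim
import ssl

si = SmartConnect(host='vcenter.example.com', user='administrator@vsphere.local',
                  pwd='changeme', sslContext=ssl._create_unverified_context())
content = si.RetrieveContent()
cv = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.ClusterComputeResource], True)
cluster = next(c for c in cv.view if c.name == 'Cluster01')
ds_view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.Datastore], True)

spec = vim.cluster.ConfigSpecEx()
spec.dasConfig = vim.cluster.DasConfigInfo()
# Prefer the named datastores, but let HA choose others if they become unavailable.
spec.dasConfig.hBDatastoreCandidatePolicy = 'allFeasibleDsWithUserPreference'
spec.dasConfig.heartbeatDatastore = [d for d in ds_view.view
                                     if d.name in ('SharedDS01', 'SharedDS02')]
cluster.ReconfigureComputeResource_Task(spec=spec, modify=True)

By default HA selects two heartbeat datastores per host; the advanced option das.heartbeatDsPerHost can be used to increase this number.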
Primary/Secondary node concept has been removed
Prior to vSphere 5.0, HA used the concept of primary/secondary nodes when an ESXi host was added to an HA-enabled cluster. A cluster could contain a maximum of 32 ESXi hosts, of which at most 5 could be primary nodes (usually the first 5 hosts that joined the cluster); the remaining hosts would be secondary nodes.
Primary nodes hold the cluster settings and all “node states”, which are synchronized between primaries; the node states include host resource usage information. If vCenter is unavailable, the primary nodes therefore have a rough estimate of resource usage and can take this into account when a failover needs to occur. Secondary nodes send their state information to the primary nodes.
Nodes send heartbeats to each other, which is the mechanism used to detect possible outages. Primary nodes send heartbeats to both primary and secondary nodes; secondary nodes send their heartbeats to primary nodes only.
In vSphere 5.0 the master/slave concept was introduced and the primary/secondary concept was removed. In a cluster there is one master (more than one master can exist in the case of a network partition; the default is 1) and the remaining nodes are slaves.
Once a master node is elected, slave nodes are restricted from talking to each other (unless a re-election of the master is required); slave nodes talk only to the master node.
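The role a host currently holds is visible through the vSphere API. Here is a minimal pyVmomi sketch (same placeholder vCenter, credentials, and cluster name as in the earlier sketches) that prints each host's FDM state:

from pyVim.connect import SmartConnect
from pyVmomi import vim
import ssl

si = SmartConnect(host='vcenter.example.com', user='administrator@vsphere.local',
                  pwd='changeme', sslContext=ssl._create_unverified_context())
content = si.RetrieveContent()
cv = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.ClusterComputeResource], True)
cluster = next(c for c in cv.view if c.name == 'Cluster01')

# dasHostState is only populated on HA-enabled hosts; typical values are
# 'master' for the master node and 'connectedToMaster' for slaves.
for host in cluster.host:
    fdm = host.runtime.dasHostState
    print(host.name, fdm.state if fdm else 'unknown')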
Change in Virtual Machine Protection
The way virtual machines are protected has changed substantially in vSphere 5.0. Prior to vSphere 5.0, virtual machine protection was handled by vpxd, which notified AAM through a VPXA module called vmap. With vSphere 5.0, virtual machine protection happens on several layers but is ultimately the responsibility of vCenter.
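Whether HA currently considers a given VM protected can also be read from the API. A minimal pyVmomi sketch, reusing the placeholder connection details from the earlier sketches:

from pyVim.connect import SmartConnect
from pyVmomi import vim
import ssl

si = SmartConnect(host='vcenter.example.com', user='administrator@vsphere.local',
                  pwd='changeme', sslContext=ssl._create_unverified_context())
content = si.RetrieveContent()
vms = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.VirtualMachine], True)

# dasVmProtection is only populated once HA has reported a protection state.
for vm in vms.view:
    prot = vm.runtime.dasVmProtection
    print(vm.name,
          'protected' if (prot and prot.dasProtected) else 'not protected/unknown')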
Change in VM restart attempts
Prior to vSphere 5.0, the maximum number of restart retries that could be attempted for a VM was 6. In vSphere 5.x this has changed, and the maximum number of restart attempts is limited to 5, including the initial restart attempt.
Changes in Isolation Response
When an ESXi host is isolated, HA looks at the isolation response setting and triggers whatever the user has selected. In vSphere 5.0, when a slave ESXi host is isolated, HA waits for 30 seconds before triggering the isolation response.
Prior to vSphere 5.0, this wait time could be configured using the advanced setting “das.failuredetectiontime”. As of vSphere 5.0, it is no longer possible to configure this advanced setting.
In vSphere 5.1 “das.config.fdm.isolationPolicyDelaySec” was introduced; it is an advanced setting which allows changing the number of seconds to wait before the isolation policy is executed. By default, the wait time for triggering the isolation response is 30 seconds in vSphere 5.1.
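This advanced setting is applied at the cluster level. As a rough sketch of how it could be set programmatically with pyVmomi (placeholder vCenter, credentials, and cluster name again; the 60-second value is just an example):

from pyVim.connect import SmartConnect
from pyVmomi import vim
import ssl

si = SmartConnect(host='vcenter.example.com', user='administrator@vsphere.local',
                  pwd='changeme', sslContext=ssl._create_unverified_context())
content = si.RetrieveContent()
cv = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.ClusterComputeResource], True)
cluster = next(c for c in cv.view if c.name == 'Cluster01')

spec = vim.cluster.ConfigSpecEx()
spec.dasConfig = vim.cluster.DasConfigInfo()
# Wait 60 seconds (instead of the default 30) before executing the isolation response.
spec.dasConfig.option = [vim.option.OptionValue(
    key='das.config.fdm.isolationPolicyDelaySec', value='60')]
cluster.ReconfigureComputeResource_Task(spec=spec, modify=True)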
Change in HA & DPM working together
With vSphere 4.1 and prior, disabling Admission Control while DPM was enabled could have a serious impact on availability: DPM could place all hosts except one in standby mode to reduce total power consumption, which could lead to issues in the event that this single host failed.
As of vSphere 5.0, this behavior has changed: when DPM is enabled, HA will ensure that there are always at least two hosts powered up for failover purposes.
Enhancement in Admission Control Policy
In vSphere 5.0 an enhancement was made to the “Host Failures the Cluster Tolerates” Admission Control Policy. Pre-vSphere 5.0, the maximum number of host failures that could be tolerated was 4, but with vSphere 5.0 we can specify up to 31 host failures.
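This policy maps to the failover level in the cluster's HA configuration. A minimal pyVmomi sketch, with the same placeholder connection details as before and an example failover level of 2:

from pyVim.connect import SmartConnect
from pyVmomi import vim
import ssl

si = SmartConnect(host='vcenter.example.com', user='administrator@vsphere.local',
                  pwd='changeme', sslContext=ssl._create_unverified_context())
content = si.RetrieveContent()
cv = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.ClusterComputeResource], True)
cluster = next(c for c in cv.view if c.name == 'Cluster01')

spec = vim.cluster.ConfigSpecEx()
spec.dasConfig = vim.cluster.DasConfigInfo()
spec.dasConfig.admissionControlEnabled = True
# failoverLevel = number of host failures to tolerate; vSphere 5.0 accepts up to 31.
spec.dasConfig.admissionControlPolicy = \
    vim.cluster.FailoverLevelAdmissionControlPolicy(failoverLevel=2)
cluster.ReconfigureComputeResource_Task(spec=spec, modify=True)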
Change in VM Monitoring
Prior to vSphere 5.0, VM/App Monitoring was implemented by HA code in VPXA. As of vSphere 5.0, it is handled by the HA agent itself. This means that the “VM/App Monitoring” logic lives within the HA agent.
The agent uses the “Performance Manager” to monitor disk and network I/O; VM/App Monitoring uses the “usage” counters for both disk and network, and it requests these counters once enough heartbeats have been missed to trigger the configured policy.
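VM and application monitoring, along with its sensitivity settings, is configured per cluster. A minimal pyVmomi sketch follows (placeholder connection details as before; the interval values are just examples):

from pyVim.connect import SmartConnect
from pyVmomi import vim
import ssl

si = SmartConnect(host='vcenter.example.com', user='administrator@vsphere.local',
                  pwd='changeme', sslContext=ssl._create_unverified_context())
content = si.RetrieveContent()
cv = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.ClusterComputeResource], True)
cluster = next(c for c in cv.view if c.name == 'Cluster01')

spec = vim.cluster.ConfigSpecEx()
spec.dasConfig = vim.cluster.DasConfigInfo()
spec.dasConfig.vmMonitoring = 'vmAndAppMonitoring'  # or 'vmMonitoringOnly'
spec.dasConfig.defaultVmSettings = vim.cluster.DasVmSettings(
    vmToolsMonitoringSettings=vim.cluster.VmToolsMonitoringSettings(
        enabled=True,
        failureInterval=30,      # seconds without heartbeats before HA reacts
        minUpTime=120,           # grace period after a VM powers on
        maxFailures=3,           # resets allowed within maxFailureWindow
        maxFailureWindow=3600))  # rolling window, in seconds
cluster.ReconfigureComputeResource_Task(spec=spec, modify=True)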