During the past weeks I have received a couple of questions about the VM Component Protection (VMCP) feature introduced in vSphere 6.0. VMCP adds another layer of protection for virtual machines, since you can configure different behaviours for different storage-related problems such as All Paths Down (APD) and Permanent Device Loss (PDL).
I know there are many blog posts describing APD and PDL, so I’ll only provide a very short description.
All Paths Down (APD) – The ESXi host detects a problem with the storage device, can’t access it, and cannot determine whether the device is gone forever or will come back in X seconds. During the APD timeout period the ESXi host continues to send I/O requests to the storage device. Once the APD timeout is reached, VM I/O is still sent to the storage device, but non-VM I/O (e.g. mounting a storage device) is fast-failed with status NO_CONNECT. The APD timeout has been around for a few years; before it existed, the ESXi hostd process could end up taking all the CPU, making the ESXi host more or less unresponsive.
Permanent Device Loss (PDL) – PDL differs from APD in that the ESXi host is told by the storage array, via a SCSI sense code, that the storage device is unavailable. From the ESXi host’s perspective, I/O is stopped immediately when the PDL state of a storage device is detected. PDL configuration has also been around for some time, but it was previously done via advanced settings or via specific files on the ESXi hosts.
I have previously written about PDL and how you can configure the behaviour during a PDL event in the following blog posts:
- ESXi host VMkernel.Boot.terminateVMOnPDL configuration using PowerCLI
- ESXi host disk.terminateVMOnPDLDefault configuration using PowerCLI
- VMware stretched cluster and VM swap file placement
- ESXi host advanced settings
APD and PDL time schedule
The following time schedule applies to APD and PDL:
- APD
  - 0s: APD – Timer starts.
  - 140s: APD – The ESXi host declares an APD timeout and fast-fails non-VM I/O to the affected devices. This timeout period can be changed.
  - 140s-320s: APD – The time after the APD timeout is reached but before the VMCP timeout is reached. If the failed storage device comes back online during this period, you can either leave the VMs as they are or hard reset the affected VMs via the “Response for APD recovery after APD timeout” configuration option.
  - 320s: APD – The VMCP timeout is reached and the action configured for “Response for Datastore with All Paths Down (APD)” is started.
- PDL
  - 0s: PDL – VMs will be restarted on healthy ESXi hosts in the vSphere cluster.
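The APD part of the schedule can be sketched as a small lookup that maps the seconds elapsed since the APD started to the phase the host is in. This is only an illustration of the default timings listed above, not anything ESXi exposes: the function name `apd_phase`, the phase labels, and the 180s delay constant (derived here as 320s minus 140s) are all assumptions made for the example.

```python
# Illustrative model of the default APD schedule described in this post.
# All names below are hypothetical; they are not ESXi or vSphere HA settings.

APD_TIMEOUT_S = 140                              # default APD timeout
VMCP_DELAY_S = 180                               # derived: 320s - 140s
VMCP_TIMEOUT_S = APD_TIMEOUT_S + VMCP_DELAY_S    # 320s in total


def apd_phase(elapsed_s: float) -> str:
    """Return the APD phase for a given number of seconds since the APD began."""
    if elapsed_s < 0:
        raise ValueError("elapsed_s must be non-negative")
    if elapsed_s < APD_TIMEOUT_S:
        # The APD timer is running; the host keeps retrying I/O.
        return "apd-timer-running"
    if elapsed_s < VMCP_TIMEOUT_S:
        # APD timeout declared; non-VM I/O is fast-failed.
        return "apd-timeout-reached"
    # VMCP timeout reached; the configured APD response is started.
    return "vmcp-response"


if __name__ == "__main__":
    for t in (0, 139, 140, 319, 320):
        print(f"{t:>3}s -> {apd_phase(t)}")
```

If you change the APD timeout on your hosts, the boundaries shift accordingly, so treat the constants as placeholders for whatever your environment uses.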
Back to VMCP, which has a timeout of 5 minutes and 20 seconds (320s) that includes the APD timeout of 140s. This feature gives you more configuration options for APD and PDL than were available in previous versions. You enable VMCP in the vSphere HA configuration by ticking the check box next to “Protect against Storage Connectivity Loss”.
The different configuration options available for VMCP are:
- VM restart priority
- Response for Host Isolation
- Response for Datastore with Permanent Device Loss (PDL)
- Response for Datastore with All Paths Down (APD)
- Response for APD recovery after APD timeout
I will briefly cover the three new configuration options:
- Response for Datastore with Permanent Device Loss (PDL)
  - Three configuration options are available:
    - Disabled – No actions are taken for the affected VMs.
    - Issue events – The administrator will be notified but no actions are taken for the affected VMs.
    - Power off and restart VMs – Affected VMs will be killed on the affected ESXi host or hosts, and vSphere HA will try to start the VMs on ESXi hosts that still have connectivity to the storage device.
- Response for Datastore with All Paths Down (APD)
  - Four options are available:
    - Disabled – No actions are taken for the affected VMs.
    - Issue events – The administrator will be notified but no actions are taken for the affected VMs.
    - Power off and restart VMs (conservative) – Affected VMs will be killed by the affected ESXi host or hosts only once they have determined that other ESXi hosts will be able to start the VMs. Communication with the vSphere HA master determines whether enough resources are available. If the affected ESXi hosts cannot communicate with the vSphere HA master, they will take no action.
    - Power off and restart VMs (aggressive) – Affected VMs will be killed by the affected ESXi host or hosts even if they cannot determine whether other ESXi host(s) can start the VMs. This happens when the vSphere HA master is not available, and the killed VMs might not be able to start on other ESXi hosts because of resource constraints.
- Response for APD recovery after APD timeout – Determines whether vSphere HA should take any action if the storage device comes back online after the APD timeout (140s) has been reached but before the VMCP timeout (320s) has been reached.
  - Two options are available:
    - Disabled – No actions are taken for the affected VMs.
    - Reset VMs – The affected VMs will be reset (hard reset) on the same ESXi host they were running on before the APD happened.
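To make the difference between the APD options easier to see, here is a minimal Python sketch of the decisions described above. It is a simplified model of the behaviour as explained in this post, not vSphere HA's actual logic or API. The enum, the function names, the returned strings and the `restart_confirmed` flag (meaning the affected host has confirmed via the vSphere HA master that other hosts can start the VMs) are all made up for the illustration.

```python
from enum import Enum

# Hypothetical model of the APD options; none of these names exist in vSphere.
APD_TIMEOUT_S = 140    # default APD timeout
VMCP_TIMEOUT_S = 320   # default VMCP timeout (includes the APD timeout)


class ApdResponse(Enum):
    DISABLED = "Disabled"
    ISSUE_EVENTS = "Issue events"
    RESTART_CONSERVATIVE = "Power off and restart VMs (conservative)"
    RESTART_AGGRESSIVE = "Power off and restart VMs (aggressive)"


def action_at_vmcp_timeout(response: ApdResponse, restart_confirmed: bool) -> str:
    """Action taken for affected VMs once the VMCP timeout is reached.

    restart_confirmed is True only when the host could reach the HA master
    and learned that other hosts are able to start the VMs.
    """
    if response is ApdResponse.DISABLED:
        return "no action"
    if response is ApdResponse.ISSUE_EVENTS:
        return "notify administrator"
    if response is ApdResponse.RESTART_CONSERVATIVE:
        # Conservative: only kill VMs when a restart elsewhere is confirmed.
        return "kill and restart VMs" if restart_confirmed else "no action"
    # Aggressive: kill VMs even when a restart elsewhere cannot be confirmed.
    return "kill and restart VMs"


def action_on_apd_recovery(reset_vms: bool, recovered_at_s: float) -> str:
    """Action when storage returns recovered_at_s seconds after the APD began."""
    in_window = APD_TIMEOUT_S <= recovered_at_s < VMCP_TIMEOUT_S
    if reset_vms and in_window:
        return "hard reset VMs on the same host"
    return "no action"


if __name__ == "__main__":
    for option in ApdResponse:
        print(option.value, "->", action_at_vmcp_timeout(option, restart_confirmed=False))
```

The sketch deliberately leaves out the case where the master is reachable and reports too few resources under the aggressive option, since that case is not covered in this post.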
Other vSphere 6.0 related blog posts can be found here.