Virtualization The Future: Using VPLEX Metro with VMware HA

Purpose

This article provides information about Partner Supported Products for the VPLEX Metro Distributed Virtual Volume with VMware HA and vMotion. This configuration was verified by and is directly supported by EMC.

Note: This means that this solution is not directly supported by VMware. For issues with this configuration, contact EMC directly. VMware offers best effort support to the partner to fix vSphere-related problems that may be found in the field. It is the partner's responsibility to verify that the configuration functions with future major and minor releases of vSphere, as VMware does not guarantee that compatibility with future releases is maintained.

Resolution

What is EMC VPLEX?

EMC VPLEX is a federation solution that can be stretched across two geographically dispersed datacenters separated by synchronous distances (maximum distance of separation = 100 km, maximum round trip latency = 5 msec). It provides simultaneous access to storage devices at two sites by creating a VPLEX Distributed Virtual Volume that is supported on each side by a VPLEX Cluster.

Each VPLEX Cluster is itself highly available, scaling from two directors per VPLEX Cluster to eight directors per VPLEX Cluster, where each director is supported by independent power supplies, fans, and interconnects, making each VPLEX Cluster resilient to single point failures.

For more information about Distributed Virtual Volume, see the Additional Information section in this article.

Validated Configuration

This graphic illustrates the configuration validated by EMC and VMware using VMware HA with a VPLEX Metro Deployment.

Supported Use Cases for VMware HA Cluster over Distance with VPLEX Metro

This table shows the supported use cases with VPLEX configuration:

Case	Support
Simultaneous access to a shared Distributed Virtual Volume from two separate ESX clusters	Supported
HA between a host in ESX cluster 1/Data Center 1 to a host in ESX cluster 1/Data Center 2 leveraging the shared Distributed Virtual volume	Supported
Stretching a single vSphere cluster between sites leveraging a Distributed Virtual volume with VMware High Availability (HA) and Distributed Resource Scheduler (DRS)	Supported
VPLEX Metro and vSphere cluster in tandem with vSphere 4.1 DRS VM-Host Affinity Rules	Not Supported

For additional details of Distributed Virtual Volumes on VPLEX, see the Best Practices section of this article.

The DRS VM Host Affinity feature was introduced in vSphere 4.1.

Configuration Requirements

These requirements must be satisfied to support this configuration:

The maximum round trip latency on both the IP network and the Fibre-Channel Network between the two sites must not exceed 5 milliseconds (ms). The IP network supports the VMware ESX/ESXi hosts and the VPLEX Management Console, and the Fibre Channel network is required by inter-cluster links connecting the two VPLEX Clusters within VPLEX Metro.
The ESX servers in both data centers must have a private network on the same IP subnet and broadcast domain.
Any IP subnet used by the virtual machines must be accessible from ESX servers in both data centers. This requirement is important so that clients accessing VMs running on ESX servers at either site continue to function smoothly after any VMware HA-triggered VM restart event.
The data storage locations, including the boot device used by the virtual machines, must be active and accessible from ESX servers in both data centers.
vCenter Server must be able to connect to ESX servers in both data centers.
The VMware datastores for the VMs running in the ESX Cluster must be provisioned on Distributed Virtual Volumes.
The maximum number of hosts in the HA cluster must not exceed 8 hosts, with 4 hosts on each site.
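
The measurable requirements above lend themselves to a simple pre-deployment check. The following Python sketch is illustrative only and is not an EMC or VMware tool: the SiteLinkCheck fields and the requirement_violations function are hypothetical names, and the values passed in are assumed to come from your own measurements. Only the 5 ms round-trip limit is taken from the requirements list.

```python
from dataclasses import dataclass
from typing import List

MAX_RTT_MS = 5.0  # maximum round-trip latency stated in the requirements


@dataclass
class SiteLinkCheck:
    """Measured facts about one two-site deployment (hypothetical structure)."""
    ip_rtt_ms: float                         # round-trip latency on the IP network
    fc_rtt_ms: float                         # round-trip latency on the Fibre Channel network
    same_subnet_and_broadcast_domain: bool   # ESX private network spans both sites
    vcenter_reaches_both_sites: bool
    datastores_on_distributed_volumes: bool


def requirement_violations(check: SiteLinkCheck) -> List[str]:
    """Return a list of violated requirements; an empty list means none were found."""
    problems = []
    if check.ip_rtt_ms > MAX_RTT_MS:
        problems.append("IP network round-trip latency exceeds 5 ms")
    if check.fc_rtt_ms > MAX_RTT_MS:
        problems.append("Fibre Channel round-trip latency exceeds 5 ms")
    if not check.same_subnet_and_broadcast_domain:
        problems.append("ESX private network is not on one subnet and broadcast domain")
    if not check.vcenter_reaches_both_sites:
        problems.append("vCenter Server cannot reach ESX servers in both data centers")
    if not check.datastores_on_distributed_volumes:
        problems.append("datastores are not provisioned on Distributed Virtual Volumes")
    return problems
```

An empty result does not prove that a configuration is supported; it only means that none of the checks encoded here failed.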

Note:

The ESX servers forming the VMware HA cluster can be distributed across the two sites. The HA cluster can start a virtual machine on a surviving ESX host, and that ESX host accesses the Distributed Virtual Volume through the storage path at its own site.
VPLEX 5.0 and ESX 4.1 are tested in this configuration without the VPLEX Witness.

For any additional requirement for VPLEX Distributed Virtual Volumes, see the EMC VPLEX Best practices document.

Understanding Preferred Site for Distributed Virtual Volumes

For each Distributed Virtual Volume, VPLEX defines a detach rule. When there is a communication failure between the two clusters in the VPLEX Metro, this detach rule identifies which VPLEX cluster in a VPLEX Metro should detach its mirror leg, thereby allowing service to continue. The detach rule effectively defines a preferred site if VPLEX clusters lose communication with each other. The purpose of having a defined preferred site is to ensure that there is no possibility of a "split brain" caused by both VPLEX Clusters continuing to allow I/O during communication failure.

After a complete communication failure between the two VPLEX Clusters, the preferred site continues to provide service to the Distributed Virtual Volume. The other VPLEX cluster suspends I/O service to the volume and is referred to as the non-preferred site. The detach rule is set at the Distributed Virtual Volume level, so any given site can be the preferred site for some Distributed Virtual Volumes and the non-preferred site for others. A VPLEX Metro instance can support up to 5000 Distributed Virtual Volumes, and each such volume has its own detach rule. It is therefore possible for the same VPLEX Cluster (and therefore the ESX servers connected to it) to be on the preferred site with respect to one Distributed Virtual Volume but on the non-preferred site with respect to another Distributed Virtual Volume.

There are two conditions that can cause the VPLEX clusters to lose communication:

Total VPLEX Cluster failure at one site (failure of all directors in a VPLEX Cluster): A complete VPLEX Cluster failure triggers the detach rule behaviors because the surviving VPLEX Cluster cannot distinguish between interlink communication loss and VPLEX Cluster failure. As a result, Distributed Virtual Volumes whose preferred site is the surviving VPLEX Cluster will continue to run without interruption. Distributed Virtual Volumes whose preferred site is the failed VPLEX Cluster site will enter I/O suspension until manual intervention is performed.
Failure of the inter-cluster communication links (VPLEX Cluster partition): The VPLEX Cluster partition case will also trigger the execution of the detach rule. Each Distributed Virtual Volume will allow I/O to continue on its preferred site and suspend I/O on its non-preferred site.

When the VPLEX Cluster failure or VPLEX Cluster partition condition is resolved, the VPLEX Metro Distributed Virtual Volume gets re-established enabling I/O on both VPLEX Metro sites.
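
The detach rule behavior described in this section can be sketched as a small Python model. This is an illustration only, not VPLEX code: the site labels "A" and "B", the Event enum, and the io_state function are hypothetical names chosen for the example.

```python
from enum import Enum
from typing import Dict, Optional

SITES = ("A", "B")  # hypothetical labels for the two VPLEX Clusters


class Event(Enum):
    PARTITION = "failure of the inter-cluster communication links"
    CLUSTER_FAILURE = "total failure of one VPLEX Cluster"


def io_state(preferred_site: str, event: Event,
             failed_site: Optional[str] = None) -> Dict[str, str]:
    """Return the I/O state of one Distributed Virtual Volume at each site.

    The surviving cluster cannot tell a partition from a peer failure, so in
    both cases it serves I/O only if it is the volume's preferred site.
    """
    if preferred_site not in SITES:
        raise ValueError("preferred_site must be 'A' or 'B'")
    if event is Event.CLUSTER_FAILURE and failed_site not in SITES:
        raise ValueError("failed_site is required for a cluster failure")

    states = {}
    for site in SITES:
        if event is Event.CLUSTER_FAILURE and site == failed_site:
            states[site] = "down"
        elif site == preferred_site:
            states[site] = "active"
        else:
            # Stays suspended until links are restored or an administrator intervenes.
            states[site] = "suspended"
    return states
```

Because the rule is evaluated per volume, calling io_state once for each volume with its own preferred_site shows how the same site can be active for some volumes and suspended for others during a single event.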

Tested Scenarios

Terminology

VMs running in the preferred site: This refers to VMs running on the preferred site for the Metro Distributed Virtual Volume supporting the datastore for those VMs.
VMs running in the non-preferred site: This refers to VMs running on the non-preferred site for the Metro Distributed Virtual Volume supporting the datastore for those VMs.

Scenario	VPLEX Behavior	VMware HA Impact
Single VPLEX back-end (BE) path failure	VPLEX will switch to alternate paths to the same BE Array and continue to provide access to the Metro Distributed Virtual Volumes exposed to the ESX Servers.	None.
Single VPLEX front-end (FE) path failure	The ESX server will be expected to use alternate paths to the Metro Distributed Virtual Volumes.	None.
BE Array failure (preferred site for a Metro Distributed Virtual Volume)	VPLEX will continue to provide access to the Metro Distributed Virtual Volume through the non-preferred site BE array. When access to the array is restored, the storage volumes from the preferred site BE array will be resynchronized automatically.	None.
BE array failure (non-preferred site for a Metro Distributed Virtual Volume)	VPLEX will continue to provide access to the Metro Distributed Virtual Volume using the preferred site BE array. When access to the array is restored, the storage volumes from the non-preferred site BE array will be rebuilt automatically.	None.
Single front-end switch failure (preferred site for a Metro Distributed Virtual Volume)	VPLEX will continue to provide access to the Metro Distributed Virtual Volume via alternate paths to the same VPLEX Cluster from the ESX Server.	None.
Single front-end switch failure (non-preferred site for a Metro Distributed Virtual Volume)	VPLEX will continue to provide access to the Metro Distributed Virtual Volume via alternate paths to the same VPLEX Cluster from the ESX Server.	None.
VPLEX director failure	VPLEX will continue to provide access to the Metro Distributed Virtual Volume through front-end paths available through other directors on the same VPLEX cluster.	None.
Complete site failure (where the preferred site for a Metro Distributed Virtual Volume is in the site that has failed)	VPLEX will suspend I/O on the Metro Distributed Virtual Volume on the non-preferred site. Once the administrator has determined that the site has failed, and that it is not a case of inter-site communication failure, the volumes on the non-preferred site can be unsuspended ("resumed") using the device resume-link-down command. Note that this process is intentionally manual. Although automatically resuming I/O would work in the site failure case, it would not work in the VPLEX Cluster partition case: issuing the unsuspend command automatically on the non-preferred site would make both sites simultaneously read-writeable, creating a potential split brain condition.	VMs running in preferred site: VMware HA will attempt to bring up the failed VMs (up to 5 times) on the ESX Servers supported by the non-preferred site for the Metro Distributed Virtual Volumes. These attempts will fail until the volumes are unsuspended on the non-preferred site. If the HA maximum restart limit is reached, the failed VMs need to be manually restarted on the non-preferred site. VMs running in non-preferred site: These VMs will see the I/O as being suspended and the guest OS may hang during this time. If VM Monitoring is turned on, it will attempt to reset the guest OS (the maximum number of resets is based on the VM Monitoring policy), but the attempts will fail until the volume is unsuspended. If the maximum reset limit of VM Monitoring is reached, the failed VMs will need to be restarted manually.
Complete site failure (where the non-preferred site for a Metro Distributed Virtual Volume is in the site that has failed)	VPLEX will continue to provide I/O access to the preferred site.	VMs running in preferred site: No impact. VMs running in non-preferred site: Given that the ESX Servers have failed, the VMs running on those ESX Servers also fail. VMware HA will automatically restart the VMs on the ESX Servers supported by the preferred site for the Metro Distributed Virtual Volume and no administrative action is necessary.
Add ESX Server(s) to the cluster	After the ESX Servers are registered and added to the appropriate VPLEX view, VPLEX will provide access to the provisioned Metro Distributed Virtual Volumes to the newly added host.	None.
Remove ESX Server(s) from the cluster	After the ESX Servers are removed from the appropriate VPLEX view and deregistered, the ESX Server can be removed.	When the ESX Servers are placed into Maintenance Mode, the VMs on the ESX host are vacated by vCenter using vMotion if DRS is enabled in the vSphere cluster. If DRS is not enabled, the VMs should be moved manually before putting the host into Maintenance Mode.
Multiple ESX Server failure(s) - power off.	None.	VMware HA will restart the VMs on any of the surviving ESX Servers within the VMware HA Cluster, as long as the VMware HA Admission Control Policy and HA release limits (at most 4 simultaneous host failures) are not exceeded.
Multiple ESX Server failure(s) - network disconnect.	None.	VMware HA will restart the VMs on any of the surviving ESX Servers within the VMware HA Cluster, as long as the isolation response policy is set to "Power Off" or "Shutdown". The HA Admission Control Policy and the HA release limits (maximum of 4 simultaneous host failures) must also be respected. If the isolation response policy is set to "Leave powered on", the VMs will continue to run on the isolated host and will be accessible for management once the host is no longer isolated.
Single ESX Server and a VPLEX director failure at same site	The surviving VPLEX directors on the VPLEX cluster with the failed director will continue to provide access to the Metro Distributed Virtual Volumes.	There is no impact to VMs running on the surviving ESX Servers. VMs running on the failed ESX Server will be restarted by VMware HA on the surviving ESX Servers that are part of the vSphere Cluster using VMware HA.
Single director and back-end path failure at same site	The surviving VPLEX directors on the VPLEX cluster with the failed director will continue to provide access to the virtual volumes. VPLEX will switch to alternate paths (if available) to the same back-end and continue to provide access to the Metro Distributed Virtual Volumes.	None.
ESX Server all paths down (encountered when the ESX Server loses access to its storage volumes i.e. VPLEX Volumes in this case).	None.	Ideally the I/Os in the ESX host should resume automatically once the paths are restored. In case that does not happen, the host may need to be rebooted to resume the I/Os. If the ESX Server is restarted, this will cause VMware HA to restart the failed VMs on other surviving ESX Servers within the VMware HA cluster.
VPLEX inter-site link failure; vSphere cluster management network intact.	VPLEX will transition Distributed Virtual Volumes on the non-preferred site to the I/O suspension state. On the preferred site, the Distributed Virtual Volumes will continue to provide access. Note that in this case, I/O at the non-preferred site should not be manually unsuspended. Given that both VPLEX Clusters survive, the preferred site will continue to allow I/O. Unsuspending I/O on the non-preferred site would make the same Distributed Virtual Volume read-writeable on both legs, creating a potential split brain condition. Once the inter-site links are restored, the Distributed Virtual Volume will become unsuspended on the non-preferred site.	VMs running in preferred site: No impact. VMs running in non-preferred site: These VMs will see all I/Os as suspended and the guest OS may hang during that time. If VM Monitoring is turned on, it will attempt to reset the VM on the same host (the maximum number of resets is based on the VM Monitoring policy), but the attempts will fail until the Distributed Virtual Volumes are unsuspended. If the maximum reset limit of VM Monitoring is reached, the failed VMs will need to be restarted manually. The datastore will not be marked as "unavailable", as the path is still active, but no I/Os will be processed.
Complete Dual Site failure.	Upon power on of a single VPLEX Cluster, VPLEX will intentionally keep all Distributed Virtual Volumes in the suspended state even if it is the preferred site until such time as it is able to reconnect to the other site or unless the administrator manually resumes I/Os on these volumes using the device resume-link-down command. This behavior is to account for the possibility that I/Os have continued on the other site (either automatically, if the other site was preferred or manually, if the other site was non-preferred) and thereby protect against data corruption.	If ESX servers are powered back on after all the Distributed Virtual Volumes are manually resumed, the HA Cluster will recover from “Total Cluster Failure” by powering-on all failed VMs automatically. So the recovery will be automated if the Distributed Virtual Volumes and ESX server startup order is maintained. If the startup order is not maintained, the failed VMs need to be restarted manually.
Director failure at one site (preferred site for a given Distributed Virtual Volume) and BE array failure at the other site (non-preferred site for a given Distributed Virtual Volume)	The surviving VPLEX directors within the VPLEX cluster with the failed director will continue to provide access to the Metro Distributed Virtual Volumes. VPLEX will continue to provide access to the Metro Distributed Virtual Volumes using the preferred site BE array.	None.
VPLEX inter-site link intact; vSphere cluster management network failure.	None.	VMs on each site will continue running on their respective hosts. Note: Because there is a network partition, VMware HA will attempt to restart the VMs on the other site (where the VM is not currently running) for the Metro Distributed Virtual Volumes, but these attempts will fail (after a maximum of 5 retries; this limit is configurable). This failure is desired and expected, and is governed by VMware file-level locking. Once the network partition is corrected, HA will need to be reconfigured on the cluster to fix the cluster network partition.
VPLEX inter-site link failure; vSphere cluster management network failure.	VPLEX will suspend I/O on the non-preferred site for a given Distributed Virtual Volume. The volume will continue to be accessible on its preferred site. Note that in this case, I/O at the non-preferred site should not be manually unsuspended. Given that both VPLEX Clusters survive, the preferred site will continue to allow I/O. Unsuspending I/O on the non-preferred site would make the same Metro Distributed Virtual Volume read-writeable on both legs, creating a potential split brain condition. Once the inter-site networks are restored, the Distributed Virtual Volume will become unsuspended on the non-preferred site.	VMs running in preferred site: The powered-on VMs will continue to run. This is an HA split brain situation; the non-preferred site will consider the hosts of the preferred site dead and will try to restart the powered-on VMs of the preferred site. The power-on attempts will fail because of VMFS file-level locks. By default, HA will make five retries to power on the VMs. VMs running in non-preferred site: These VMs will see their I/O as suspended; the guest OS may hang during this time. If VM Monitoring is turned on, it will attempt to reset the VMs on the same host (the maximum number of resets is based on the VM Monitoring policy), but the attempts will fail until the volume is unsuspended. If the maximum number of reset attempts is reached, manual intervention will be needed to reset the failed VMs. The datastore will not be marked as "unavailable", as the path is still active, but no I/Os will be processed. This is also an HA split brain situation in which the preferred site hosts will consider the hosts of the non-preferred site dead and will try to restart the failed VMs. However, the restart attempts will fail because the running VMs on the non-preferred site hosts hold the VMFS file locks. Once the network partition is corrected, HA will need to be reconfigured on the cluster to fix the HA split brain situation.
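
The tested scenarios repeatedly distinguish between a confirmed site failure, where a suspended volume may be resumed manually with the device resume-link-down command, and an inter-site link failure, where it should not be. A minimal Python sketch of that decision follows. It is an aid to reading the table, not an operational tool: the Advice enum and the manual_resume_advice function are hypothetical names, and the sketch does not cover the complete dual site failure scenario.

```python
from enum import Enum


class Advice(Enum):
    NO_ACTION = "volume is serving I/O; nothing to resume"
    RESUME = "peer site confirmed failed; manual resume is appropriate"
    WAIT = "peer may still be serving I/O; restore the inter-site links instead"


def manual_resume_advice(volume_suspended: bool,
                         peer_site_confirmed_failed: bool) -> Advice:
    """Decide whether an administrator should manually resume a suspended volume.

    Resuming while the peer (preferred) site is still serving I/O would make
    the volume read-writeable on both legs, which risks a split brain.
    """
    if not volume_suspended:
        return Advice.NO_ACTION
    if peer_site_confirmed_failed:
        # Corresponds to running device resume-link-down on the surviving cluster.
        return Advice.RESUME
    return Advice.WAIT
```

The important input is peer_site_confirmed_failed: the VPLEX Cluster itself cannot determine this, which is why the source describes the resume step as intentionally manual.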

In the failure modes described above that involve VPLEX Cluster failures, the failed VPLEX Cluster rejoins the VPLEX Metro instance after recovery. If a Distributed Virtual Volume was running I/O on the peer site (either because the peer was the preferred site or because the administrator had manually chosen to resume I/O), the joining VPLEX Cluster will recognize this and immediately provide the latest data to the hosts accessing the same Distributed Virtual Volume through the joining VPLEX Cluster. Any stale data in the joining VPLEX Cluster is discarded or overwritten.

Best practice documents

EMC best practice documents for VPLEX are available at http://powerlink.emc.com/

Disclaimer: The partner products referenced in this article are hardware devices that are developed and supported by the stated partners. Use of these products is also governed by the end user license agreements of the partners. You must obtain the application, support, and licensing for using these products from the partners. For more information, see Support Information in this article.

Additional Information

Term	Definition
Metro Distributed Virtual Volume	A VPLEX virtual volume with complete, synchronized copies of data (mirrors), exposed through two geographically separated VPLEX clusters. Distributed Virtual Volumes can be simultaneously accessed by servers at two separate data centers.
Preferred Site	For each Distributed Virtual Volume, VPLEX defines a detach rule. When there is a communication failure between the two clusters in a VPLEX Metro, this detach rule identifies which VPLEX Cluster in the VPLEX Metro should detach its mirror leg, thereby allowing service to continue. The VPLEX Cluster that allows I/O to continue in the presence of failures is referred to as the preferred site for that particular Distributed Virtual Volume. Note: The other VPLEX Cluster will suspend I/O and is referred to as the non-preferred site for that Distributed Virtual Volume.

Source:

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1026692


Wednesday 22 August 2012
