Wednesday, February 6, 2013

Metro Storage Cluster

Metro Storage Clusters can be designed to maintain data availability beyond a single physical or logical site. In simple terms, this is a stretched VMware ESXi cluster spanning two sites with a stretched storage system. A MetroCluster configuration consists of two storage controllers, residing either in the same datacenter or in two different physical locations, clustered together. It provides recovery from any single storage component failure or multiple-point failure, and single-command recovery in case of a complete site disaster. Metro Storage Clusters can be built with storage systems from different vendors such as NetApp, EMC, HP, and IBM.
In this article we will discuss MetroCluster with NetApp storage systems, their specific requirements, and a solution overview.
MetroCluster using NetApp:
MetroCluster leverages NetApp HA controller failover (CFO) functionality to automatically protect against controller failures. Additionally, MetroCluster layers on local SyncMirror, cluster failover on disaster (CFOD), hardware redundancy, and geographical separation to achieve extreme levels of availability. Synchronous mirroring ensures that no data loss occurs. Hardware redundancy is in place for all MetroCluster components: controllers, storage, cables, switches (fabric MetroCluster), and adapters are all redundant.
A VMware HA/DRS cluster is created across the two sites using ESXi 5.0 or 5.1 hosts and managed by vCenter Server 5.0 or 5.1. The vSphere Management, vMotion, and virtual machine networks are connected using a redundant network between the two sites. It is assumed that the vCenter Server managing the HA/DRS cluster can connect to the ESXi hosts at both sites.
Based on the distance considerations, NetApp MetroCluster can be deployed in two different configurations:


  • Stretch MetroCluster:
This setup is ideal for two sites up to 500 m apart.
  • Fabric MetroCluster:
This setup is for sites separated by up to 100 km.

Configuration Requirements
These requirements must be satisfied to support this configuration:
  • For distances under 500 m, a stretch MetroCluster configuration can be used; for distances over 500 m but under 160 km (on systems running Data ONTAP 8.1.1), a fabric MetroCluster configuration can be used.
  • The maximum round-trip latency between the two sites must be less than 10 ms for Ethernet networks and less than 3 ms for SyncMirror replication.
  • The storage network must provide a minimum of 1 Gbps of throughput between the two sites for ISL connectivity.
  • ESXi hosts in the vMSC configuration should be configured with at least two different IP networks: one for storage, and the other for management and virtual machine traffic. The storage network handles NFS and iSCSI traffic between the ESXi hosts and the NetApp controllers, while the second (VM) network carries virtual machine traffic as well as management functions for the ESXi hosts. End users can choose to configure additional networks for other functionality such as vMotion and Fault Tolerance; this is recommended as a best practice but is not a strict requirement for a vMSC configuration.
  • FC switches are used in vMSC configurations where datastores are accessed via the FC protocol; ESXi management traffic remains on an IP network.
  • For NFS/iSCSI configurations, a minimum of two uplinks per controller must be used, and an interface group (ifgrp) should be created over the two uplinks in a multimode configuration.
  • The VMware datastores and NFS volumes configured for the ESXi hosts must be provisioned on mirrored aggregates.
  • The vCenter Server must be able to connect to the ESXi hosts at both sites.
  • An HA cluster must not exceed the maximum of 32 hosts.
  • A MetroCluster Tiebreaker machine should be deployed at a third site and must be able to reach the storage controllers at both Site 1 and Site 2 in order to initiate a CFOD in case of an entire site failure. The MetroCluster Tiebreaker (MCTB) solution is a plug-in that runs in the background as a Windows service or Unix daemon on an OnCommand Unified Manager (OCUM) host.
  • vMSC certification testing was conducted on vSphere 5.0 and NetApp Data ONTAP 8.1 operating in 7-Mode.
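The interface-group requirement above can be satisfied with Data ONTAP 7-Mode commands along these lines. This is a hedged sketch: the ifgrp name, interface names, and IP address are placeholders, so consult the network management guide for your exact release.

```shell
# Create a multimode interface group from two controller uplinks
# (e0a/e0b, ifgrp1, and the address are illustrative placeholders)
ifgrp create multi ifgrp1 -b ip e0a e0b
ifconfig ifgrp1 192.168.10.11 netmask 255.255.255.0
# Add the same lines to /etc/rc so the configuration persists across reboots
```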
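The latency and bandwidth limits above can be captured in a simple validation helper. This is an illustrative sketch, not a NetApp or VMware tool: the function name and structure are assumptions, while the thresholds are exactly the ones stated in the requirements.

```python
def check_vmsc_links(eth_rtt_ms, syncmirror_rtt_ms, isl_gbps):
    """Return a list of vMSC link-requirement violations (empty = compliant)."""
    violations = []
    if eth_rtt_ms >= 10:        # Ethernet round trip must be under 10 ms
        violations.append("Ethernet RTT must be less than 10 ms")
    if syncmirror_rtt_ms >= 3:  # SyncMirror replication RTT must be under 3 ms
        violations.append("SyncMirror RTT must be less than 3 ms")
    if isl_gbps < 1:            # ISL must provide at least 1 Gbps throughput
        violations.append("ISL throughput must be at least 1 Gbps")
    return violations

print(check_vmsc_links(5, 2, 2))   # → []  (all requirements met)
print(check_vmsc_links(12, 4, 0.5))  # three violations reported
```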
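The tiebreaker's role can be illustrated with a minimal decision-rule sketch. This is not the actual MCTB plug-in logic, only an assumption-laden illustration of why a third-site witness is placed outside both sites: CFOD should be initiated only when an entire site is truly down, never when the tiebreaker merely loses its own link to one site while the two sites can still see each other (which would risk split-brain).

```python
def should_initiate_cfod(site_a_reachable, site_b_reachable, inter_site_link_up):
    """Return the site that should take over, or None if no CFOD is warranted."""
    if site_a_reachable and site_b_reachable:
        return None            # both sites healthy: no action
    if not site_a_reachable and not site_b_reachable:
        return None            # tiebreaker itself is isolated: do nothing
    if inter_site_link_up:
        return None            # surviving site still sees its peer: suspect
                               # the tiebreaker's own link, not a site failure
    # One site is unreachable and the inter-site link is down:
    # the surviving site takes over the failed site's workload.
    return "site_a" if site_a_reachable else "site_b"
```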

Disaster Recovery vs Disaster Avoidance

One thing is certain: you cannot confuse disaster recovery with disaster avoidance; they are separate concepts even though they sound similar.
Let’s try to understand the difference between these concepts using VMware Stretched clusters and VMware Site Recovery Manager.
Stretched clusters consist of two or more physical ESXi hosts deployed at separate sites, less than 100 kilometres apart, contained within a single vSphere cluster. Put simply, it is a VMware cluster whose hosts are located at different sites but managed by the same vCenter Server.
By using a stretched cluster you can vMotion a VM from one site to another without any downtime; that is disaster avoidance, a solution for when you cannot afford downtime. In short, it is a very good active site-balancing solution.
With VMware SRM, the VM starts at the recovery site once the storage snapshot is attached to it by a sequence of automated scripts. This solution requires a restart of the VM, so downtime is unavoidable; that is a typical disaster recovery situation. Moreover, SRM requires two vCenter Servers for the DR solution.
Now, let's see the difference in tabular form:

Aspect            Stretched Cluster                 Site Recovery Manager
Downtime          None (live vMotion between sites) Required (VM restart at recovery site)
Use case          Disaster avoidance, site balancing Disaster recovery
vCenter Servers   One (single cluster)              Two
However, it is important to note that both solutions have their benefits depending on the requirement. Both enhance service availability: stretched clusters focus on data availability and service mobility, while Site Recovery Manager focuses on controlled, repeatable disaster recovery processes to recover from outages.