Hi all ..
During my study for VCAP-DCD, I listened to the Business Continuity and Disaster Recovery Design Workshop (BCDR Workshop) by VMware. It's a 4.5-hour online course and can be found in this link.
This workshop really builds a base for practicing DR sites and business continuity plans, and it's really helpful. If you're at VCP level or lower, you may find this summary incomplete and need to review the full modules, but if you're at VCAP level or higher, I think this can sum up the modules for you.
This is a summary of the first module of this workshop, containing the most important notes I took while listening to it online. Let's start:
Disaster Definition:
The definition of a disaster isn't the same for all organizations. In general, it means an event that causes major damage to the business or the organization.
It may be classified by its cause (natural or man-made) or by its area of effect: Catastrophe (wide geographical area), Disaster (a certain building or data center) or Service Disruption (failure of a single application or component inside the data center). Any disaster and its effect can be mitigated using the entire DR plan or parts of it.
DR Site Types:
Dedicated vs. Non-dedicated: A dedicated DR site is a site with idle hardware used only by failed-over systems in case of a disaster, while a non-dedicated DR site is a site (usually a regional campus) that runs another production environment and reserves some of its capacity for failover in case of a disaster. Dedicated DR sites, and only the dedicated type, can be Hot, Warm or Cold.
Hot vs. Warm vs. Cold: A hot DR site can be failed over to within minutes to hours of a disaster. A warm DR site needs hours to a few days to be ready for failover. A cold DR site needs many days to be ready for failover.
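The rule that only dedicated sites carry a Hot/Warm/Cold label can be written down as a small data structure. This is just a minimal sketch for illustration: the class names, the field names and the site names in the usage note are my own assumptions, not anything from the workshop.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Readiness(Enum):
    """Time a dedicated DR site needs before it is ready for failover."""
    HOT = "minutes to hours"
    WARM = "hours to a few days"
    COLD = "many days"


@dataclass(frozen=True)
class DRSite:
    name: str                              # hypothetical site name
    dedicated: bool                        # True: idle hardware reserved for failover only
    readiness: Optional[Readiness] = None  # only meaningful for dedicated sites

    def __post_init__(self) -> None:
        # Enforce the rule from the notes: Hot/Warm/Cold applies to dedicated sites only.
        if not self.dedicated and self.readiness is not None:
            raise ValueError("Hot/Warm/Cold applies to dedicated DR sites only")
```

For example, `DRSite("dr-east", dedicated=True, readiness=Readiness.WARM)` is accepted, while `DRSite("regional-campus", dedicated=False, readiness=Readiness.HOT)` raises a `ValueError`.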
Disaster Recovery Plan (DRP) vs. Business Continuity Plan (BCP):
DRP: A plan containing all the procedures and steps to be carried out during and right after the disaster to fail all the systems over to the DR site and get them back online as fast as possible. It also includes all the procedures to protect personnel and assets during the disaster.
BCP: A plan containing all the procedures required to run the systems and keep them online at the DR site with the maximum capacity available there. For a non-dedicated DR site, the BCP may also include the procedures for running both the recovered original systems and the DR site's own production systems side by side with no interference. It also includes all the steps and procedures required to fail the systems back to the original site after recovering from the disaster.
Steps of Creating DRPs & BCPs:
1-) Management Buy-in: Management should agree on the costs of the required DRP & BCP. This includes the software required for replication, the hardware required and any other facilities. All levels of management should participate in developing the DRP & BCP, testing them and executing them when required.
2-) Performing Business Impact Analysis (BIA): This includes:
a-) Identify Key Assets: Determining the most important items to be protected, such as software, user data, blueprints and implementation documents. In addition, it’s important to identify the critical business functions, map them to the identified key assets, and map how these critical functions depend on each other and on the key assets for continuity of the business.
b-) Define Loss Criteria: Defining the impact of losing any of the business's key assets in order to set the priority of these assets to the business.
c-) Define Maximum Tolerated Downtime (MTD): MTD is the maximum downtime of any key asset after which major damage to the business will occur and business continuity can’t be maintained. MTD is divided into the following categories:
i-) Critical: minutes to hours downtime.
ii-) Urgent: hours to 1 day downtime.
iii-) Important: within 3 days downtime.
iv-) Normal: up to 14 days downtime.
v-) Non-important: up to 30 days or more downtime.
3-) Define RPO (Recovery Point Objective): The RPO indicates how much data loss the business can tolerate, measured in time. For example, an RPO of 1 hr means that data must be restorable to its state no more than 1 hr before the disaster. Data written after the last recovery point is lost forever; in this example, up to the last hour of data before the disaster can be lost, and the business can tolerate that.
4-) Define RTO (Recovery Time Objective): The RTO indicates how much downtime the business can tolerate without major damage.
5-) Perform Risk Assessment: Identifying all the possible risks around the business, their probability and their possible impact in order to avoid them. The risk assessment should include all natural and man-made risks.
6-) Examine Regulatory Compliance: Always check for any legal requirements that DRP and BCP should fulfill.
7-) Develop DRP: Once all the previous steps are complete, all the required analysis is done and the DRP can be developed correctly. The DRP should contain all the pre-defined RPOs/RTOs, the key assets to protect and the procedure to bring all critical systems back online.
8-) Design DR Systems: This includes choosing the DR site and whether it’ll be Dedicated/Non-dedicated and Hot/Warm/Cold. It also includes designing the storage replication system with planned backups, and the network required for replication and for operating the DR system in case of a disaster. It also includes all the hardware/software required for failover of the main site.
9-) Creating Run-books: A run-book is a document containing all the required steps and procedures to fail the system over to the DR site in case of a disaster. It includes a step-by-step guide for re-building the system from scratch, reloading the critical applications and user data, and bringing the applications back into operation so users can continue to work. Each DR site should have its own run-books, one per key asset, to be used in a specific order based on the DRP and the RPO & RTO of each key asset. Any run-book should take into consideration the differences between the main site and the DR site in configurations and facilities. Run-books are hard to maintain because systems, applications and their dependencies change fast, which affects their restore techniques and restore order.
10-) Develop BCP: The BCP should contain all the required steps for maintaining the daily operations of systems and applications at the DR site, such as daily backups. It should also include detailed solutions for all expected problems resulting from the lack of some resources and facilities at the DR site, as well as the detailed procedures to fail all systems and applications back to the main site after recovery from the disaster.
11-) Test DRP and BCP: The DRP & BCP should be tested frequently to reveal any problems with them. Testing must be done carefully so as not to disrupt production systems, especially in the case of a non-dedicated DR site.
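To make the MTD categories, RPO and RTO from the steps above a bit more concrete, here is a minimal Python sketch. It is only an illustration under my own assumptions: the class and function names, the asset names and the timestamps are hypothetical, and breaking ties by the shorter RTO is just one possible way to order run-books, not a rule from the workshop.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import IntEnum


class MTDCategory(IntEnum):
    """MTD categories from the BIA step; a lower value means recover sooner."""
    CRITICAL = 1       # minutes to hours
    URGENT = 2         # hours to 1 day
    IMPORTANT = 3      # within 3 days
    NORMAL = 4         # up to 14 days
    NON_IMPORTANT = 5  # up to 30 days or more


@dataclass(frozen=True)
class KeyAsset:
    name: str               # hypothetical asset name
    category: MTDCategory   # MTD category assigned during the BIA
    rpo: timedelta          # maximum tolerated data-loss window
    rto: timedelta          # maximum tolerated downtime


def data_loss_window(last_recovery_point: datetime, disaster: datetime) -> timedelta:
    """Everything written after the last recovery point is lost."""
    return disaster - last_recovery_point


def meets_rpo(asset: KeyAsset, last_recovery_point: datetime, disaster: datetime) -> bool:
    """True if the lost window fits inside the asset's RPO."""
    return data_loss_window(last_recovery_point, disaster) <= asset.rpo


def meets_rto(asset: KeyAsset, disaster: datetime, back_online: datetime) -> bool:
    """True if the asset came back online within its RTO."""
    return back_online - disaster <= asset.rto


def recovery_order(assets: list[KeyAsset]) -> list[KeyAsset]:
    """Most critical category first; ties broken by shorter RTO (an assumption)."""
    return sorted(assets, key=lambda a: (a.category, a.rto))


if __name__ == "__main__":
    db = KeyAsset("billing-db", MTDCategory.CRITICAL, timedelta(hours=1), timedelta(hours=2))
    wiki = KeyAsset("intranet-wiki", MTDCategory.NORMAL, timedelta(days=1), timedelta(days=7))
    disaster = datetime(2014, 11, 8, 12, 0)
    last_replica = datetime(2014, 11, 8, 11, 15)  # last successful replication
    print(data_loss_window(last_replica, disaster))
    print(meets_rpo(db, last_replica, disaster))
    print([a.name for a in recovery_order([wiki, db])])
```

With an RPO of 1 hr, a last replica taken 45 minutes before the disaster is within the objective, while one taken 90 minutes before is not. The same kind of check can be run after each DR test to compare the measured recovery time against the RTO.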
Share the Knowledge ...
Update Log:
08/11/2014: Updated the Dedicated vs. Non-dedicated DR Sites comparison.