Business Continuity and Disaster Recovery

Introduction – Please Read First

These questions help ensure that you gather the information needed from a customer/prospect to put together a solution that meets their requirements, in addition to capturing specific metrics from tools like Collector or RVTools.

This list is not exhaustive, but should be used as a guide to make sure you’ve done proper and thorough discovery.  It is also imperative that you don’t ask a question without understanding why it is being asked.  Each entry therefore gives not only the question to ask, but also why we are asking the customer for an answer and why that answer matters in providing an optimal solution.

Questions marked with an asterisk (*) will likely require reaching out to a specialist/Solution Architect resource at Nutanix to go deeper with the customer on that topic/question.  Make sure you use the answers to these questions in the Scenario Objectives in Sizer when you create a new Scenario.  These questions should help guide you as to what the customer requirements, constraints, assumptions, and risks are for your opportunity. 

This is a living document, and the questions will be expanded and updated over time.


BCDR

1.  Has your organization done a Business Impact Analysis of your applications and workloads, and would you be able to share that with us?

Why ask? A business impact analysis should have rated the customer’s applications by business criticality. It may also have told them which applications are most important, what level of data loss and downtime would be acceptable, what the revenue and non-revenue impact would be, and what the systems and dependencies of those critical applications and processes are.  This information would be HIGHLY beneficial to have in designing a DR solution for your customer.

2.  If a Business Impact Analysis is not available, please break down the applications with their corresponding recovery point objectives and recovery time objectives.  Also break down the different retention requirements for on-prem and the recovery site.

Why ask? Recovery point objectives (RPOs) let us know which Nutanix products can be used to meet the objective, and the recovery time objective (RTO) may cause the design to shift depending on how fast the application needs to be back up and running. It is also good to have these objectives backed by the business rather than decided by the IT department alone. We should also mention that DR and backup are different. The retention for DR should be as long as is needed to meet the RTO. If you have a very long RTO, it is probably best served by a more cost-effective backup solution.

3.  For the applications listed, please provide the change rate. You can use existing backups and the deltas between them to estimate the change rate. For any aggressive RPO that may require near-sync, you will also want to know the write throughput for the application.

Why ask? The change rate drives how much data must be replicated to meet the RPO, and write throughput is the best indicator of the network bandwidth needed for near-sync.
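As a rough illustration of estimating change rate from backup deltas, the sketch below assumes you have the sizes of consecutive incremental backups and the interval between them. The function names and sample figures are hypothetical and are not output from any Nutanix tool.

```python
def daily_change_rate_gib(incremental_sizes_gib, interval_hours=24.0):
    """Average data changed per day, from sizes of consecutive incremental backups."""
    if not incremental_sizes_gib:
        raise ValueError("need at least one incremental backup size")
    total_hours = interval_hours * len(incremental_sizes_gib)
    return sum(incremental_sizes_gib) / total_hours * 24.0


def change_rate_percent(daily_change_gib, protected_capacity_gib):
    """Daily change as a percentage of the protected data set."""
    return daily_change_gib / protected_capacity_gib * 100.0


# Hypothetical example: seven nightly incrementals (GiB) for a 2,000 GiB application.
nightly = [40, 55, 38, 60, 47, 52, 58]
daily = daily_change_rate_gib(nightly)        # about 50 GiB per day
percent = change_rate_percent(daily, 2000)    # about 2.5 percent per day
```

Keep in mind that backup deltas only capture the net change per interval. Blocks that are overwritten several times between backups are counted once, so deltas can understate the write throughput that matters for near-sync.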

4.  What backup and/or DR products do you currently use?

Why ask? This helps us find synergies or help move them to products that are more Nutanix friendly.  Example: if they are using Zerto, they may be great candidates for near-sync.

5.  In the event of a failure, who is on call to help with the restore?  Will everyone be trained on recovery plans and procedures?

Why ask? This should help the customer to consider the operational implications of DR and help them see the need to simplify the recovery process. When bad events happen everyone needs to feel comfortable and empowered to help with the restoration.

6.  What are you using today for Disaster Recovery, and have you tested the solution to verify that it works as expected?  How has the solution met or not met your requirements and expectations?

Why ask? This is discovery to uncover the current topology and how it is currently performing.

7.  Are there any other systems that need to be taken into account for a proper disaster recovery plan that would be outside of the Nutanix infrastructure? (physical servers, mainframe, other physical devices)

Why ask? This is to make sure that the customer has thought through the entire scope necessary for a DR plan based on their requirements.

8.  What types of disasters are you planning for?

Why ask? Different disaster scenarios can involve different products or operational procedures. For example, think of the difference between a pandemic, where all your systems are working but your workforce can’t come into the office; a geographic event like an explosion or earthquake; a flooded ROBO office server closet; or a ransomware attack.  Help the customer walk through different scenarios and possible reasons for needing to invoke a recovery plan, and show how we can help give them a recovery-in-depth strategy (snapshots, backup, remote DC DR, stretch clusters, Cloud Provider DR, DRaaS with Xi Leap).

9.  Can your sites run independently? If so, for how long?

Why ask? This helps to determine the criticality of each site.

10.  What is your Retention Policy, and who made that decision (regulatory or self-imposed)?

Why ask? This helps us understand what is driving the need for retention and whether it is internally or externally mandated.

11.  Do you have any regulatory requirements such as HIPAA, PCI, SEC, SOX?

Why ask? This will impact the design of the recovery solution and which features and controls will need to be in place (encryption, data sovereignty, RBAC, logging, etc.).

12.  Do you need immutable copies of the data/VMs?

Why ask? This question allows the SE to gauge the customer’s sophistication as well as how long they need to keep the data. If there is a 7-year policy, the customer will most likely need a third-party backup tool in order to tier the data to an object store that supports WORM (Write Once Read Many).  This requirement could also stem from the need to mitigate the risk of ransomware.

13.  Is this going to be a net new DR Solution?

Why ask? This tells us whether we have the flexibility to design a net new DR strategy.

14.  Will the source and destination clusters both be Nutanix clusters?

Why ask? If Nutanix is not both the source and the destination, the design of the DR solution will be constrained (e.g., it will force the hypervisor choice, and replication can’t be array based but will need to be done with a separate software product like SRM or Zerto).  It may also require licensing changes.

15.  Do you have a requirement for the backups/DR to be on separate hardware?

Why ask? This helps us understand how to architect and select the infrastructure for the backup and DR targets.

16.  What does your desired replication topology look like?

Why ask? This tells us whether we need Professional or Ultimate AOS licenses and helps to map out the topology for replication (e.g., A->B->C; A->B; A->B and A->C).  It also allows us to discover whether sites are active/passive and exactly what the customer means by active and passive (e.g., a data center with power, VMs ready to power on, or VMs already powered on and able to switch over immediately).

17.  Does DR need to be the same performance as production?

Why ask? This helps us understand whether DR is a checklist item for them or a significant business requirement (i.e. they’ll lose significant money in the event of any downtime of the production site).  It also allows you to size the solution appropriately.  Be sure to get sign-off if they do decide to allow DR to be undersized.

18.  Do you need to have replicated copies of all the VMs but only plan to restore a subset?

Why ask? Helps determine potential licensing requirements and size of target cluster.
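To make the sizing implication concrete, here is a minimal sketch under simple assumptions: storage at the target must hold every replicated VM, while compute only needs to cover the VMs that will actually be restored. The `VM` fields and the sample inventory are hypothetical, and the sketch ignores snapshot retention overhead, data reduction, and resiliency overhead, all of which a real sizing exercise would include.

```python
from dataclasses import dataclass


@dataclass
class VM:
    name: str
    vcpus: int
    ram_gib: int
    storage_gib: int
    restore_in_dr: bool  # True if the VM must actually run at the recovery site


def dr_target_requirements(vms):
    """Storage covers every replicated VM; compute covers only the VMs to be restored."""
    restore = [vm for vm in vms if vm.restore_in_dr]
    return {
        "storage_gib": sum(vm.storage_gib for vm in vms),
        "vcpus": sum(vm.vcpus for vm in restore),
        "ram_gib": sum(vm.ram_gib for vm in restore),
    }


# Hypothetical inventory: all three VMs replicate, but only two are restored.
inventory = [
    VM("erp-db", 16, 128, 4000, True),
    VM("erp-app", 8, 64, 500, True),
    VM("test-build", 8, 32, 1000, False),
]
requirements = dr_target_requirements(inventory)
```

In this example the target needs storage for all 5,500 GiB but compute for only 24 vCPUs and 192 GiB of RAM.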

19.  How often do you test your disaster recovery plan and what does the plan look like?

Why ask? This helps us understand whether the customer has actually validated the disaster recovery plan they have implemented to ensure that it will work.  Oftentimes folks are optimistic about how well their plan will work, so having tested it brings a sense of reality to the plan and can help them course correct.

20.  Do you have existing runbooks that are used in the event of a disaster?

Why ask? Knowing this will help us understand their processes in the event of a disaster.  Also, if they don’t have any runbooks this will let us know we may need to help them out in putting a plan together that they can use and test.

21.  Databases: How are you currently backing up and protecting your database environments?

Why ask? This can give us information about how they are protecting their most critical assets.  They may already be using or licensed for database level replication or clustering technology which would give them more granular levels of control than a traditional storage replicated VM.  This can also help as part of the discussions around RPO and RTO for these more critical systems which may need higher levels of availability than other parts of the infrastructure. 

22.  What hypervisor(s) are you using? Are you open to cross-hypervisor DR?

Why ask? This will let us know what replication products we can use (if the source or destination cluster is non-Nutanix) and can open up the possibility of leveraging AHV for the target cluster if they are using ESXi on Nutanix as the source/primary cluster. 

23.  Do you have a separate network for DR traffic? Do you require encryption on those links?

Why ask? This helps us understand whether network segmentation needs to be configured in Prism for DR replication traffic and whether the customer needs to supply in-flight encryption for that network.

24.  What is the current bandwidth between sites that you plan to use for DR replication? Also, what is the latency between those sites?

Why ask? We need to know how big the pipes are between sites so that we can ensure that the RPO the customer has defined as their requirement can be met based on the rate of change.  We also need to ensure that the latency between sites meets the minimum requirements listed for Metro Availability or the Metro Witness.
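A back-of-the-envelope check of bandwidth against RPO can be sketched as follows. It assumes the data changed during one RPO window must be shipped within one window, and it ignores compression, deduplication, protocol overhead, and initial seeding. The function names and the `usable_fraction` default are hypothetical placeholders, not Nutanix guidance.

```python
def required_mbps(changed_gib_in_window, rpo_minutes):
    """Bandwidth (megabits/s) needed to ship one RPO window's changes within one window."""
    bits = changed_gib_in_window * 1024**3 * 8
    return bits / (rpo_minutes * 60) / 1_000_000


def rpo_is_feasible(changed_gib_in_window, rpo_minutes, link_mbps, usable_fraction=0.7):
    """Compare the requirement to the share of the link assumed available for replication."""
    return required_mbps(changed_gib_in_window, rpo_minutes) <= link_mbps * usable_fraction


# Hypothetical example: 10 GiB changes during the busiest hour, with a 60-minute RPO.
need = required_mbps(10, 60)                 # roughly 23.9 Mbps
ok_on_100 = rpo_is_feasible(10, 60, 100)     # fits within 70% of a 100 Mbps link
ok_on_20 = rpo_is_feasible(10, 60, 20)       # does not fit within 70% of a 20 Mbps link
```

Use the busiest window's change rate rather than the daily average, since the RPO has to hold during peak write activity.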

25.  What is the current rate of utilization of the network links between the sites you plan to use for replication traffic?

Why ask? These links may be used for other traffic, which could reduce the bandwidth that you assume you will have for replication traffic.  See if you can get utilization over a 30-day period, and if possible over several months, to see any trends of increase or decrease in utilization.
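One way to turn utilization history into a usable number is to look at a high percentile of existing traffic rather than the average. The sketch below assumes you can export utilization samples as percentages of link capacity from the customer's monitoring tool; the function names, the 95th-percentile choice, and the sample values are hypothetical.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def replication_headroom_mbps(link_mbps, utilization_pct_samples, pct=95):
    """Bandwidth left for replication when existing traffic sits at its pct-th percentile."""
    busy = percentile(utilization_pct_samples, pct)
    return link_mbps * (1 - busy / 100)


# Hypothetical utilization samples (percent of a 1,000 Mbps link).
samples = [20, 25, 30, 35, 40, 45, 50, 55, 60, 90]
headroom = replication_headroom_mbps(1000, samples)  # about 100 Mbps left at the 95th percentile
```

An average over these samples would suggest far more free bandwidth than is available during busy periods, which is why a percentile is used here.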

26.  How do you handle IP addresses on the recovery site for VMs that have failed over?

Why ask? This allows us to discover what type of networking failover scenario(s) the customer would prefer to use: Overlay Networks; Stretched Layer 2 subnet between sites; perform a full subnet failover from the primary to secondary site by updating routes to point to the new recovery site; allow IP addresses to change when failed over (this can cause obvious challenges of broken applications that hard code IP addresses, updating of DNS and cache entries, etc.).
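For the option where IP addresses change on failover, the sketch below illustrates one common convention: keep each VM's host offset and swap only the subnet. It uses Python's standard `ipaddress` module; the `remap_ip` helper and the example subnets are hypothetical and do not represent how any particular product performs the mapping.

```python
import ipaddress


def remap_ip(vm_ip, primary_subnet, recovery_subnet):
    """Map a VM's IP to the recovery subnet, keeping the same host offset."""
    ip = ipaddress.ip_address(vm_ip)
    src = ipaddress.ip_network(primary_subnet)
    dst = ipaddress.ip_network(recovery_subnet)
    if ip not in src:
        raise ValueError(f"{vm_ip} is not in {primary_subnet}")
    if src.prefixlen != dst.prefixlen:
        raise ValueError("subnets must be the same size to preserve host offsets")
    offset = int(ip) - int(src.network_address)
    return str(dst.network_address + offset)


# Hypothetical example: a VM at offset 57 in the primary subnet keeps offset 57 at the recovery site.
new_ip = remap_ip("10.1.20.57", "10.1.20.0/24", "10.2.20.0/24")
```

Even with a predictable mapping like this, DNS records, cached entries, and any hard-coded addresses in applications still have to be updated after failover.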