Do you need clustered systems or off-site failover? How frequently should backups be performed? What about SQL transaction logs?
These questions can only be answered once you have agreed on an RTO and RPO with the business units. So what exactly are they?
Recovery Time Objective (RTO)
The recovery time objective is the target time within which a service must be restored after a disaster. This could include restoring the underlying systems and user data, and performing tests.
When developing an RTO, it may be beneficial to set a separate timeframe for each failure type. For example, in certain improbable scenarios with high impact, such as a major fire in the server room, your RTO might be several days. On the other hand, a more typical failure, such as a failed disk array or destructive virus, may have an RTO of several hours.
It makes sense to put your resources (money) towards the more likely scenarios, whereas planning for Armageddon usually doesn't (unless the business you're supporting thinks otherwise).
Additionally, you may specify an RTO for complete service recovery and a separate one for more critical components of the service (e.g. the RTO for sending and receiving email could be 2 hours, but the RTO for the spam filtering and quarantine component could be 2 days.)
What type of RTO can you meet? Take a moment to consider the following things:
- How long does it take to retrieve tapes?
- ...copy gigabytes or terabytes of data from tape or disk?
- ...source replacement equipment?
- ...reconfigure the application after data is restored?
- ...perform testing to make sure your RPO was met?
- Who needs to be contacted to perform the recovery? What about service providers?
- Where is the recovery going to take place?
- What supporting systems does this service rely on?
There are many other things to consider too, and many of them can only be flushed out by testing your recovery procedures.
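As a rough illustration of the questions above, you can add up the time each recovery step takes and compare the total against a proposed RTO. The sketch below is only that: the step names and durations are made-up placeholders, the 24-hour target is hypothetical, and it assumes the steps run one after another rather than in parallel. Replace the figures with numbers measured during your own recovery tests.

```python
from datetime import timedelta

# Hypothetical step durations for one failure scenario (placeholders only).
recovery_steps = {
    "retrieve tapes from offsite storage": timedelta(hours=4),
    "source replacement equipment": timedelta(hours=24),
    "restore data from tape": timedelta(hours=6),
    "reconfigure the application": timedelta(hours=2),
    "test the restored service": timedelta(hours=2),
}


def estimate_recovery_time(steps):
    """Sum the step durations, assuming the steps run sequentially."""
    return sum(steps.values(), timedelta())


def meets_rto(steps, rto):
    """Return True if the estimated recovery time fits within the RTO."""
    return estimate_recovery_time(steps) <= rto


total = estimate_recovery_time(recovery_steps)
proposed_rto = timedelta(hours=24)  # hypothetical target agreed with the business

print(f"Estimated recovery time: {total.total_seconds() / 3600:.0f} hours")
print(f"Meets {proposed_rto.total_seconds() / 3600:.0f}-hour RTO: "
      f"{meets_rto(recovery_steps, proposed_rto)}")
```

With these placeholder numbers, sourcing replacement equipment alone uses up the whole 24-hour target, which is the kind of gap that usually only shows up once you write the steps down or test them.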
Recovery Point Objective (RPO)
The recovery point objective is the maximum amount of data loss that is acceptable in a disaster. This is specified as an amount of time, counting back from the failure, for which changes to data are missing. For example, if you run nightly backups and rotate them offsite every morning, your RPO could be 36 hours in a 24x7 business (i.e. catastrophic failure just before tapes were rotated).
When the RPO is developed, it is going to take fairly intimate knowledge of how each service works in order to determine what type of data loss can occur in particular failure scenarios. For example, did you know that the Exchange Hub Transport role may store queued email for several days and it is not included in your database backup? Are your developers caching changes in memory for some internal webapp for several hours and not flushing them to the database?
As with the RTO, you may want to consider multiple levels depending on the impact of the failure that occurred. Is a weekly offsite rotation OK, considering the likelihood of your onsite storage being damaged? Or do you need a SAN shipping data to an offsite replica in near-real-time?
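To compare rotation schedules, it can help to work out the worst-case data loss for each one. The sketch below assumes a simple model: a site-wide failure destroys everything onsite just before the next rotation, so the newest surviving copy went offsite one full rotation interval ago and was already some hours old at that point. The function name and the 12-hour backup age are assumptions chosen for illustration; the 12-hour figure happens to reproduce the 36-hour nightly example.

```python
from datetime import timedelta


def worst_case_offsite_loss(rotation_interval, backup_age_at_rotation):
    """Worst-case data loss when only the offsite copies survive.

    Just before a rotation, the newest offsite copy left the building one
    full rotation interval ago, and it was already `backup_age_at_rotation`
    old when it left.
    """
    return rotation_interval + backup_age_at_rotation


# Assumed: the backup is about 12 hours old by the time it goes offsite.
backup_age = timedelta(hours=12)

daily = worst_case_offsite_loss(timedelta(days=1), backup_age)
weekly = worst_case_offsite_loss(timedelta(weeks=1), backup_age)

print(f"Daily rotation:  up to {daily.total_seconds() / 3600:.0f} hours of data lost")
print(f"Weekly rotation: up to {weekly.total_seconds() / 3600:.0f} hours of data lost")
```

Under these assumptions, moving from daily to weekly rotation raises the worst-case loss from 36 hours to 180 hours, which gives the business units a concrete number to accept or reject.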
The recovery time and recovery point objectives can vary widely between organizations, or even between services within an organization. Additionally, the costs of your recovery systems skyrocket as those RTO and RPO numbers come down. However, it is hard to justify spending a single dime without first having a discussion with the business units or customers to determine what RTO and RPO should be targeted in your disaster recovery plan.