Disaster Recovery, also known as Incident Recovery, is the process of getting from an unplanned event back to normal operations within a predetermined amount of time. If you have a server outage, you should already have determined how long an outage is acceptable (driven by the steps required to recover from it) and have practiced the recovery steps to demonstrate that you can perform them in the time allotted.
As a simple example, we will assume you have a database server that is important to your business. The process owners might say to you, the IT Professional, that this server should never go down. You discuss the impractical nature of such a requirement, and you compromise on the outage requirement by agreeing that if it ever has an unplanned outage you will make sure it is back online within 2 hours.
What you have probably just done is paint yourself into a corner by agreeing to something that you can’t successfully complete.
First, let's understand a couple of important terms generally used to discuss Disaster Recovery.
RTO – The Recovery Time Objective (RTO) is the duration of time within which a business process must be restored after a disaster in order to avoid unacceptable consequences associated with an outage.
RPO – The Recovery Point Objective (RPO) is the longest interval of time that can pass during a disruption before the quantity of data lost exceeds the maximum the process owners have agreed to tolerate.
Using these two terms, we have an RTO (Recovery Time Objective), which in our example above is two hours. We do not yet have an agreed RPO (Recovery Point Objective); the way we choose to recover will determine how much data we actually lose. If we plan on restoring the system backup to handle our imaginary database server outage, we have to look at two important pieces of information: how long it will take to restore the server, and when the backup was completed.
In our example, the backup is small and will only take about 45 minutes to restore. That means we will easily hit our RTO, provided we configure the proper alerts and establish the ability to instantly respond to an outage. Can you imagine how difficult that will be at 3 am?
The system backup is completed at 2 am, each and every day, in our imaginary example. That means if the server fails at 3 pm today, and we only take 45 minutes to restore the backup, we might think we are providing a successful Disaster Recovery in our allowed time window. But the server was restored to its state as of 2 am, meaning that at 3:45 pm (the point at which we have finished restoring the server backup) the database contains nothing from 2 am through 3:45 pm: roughly 13 hours of lost work, plus the 45-minute outage. What was the Recovery Point Objective from our process owner? I can guarantee you that if you haven’t discussed this in advance, the process owner will have assumed the RTO and RPO were the same thing, and you have probably failed to meet their expectations through a lack of shared understanding and communication.
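The arithmetic in this example can be checked with a short Python sketch. It is only an illustration: the function name `recovery_outcome`, the calendar date, and the one-hour RPO are assumptions made for the sake of the example (the scenario has no agreed RPO), and it assumes the restore begins the moment the server fails.

```python
from datetime import datetime, timedelta


def recovery_outcome(last_backup, failure, restore_duration, rto, rpo):
    """Compare a single outage against the agreed RTO and RPO targets."""
    # Assumption: the restore starts at the moment of failure, so the
    # downtime equals the time needed to restore the backup.
    downtime = restore_duration
    # Everything written between the last backup and the failure is lost.
    data_loss = failure - last_backup
    return {
        "restored_at": failure + restore_duration,
        "downtime": downtime,
        "data_loss": data_loss,
        "meets_rto": downtime <= rto,
        "meets_rpo": data_loss <= rpo,
    }


day = datetime(2024, 1, 15)  # arbitrary date, chosen only for illustration
outcome = recovery_outcome(
    last_backup=day.replace(hour=2),         # nightly backup at 2 am
    failure=day.replace(hour=15),            # server fails at 3 pm
    restore_duration=timedelta(minutes=45),  # restore takes 45 minutes
    rto=timedelta(hours=2),                  # agreed with the process owners
    rpo=timedelta(hours=1),                  # hypothetical value, not from the scenario
)

for name, value in outcome.items():
    print(f"{name}: {value}")
```

With these inputs the restore finishes inside the two-hour RTO, yet 13 hours of data are gone, so any RPO shorter than 13 hours is missed.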
To avoid this issue, you need to understand both the RTO and RPO requirements from the process owner. I also advise you to get these values in writing, and to develop written process and procedure documents to meet these goals for each and every system. This means writing a Disaster Recovery Plan, a document that lists the steps required to recover from all types of disasters. These disaster types should include natural events like fires, floods, earthquakes, and tornadoes, as well as man-made events like human error, hacking, and data breaches.
Your response to a server that has crashed because of hacking should probably be different than if the server has crashed because of a defective hard drive. Your RTO and RPO might also be different to support those different responses. If our imaginary server is critical to the imaginary business, the process owners probably don’t care why the server is down and they only know the server has to be operational or the business will suffer.
You have some important thinking to do, so you should get started on either writing a plan or reviewing your existing plan before it is too late.