Effective Disaster Recovery Planning

Server Stack - @SeniorDBA

In your business, you might be the only one tasked with understanding what types of disasters can strike and with planning to keep those disasters from bringing down the business. As Alan Lakein said many years ago, “Failure to plan is planning to fail.” As an information technology professional, one of your many tasks is to understand the risks to your business systems and plan to prevent those risks from affecting your business, or to recover when they do.

About 40% of businesses do not re-open after a disaster and another 25% fail within one year according to the Federal Emergency Management Agency (FEMA). Similar statistics from the United States Small Business Administration indicate that over 90% of businesses fail within two years after a disaster.

Understand The Risk

Do you fully understand the risks to your business? Have you looked at the systems your business uses and depends on each day and thought about what would happen if those systems were unavailable? Have you thought about the common risks for the area? These risks could include tornadoes, earthquakes, hurricanes, floods, etc.

Disaster Map - @SeniorDBA

Maybe there are man-made risks unique to your location, like frequent power outages, dangerous break-ins, poor building construction, etc. Each of these unique threats can be just as dangerous as natural disasters. You don’t want someone stealing your servers or hard drives in the middle of the night, or cracks in the walls leading to mice chewing through your network or power cables.

Written Plan

You need to think about each of the risk scenarios and write down your plan for how you and your team would address each scenario to keep the business up and running with minimal downtime. You may have to adjust the plan to address concerns about cost and time, and there may be periodic changes as systems and risks change.

  1. List of Employees (what they do, when they do it, why they do it, etc.)
  2. Inventory Systems (office equipment, servers, laptops, etc.)
  3. Office Space Requirements (you will need space to restore your systems, but can everything be done remotely, or will the users need office space to access restored systems)
  4. Insurance and Budget Concerns (who will provide money during an actual recovery)
  5. Share The Plan (make sure you aren’t the only one with a copy of the plan, and make sure the plan can survive the disaster)

Testing

Just like database backups aren’t useful if you can’t restore them, a Disaster Recovery Plan is worthless if you can’t implement the plan. You should conduct a formal test at least once each calendar year, testing if the plan will work for one or more of the scenarios you are planning against. The test should be as realistic as possible, and you should have a method of measuring the level of success.
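
One possible way to measure the level of success is to record, for each system in the test, how long recovery actually took and compare it against a target agreed on beforehand. The sketch below is only an illustration: the system names, the target times, and the choice of "fraction of systems recovered on time" as the score are assumptions, not something the plan prescribes.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RecoveryResult:
    system: str                    # hypothetical system name
    target_minutes: int            # recovery target agreed on before the test
    actual_minutes: Optional[int]  # None means the system was never recovered


def score_test(results: List[RecoveryResult]) -> float:
    """Return the fraction of systems recovered within their target time."""
    if not results:
        return 0.0
    passed = sum(
        1
        for r in results
        if r.actual_minutes is not None and r.actual_minutes <= r.target_minutes
    )
    return passed / len(results)


# Made-up numbers from an imaginary annual test.
results = [
    RecoveryResult("database", target_minutes=120, actual_minutes=95),
    RecoveryResult("file-server", target_minutes=240, actual_minutes=300),
    RecoveryResult("email", target_minutes=480, actual_minutes=None),
    RecoveryResult("web", target_minutes=60, actual_minutes=45),
]

print(f"{score_test(results):.0%} of systems met their recovery target")
```

Tracking the same score from one annual test to the next gives a simple view of whether the written plan is improving.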

There will be issues, like a system that wasn’t included in the written plan or a technical issue that you didn’t know existed. An issue could be something as simple as unknown system passwords or a missing software installation key. But that is what a test is all about. You have to test to find those little things that were forgotten or unknown, and then update the written plan to make sure they aren’t an issue during the next test. Eventually you will have everything you need addressed in the plan, and the next test will go smoothly. That means in the event of an actual disaster, when your team is confused and under an elevated level of stress, you are more likely to get these core production systems up and running quickly.

Team Meeting - @SeniorDBA

Don’t allow your business to fail because of an interruption you could have resolved with the proper planning and some simple testing.

Effective Disaster Recovery Planning

In your business, you are probably the only one tasked with understanding what types of disasters can strike and with planning to keep those disasters from bringing down the business. As Alan Lakein said many years ago, “Failure to plan is planning to fail.” As an information technology professional, one of your many tasks is to understand the risks to your business systems and plan to prevent those risks from affecting your business, or to recover when they do.

About 40% of businesses do not re-open after a disaster and another 25% fail within one year according to the Federal Emergency Management Agency (FEMA). Similar statistics from the United States Small Business Administration indicate that over 90% of businesses fail within two years after a disaster.

Understand The Risk

Do you even understand the risks to your business? Have you looked at the systems your business uses and depends on each day and thought about what would happen if those systems were unavailable? Have you thought about the common risks for the area, including tornadoes, earthquakes, hurricanes, floods, etc.?

Maybe there are risks unique to your location, like frequent power outages, danger of break-ins, poor building construction, etc. Each of these unique threats can be just as dangerous as natural disasters. You don’t want someone stealing your servers or hard drives in the middle of the night, or cracks in the walls leading to mice chewing through your network or power cables.

Written Plan

You need to think about each of the risk scenarios and write down your plan for how you and your team would address those scenarios to keep the business up and running with minimal downtime. You may have to adjust the plan to address concerns about cost and time, and there may be periodic changes as systems and risks change.

  1. List of Employees (what they do, when they do it, why they do it, etc.)
  2. Inventory Systems (office equipment, servers, laptops, etc.)
  3. Office Space Requirements (can everything be done remotely, or will the users need office space to access restored systems)
  4. Insurance and Budget Concerns (who will provide money during an actual recovery)
  5. Share The Plan (make sure you aren’t the only one with a copy of the plan, and make sure the plan can survive the disaster)

Testing

Just like database backups aren’t useful if you can’t restore them, a Disaster Recovery Plan is worthless if you can’t implement the plan. You should conduct a formal test at least once each calendar year, testing if the plan will work for one or more of the scenarios you are planning against. The test should be as realistic as possible, and you should have a method of measuring the level of success.

There will be issues, ranging from a system that wasn’t included in the written plan, or a technical issue that you didn’t know existed, to something as simple as unknown system passwords or missing software installation keys. But that is what a test is all about. You have to test to find those little things that were forgotten or unknown, and then update the written plan to make sure they aren’t an issue during the next test. Eventually you will have everything you need addressed in the plan, and the next test will go smoothly. That means in the event of an actual disaster, when you are confused and under an elevated level of stress, you are more likely to get these core production systems up and running quickly.

Incident Recovery Planning

In your business, you are probably the only one tasked with understanding what types of disasters can strike and with planning to keep those disasters from bringing down the business. As Alan Lakein said many years ago, “Failure to plan is planning to fail.” As an information technology professional, one of your many tasks is to understand the risks to your business systems and plan to prevent those risks from affecting your business, or to recover when they do.

About 40% of businesses do not re-open after a disaster and another 25% fail within one year according to the Federal Emergency Management Agency (FEMA). Similar statistics from the United States Small Business Administration indicate that over 90% of businesses fail within two years after a disaster.

History

Modern technology incident recovery planning emerged in the mid-1970s as organizations started to build and depend on computer systems. In those days the systems were large mainframes, and they were fairly easy to document and to replicate for testing. By late 1978, Sun Information Systems (later renamed Sungard Availability Services) had been established in Philadelphia as the first commercial hot site vendor in the US.

While the market for companies that help businesses with disaster recovery planning grew through the 1980s, the growth of the internet caused many more companies to look for a robust approach to disaster planning. With the recent growth of cloud computing, it doesn’t matter as much where systems are located. It only matters that the systems are secure, stable, and reliable.

Understand The Risk

Do you understand the risks to your business? Have you looked at the systems your business uses and depends on each day and thought about what would happen if those systems were unavailable? Have you thought about the common risks for your area (tornadoes, earthquakes, hurricanes, blizzards, floods, wildfires, volcanic eruptions, etc.) and considered how you would deal with these issues?

Maybe there are risks unique to your location, like frequent power outages, danger of break-ins, poor building construction, etc. Each of these unique threats can be just as dangerous as natural disasters.

Written Plan

You need to think about each of the risk scenarios and write down your plan for how you and your team would address those scenarios to keep the business up and running with minimal downtime. You may have to adjust the plan to address concerns about cost and time, and there may be periodic changes as systems and risks change.

  1. List of Employees (what they do, when they do it, why they do it, etc.)
  2. Inventory Systems (office equipment, servers, laptops, etc.)
  3. Office Space Requirements (can everything be done remotely, or will the users need office space to access restored systems)
  4. Insurance and Budget Concerns (who will provide money during an actual recovery)
  5. Share The Plan (make sure you aren’t the only one with a copy of the plan, and make sure the plan can survive the incident)

This written plan is a “living document” that will change as often as your business changes. The idea is to keep the business running even if everything stops working. You have to look at everything important to your company and determine how you would keep it working if there were a catastrophic failure of one or more of those systems. You don’t want to write this plan by yourself, as everyone in the business has a stake in keeping the business operational.

What would you do if your data center was struck by a tornado, hurricane, or earthquake? Would those systems be protected from damage? What if there was a major failure of your systems, the power infrastructure, the telecommunications network, etc.? Do you have adequate data and system backups? How long would those systems be down before you could purchase new hardware, configure the new hardware for your network, restore your data from backups, test the system integration, and implement those new systems?
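
To make the "how long would those systems be down" question concrete, one option is to write down an estimate for each recovery phase and add them up. The sketch below assumes the phases run one after another, and every duration in it is a made-up placeholder; substitute figures from your own vendors and tests.

```python
# Hypothetical durations, in hours, for each recovery phase.
# These numbers are placeholders for illustration only.
phases = {
    "purchase new hardware": 72,
    "configure hardware for the network": 8,
    "restore data from backups": 8,
    "test system integration": 4,
    "implement the new systems": 4,
}


def estimate_downtime_hours(phase_hours: dict) -> int:
    """Total downtime if every phase runs one after another."""
    return sum(phase_hours.values())


total = estimate_downtime_hours(phases)
slowest = max(phases, key=phases.get)

print(f"Estimated downtime: {total} hours")
print(f"Longest phase: {slowest}")
```

Seeing the longest phase called out separately can help decide where a spare pool or a rental agreement would shorten the outage the most.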

You have to think about the full range of disasters, from the failure of a single piece of hardware to a massive failure caused by flooding or another major natural disaster. What will your response be to a data breach? Do you have any contracts or agreements that will allow you to borrow or rent any required hardware or software that will get you through the first 30 days of a disaster? Do you expect to download your backups or installation media from the internet? What if there isn’t any internet access, your backup site is down, or the access is too slow to be useful?

Begin small by making a plan that addresses the most likely disasters. Then work your way up from there, adding new scenarios as you uncover new possible issues or the scope of your environment changes.

Testing

Just like database backups aren’t useful if you can’t restore them, a Disaster Recovery Plan is worthless if you can’t implement the plan. You should conduct a formal test at least once each calendar year, testing if the plan will work for one or more of the scenarios you are planning against. The test should be as realistic as possible, and you should have a method of measuring the level of success.

There will be issues, like a system that wasn’t included in the written plan, or a technical issue that you didn’t know existed. It could be something as simple as unknown system passwords or missing software installation keys. But that is what a test is all about. You have to test to find those little things that were forgotten or unknown, and then update the written plan to make sure they aren’t an issue during the next test. Eventually you will have everything you need addressed in the plan, and the next test will go smoothly. That means in the event of an actual disaster, when you are confused and under an elevated level of stress, you are more likely to get these core production systems up and running quickly.

If the most likely disaster in your environment is hardware failure, then that should definitely be something you evaluate and test at least once per year. Call your vendors and ask them to verify your service level agreement (SLA) to make sure your expectations match their support agreements. You should also disperse your hardware spare pool to a second location.

If you are at great risk of a tornado or hurricane, then you have to analyze how well you have protected your environment from the negative impacts of severe weather. Look at the backup power supply fuel sources and verify the methods of dealing with rising floodwaters.

You should be testing those backup systems, verifying your backup tapes, testing your ability to replace a physical server or network switch, and reviewing the plan so that you know the process is adequately documented.
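
One small part of verifying backups can be automated: confirming that a backup file still matches the checksum recorded when it was written. The sketch below uses a temporary file as a stand-in for a real backup, and the file name and contents are hypothetical. A matching checksum only shows the file is unchanged; it does not replace an actual restore test.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large backups do not have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_is_intact(path: Path, recorded_checksum: str) -> bool:
    """Compare the file's current checksum with the one recorded at backup time."""
    return sha256_of(path) == recorded_checksum


# Demonstration with a throwaway file standing in for a real backup.
with tempfile.TemporaryDirectory() as tmp:
    backup = Path(tmp) / "nightly.bak"  # hypothetical backup file name
    backup.write_bytes(b"example backup contents")
    recorded = sha256_of(backup)        # stored alongside the backup

    intact_before = backup_is_intact(backup, recorded)

    backup.write_bytes(b"example backup c0ntents")  # simulate corruption
    intact_after = backup_is_intact(backup, recorded)

print("Intact before corruption:", intact_before)
print("Intact after corruption:", intact_after)
```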

Realistic Disaster Recovery Planning

As most people in Information Technology know, you have to write plans for incidents, also known as disasters, and for how systems will recover from those incidents. Most people plan for the common events:

  • Fire
  • Power Outage
  • Building Issues (Roof Collapse, Water Outage, Police Evacuations, etc.)
  • Earthquake Damage
  • Tornado and Storm Damage
  • Data Outage (Construction Damage, Equipment Outage, etc.)
  • Area Flooding
  • Hurricane

These plans usually specify which systems must be recovered and in what order, along with location-specific and system-specific information such as the order of steps, procedures to follow, locations of backup tapes, and scripts to execute.
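
The recovery order can be derived from system dependencies instead of being maintained by hand. The following is a minimal sketch using Python's standard `graphlib` module (Python 3.9 or later); the system names and their dependencies are invented for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical systems. Each key maps to the systems that must be
# running before it can be recovered.
depends_on = {
    "network": set(),
    "storage": {"network"},
    "database": {"storage"},
    "application": {"database"},
    "email": {"network"},
}

# static_order() yields every system after all of its prerequisites.
recovery_order = list(TopologicalSorter(depends_on).static_order())

for step, system in enumerate(recovery_order, start=1):
    print(f"{step}. {system}")
```

Several valid orders may exist, for example email can come back at any point after the network, so the written plan should still state which of the independent systems the business wants first.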

Does your written plan include instructions and procedures for people? What if the incident involves a building collapse that kills half of your technology team? Will the remaining people have the knowledge and skills to perform a full system recovery at an off-site facility? Will they want to perform and support that recovery while potentially grieving the loss of coworkers and friends just minutes or hours earlier?

What happens if a hurricane is heading directly for your data center? You have a plan that says everyone goes to an off-site location and recovers the business, but what about families, friends, pets, etc.? Are the people you have listed as essential personnel going to be willing to drop everything, travel to that off-site location, and focus on recovering business systems without knowing if their friends and family are safe?

When developing your incident recovery plan, make sure you keep your schedule and planning grounded in reality and don’t forget the plan depends on people doing specific activities during some of the most traumatic times of their lives.
