Effective Disaster Recovery Planning

In your business, you might be the only one tasked with understanding what types of disasters can strike your business and assigned the responsibility of planning to prevent those disasters from bringing down the business. As Alan Lakein said many years ago, “Failure to plan is planning to fail”. As an information technology professional, one of your many tasks is to understand the risks to your business systems and plan to prevent or overcome those risks from impacting your business.

About 40% of businesses do not re-open after a disaster and another 25% fail within one year according to the Federal Emergency Management Agency (FEMA). Similar statistics from the United States Small Business Administration indicate that over 90% of businesses fail within two years after a disaster.

Understand The Risk

Do you fully understand the risks to your business? Have you looked at the systems your business uses and depends on each day and thought about what would happen if those systems were unavailable? Have you thought about the common risks for the area? These risks could include tornadoes, earthquakes, hurricanes, floods, etc.

Maybe there are man-made risks unique to your location, like frequent power outages, dangerous break-ins, poor building construction, etc. Each of these unique threats can be just as dangerous as natural disasters. You don’t want someone stealing your servers or hard drives in the middle of the night, or cracks in the walls leading to mice chewing through your network or power cables.

Written Plan

You need to think about each of the risk scenarios, and write down your plan for how you and your team would address each scenario to keep the business up and running with minimal downtime. You may have to adjust the plan to address concerns about cost and time, and there may be periodic changes as systems and risks change.

  1. List of Employees (what they do, when they do it, why they do it, etc.)
  2. Inventory Systems (office equipment, servers, laptops, etc.)
  3. Office Space Requirements (you will need space to restore your systems, but can everything be done remotely, or will the users need office space to access restored systems)
  4. Insurance and Budget Concerns (who will provide money during an actual recovery)
  5. Share The Plan (make sure you aren’t the only one with a copy of the plan, and make sure the plan can survive the disaster)

Testing

Just like database backups aren’t useful if you can’t restore them, a Disaster Recovery Plan is worthless if you can’t implement the plan. You should conduct a formal test at least once each calendar year, testing if the plan will work for one or more of the scenarios you are planning against. The test should be as realistic as possible, and you should have a method of measuring the level of success.

There will be issues, like a system that wasn’t included in the written plan or a technical issue that you didn’t know existed. An issue could be something as simple as unknown system passwords or a missing software installation key. But that is what a test is all about. You have to test to find those little things that were forgotten or unknown, and then update the written plan to make sure they aren’t an issue during the next test. Eventually you will have everything you need addressed in the plan, and the next test will go smoothly. That means in the event of an actual disaster, when your team is confused and under an elevated level of stress, you are more likely to get these core production systems up and running quickly.

Don’t allow your business to fail because of an interruption you could have resolved with the proper planning and some simple testing.

Effective Disaster Recovery Planning

In your business, you are probably the only one tasked with understanding what types of disasters can strike your business and with planning to prevent those disasters from bringing down the business. As Alan Lakein said many years ago, “Failure to plan is planning to fail”. As an information technology professional, one of your many tasks is to understand the risks to your business systems and plan to prevent or overcome those risks from impacting your business.

About 40% of businesses do not re-open after a disaster and another 25% fail within one year according to the Federal Emergency Management Agency (FEMA). Similar statistics from the United States Small Business Administration indicate that over 90% of businesses fail within two years after a disaster.

Understand The Risk

Do you even understand the risks to your business? Have you looked at the systems your business uses and depends on each day and thought about what would happen if those systems were unavailable? Have you thought about the common risks for the area, including tornadoes, earthquakes, hurricanes, floods, etc.?

Maybe there are risks unique to your location, like frequent power outages, danger of break-ins, poor building construction, etc. Each of these unique threats can be just as dangerous as natural disasters. You don’t want someone stealing your servers or hard drives in the middle of the night, or cracks in the walls leading to mice chewing through your network or power cables.

Written Plan

You need to think about each of the risk scenarios, and write down your plan for how you and your team would address those scenarios to keep the business up and running with minimal downtime. You may have to adjust the plan to address concerns about cost and time, and there may be periodic changes as systems and risks change.

  1. List of Employees (what they do, when they do it, why they do it, etc.)
  2. Inventory Systems (office equipment, servers, laptops, etc.)
  3. Office Space Requirements (can everything be done remotely, or will the users need office space to access restored systems)
  4. Insurance and Budget Concerns (who will provide money during an actual recovery)
  5. Share The Plan (make sure you aren’t the only one with a copy of the plan, and make sure the plan can survive the disaster)

Testing

Just like database backups aren’t useful if you can’t restore them, a Disaster Recovery Plan is worthless if you can’t implement the plan. You should conduct a formal test at least once each calendar year, testing if the plan will work for one or more of the scenarios you are planning against. The test should be as realistic as possible, and you should have a method of measuring the level of success.

There will be issues, ranging from a system that wasn’t included in the written plan or a technical issue that you didn’t know existed to something as simple as unknown system passwords or missing software installation keys. But that is what a test is all about. You have to test to find those little things that were forgotten or unknown, and then update the written plan to make sure they aren’t an issue during the next test. Eventually you will have everything you need addressed in the plan, and the next test will go smoothly. That means in the event of an actual disaster, when you are confused and under an elevated level of stress, you are more likely to get these core production systems up and running quickly.

Business Continuity Planning with Cloud Services

Business Continuity Planning is the process of creating systems of prevention and recovery to deal with potential threats to a company. Doing that has historically meant building a solution specific to your company, at possibly a great expense. With recent improvements to cloud services, it may be possible to reduce the cost of the required services with the proper planning.

With statistics from FEMA telling us that 40 percent of all small businesses never reopen after a disaster, you need to ask yourself how you can help protect the business in the event of a natural or man-made disaster. It is also estimated that 3 out of 4 small to mid-sized businesses don’t have a written disaster recovery plan, and most never purchase disaster insurance.

Most small businesses don’t have insurance to protect them financially. They don’t even have a written plan for dealing with disasters step by step, not even one whose only step is “call the insurance agent”. Those businesses probably don’t think they can afford to insure against something that will probably never happen, or even to write a plan for what to do if it does happen.

I have written on this subject before, but you need to have a written plan. Sit down with your team and discuss the various scenarios that could happen (fire, flood, earthquake, hurricane, tornado, tsunami, landslide, industrial accident, power failure, terrorist attack, etc.), and write down how you would deal with each issue and how it could impact the company. The idea is to understand the strengths and weaknesses of your current infrastructure and begin to plan and train for the day something horrible happens. If you have a written plan that helps your company get through a disaster with minimal impact, you will continue to have a job. If you don’t have a written plan, your company will probably fail to recover from the impact even a minor disaster could have on the company revenue stream.

Disaster Example

Scenario

There is an electrical fire in your data center at 3 am. The automated suppression system turns off power and handles the flames while the fire department is dispatched to investigate. The local fire chief identifies an electrical issue by 9 am and orders the facility closed for 3 days, starting immediately. Everything must be powered off for those 3 days. You have just lost your data center for at least 72 hours. What do you do now?

No Plan

What would happen if you didn’t have a plan? By the time you got the entire team together to formulate a plan, you would be 2-4 hours into a network outage. If your business relies on employees and customers reaching those servers to generate revenue, they can’t do anything until you determine a solution and get the pieces into place. How long would it take to rebuild the network infrastructure required to support your business? Redirecting internet traffic, configuring network security, and helping employees and customers get connected could literally take days.

Disaster Plan

You had talked about this possibility several months ago and rented space in a failover facility several miles away. The data has been replicating to secondary servers for many months, and you have even tested the failover process a few times. When the team was notified of the fire, the team automatically failed the entire server system over to the failover site by 3:30 am. The outage was measured in minutes, and the employees and customers didn’t even notice it.

Do you see what impact a plan has on such a small disaster? The cost of this solution could be as much as $5,000 a month, but the savings during a disaster could be measured in millions.
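To put rough numbers on that trade-off, here is a minimal Python sketch. Only the $5,000 monthly fee and the 72-hour closure come from the scenario above; the revenue-per-hour figure and the function names are assumptions made up for illustration, so substitute your own numbers.

```python
# Rough cost comparison for a standby site versus an unplanned outage.
# The revenue-per-hour value used in the example is hypothetical.

def annual_standby_cost(monthly_fee: float) -> float:
    """Yearly cost of keeping the failover site available."""
    return monthly_fee * 12


def outage_loss(hours_down: float, revenue_per_hour: float) -> float:
    """Revenue lost while systems are unavailable (ignores reputation damage)."""
    return hours_down * revenue_per_hour


def break_even_outage_hours(monthly_fee: float, revenue_per_hour: float) -> float:
    """Hours of avoided downtime per year that would pay for the standby site."""
    return annual_standby_cost(monthly_fee) / revenue_per_hour


if __name__ == "__main__":
    fee = 5_000                 # from the scenario: up to $5,000 a month
    revenue_per_hour = 20_000   # assumption: replace with your own figure
    print("Standby site per year:", annual_standby_cost(fee))
    print("Loss from a 72-hour closure:", outage_loss(72, revenue_per_hour))
    print("Break-even hours:", break_even_outage_hours(fee, revenue_per_hour))
```

Under these assumed figures the standby site pays for itself if it prevents only a few hours of downtime in a year. The real answer depends entirely on how much revenue an hour of downtime costs your business.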

Cloud Services

The value of cloud services to provide cheaper and faster solutions to disaster recovery and business continuity is something you need to investigate. If you could push your mission critical applications into a cloud solution, like Amazon or Microsoft hosting, you might continue your business even if your datacenter was destroyed in a major fire.

What do you do to determine if you can solve your issue with cloud services?

  1. Identify mission-critical applications and data by performing a risk assessment. A risk assessment will tell you what types of incidents (natural or man-made) are most dangerous to your environment. You should also perform an asset inventory, making sure you understand what your company owns and what it is all worth. This simple analysis will allow the business to calculate the potential impact of most likely threats and prioritize their response accordingly.
  2. Determine when operations should resume by measuring success against established Recovery Time Objectives (RTO) and Recovery Point Objectives (RPO). This can be analyzed through stakeholder interviews and frequent testing to understand what is possible while you verify the plan provides flexibility in the disaster recovery solution the business selects. You don’t want to build a solution that allows for complete failover in 4 hours if the business stakeholders measure failure in minutes.
  3. Identify a backup worksite in the event the business becomes unsafe. Just like our example above, a simple disaster could cause major issues. I know of one company that arranged to meet at the local library for the three days it took to finalize a lease on temporary office space. You need to have some type of plan and be able to lead the other employees to keep them safe and productive during an unscheduled office closure.
  4. Design and publish a written business continuity plan (BCP) or disaster recovery plan (DRP), making sure it is accessible from anywhere by everyone. It is great having a document on the network, but in a true disaster will the people who need to see it even be capable of opening the file?
  5. Make sure the employees who need to know about the plan are familiar with it, and know who to talk to if they have questions or concerns. When disaster strikes, a business with a written plan and clear communication will quickly cut through the chaos and get back on track.
  6. Businesses should review the plan quarterly, and put the plan to an actual test at least once per calendar year. With the maturity of cloud solutions, you also have to take into account migrating and living with a more diverse private/public cloud portfolio. Ensure you have a go-to plan for the future to expand into new things like high availability, and archiving within clouds.
  7. Regularly review existing systems and services to verify they are protected by the plan. Consider which systems are good candidates for moving from local servers to cloud-based systems.
  8. Discuss any proposed system to make sure cloud services are properly considered. While not all systems or services are ideal candidates for cloud-based solutions, many are perfect for moving into the cloud.
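
As a rough illustration of item 2, the sketch below checks the result of a failover test against RTO and RPO targets. The class and function names, the 4-hour RTO, the 15-minute RPO, and the timestamps are all hypothetical values chosen for the example; they are not taken from any real plan.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class RecoveryObjectives:
    rto: timedelta  # Recovery Time Objective: longest acceptable outage
    rpo: timedelta  # Recovery Point Objective: largest acceptable window of lost data


def evaluate_test(objectives, outage_start, service_restored, last_good_copy):
    """Compare one recovery test (or real incident) against the objectives."""
    downtime = service_restored - outage_start
    data_loss_window = outage_start - last_good_copy
    return {
        "downtime": downtime,
        "data_loss_window": data_loss_window,
        "met_rto": downtime <= objectives.rto,
        "met_rpo": data_loss_window <= objectives.rpo,
    }


if __name__ == "__main__":
    # Hypothetical targets agreed with the stakeholders.
    targets = RecoveryObjectives(rto=timedelta(hours=4), rpo=timedelta(minutes=15))
    result = evaluate_test(
        targets,
        outage_start=datetime(2024, 1, 15, 3, 0),      # fire at 3 am
        service_restored=datetime(2024, 1, 15, 3, 30),  # failover done by 3:30 am
        last_good_copy=datetime(2024, 1, 15, 2, 55),    # assumed last replication
    )
    print(result)
```

Recording these two measurements for every test gives the stakeholders evidence, rather than an opinion, about whether the solution they are paying for meets the targets they asked for.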

Cloud-based services won’t prevent a disaster, and they aren’t even 100% free from outages themselves. What they can do is provide inexpensive failover services that are too expensive or too complicated for a small business to attempt on their own.

Are you ready for a SAN failure?

Disaster Recovery

Have you planned for a failure of your SAN today or next week? Maybe you can’t prevent a SAN failure, but you might be able to prevent a failure from ruining your week. You can never fully prevent a hardware failure 100% of the time, even in a SAN, even when your hardware vendor says it can never happen. So if you accept it can happen, that it might happen soon, then you can make plans to minimize the impact.

Your SAN actually has a lot of complicated components and plenty of moving parts that can fail, just like any technology. Just like pilots practice system failure in a simulator so that they can react properly under the real stress of a major system failure during a real flight, you want your IT team to practice their response to a major system failure under simulated situations and not during a real failure. You should document how you should react to various scenarios, practice using that written documentation to verify it is complete and accurate for each documented scenario, and update the documentation on a regular basis as technology and business practices change.

Consolidate and Save

Vendors have told you for years to consolidate your server environment, and you have finally listened: you now save money on hardware, licensing, cooling, power, etc. by using virtual servers and a SAN for storage. Now you have put all your eggs (very expensive eggs) into one basket. Do you treat that basket with the level of importance it deserves, knowing that all your data, all your emails, all your applications, everything important to your business is running on this small set of complicated components?

Complacency is the killer of a great incident recovery plan.

Complacency is a serious problem in IT environments across the globe. While no one is likely to die if a disaster strikes your server room, when disaster strikes people expect smart people to have a smart solution to the problem. When you are in the planning meetings, are you the one arguing that you can buy enough redundant parts and pieces to eliminate failure, or are you the one who accepts you can’t buy your way out of a potential system failure and insists on planning for an inevitable system failure?

But We Never Had A Failure Before

It can be difficult being the person that fights for the proper written plan for scenarios that everyone is telling you can’t possibly happen. It will be hard getting the team to test a plan when they insist those components can’t possibly fail, that one broken part can’t cause the system to stop responding, that the vendor won’t allow the system to freeze during a system update, etc.

When was the last time you had a system failure that caused a major disaster? Most people haven’t had to deal with a true disaster, and will often ask why we spend so much time and money planning for something that has never happened and probably will never happen. I’ve had people ask me why I even worry about some of the scenarios I propose during my technical team meetings if the chances of those disasters are so remote and the cost of planning for those scenarios is so high. I usually answer that understanding what could happen is the first step toward understanding how much it could cost to prevent the issue. If you believe that the chance of a hard drive failure is remote, you might not plan for the failure, and so you save money by not getting the level of support required for fast drive replacement, mounting hot spares, keeping cold spares on hand, etc. If you understand the chance of a hard drive failure is actually quite high, you might make different decisions. I’ll do what research I can to guide my team to understanding the different risk scenarios and allow the team to judge each risk and spend our money on the items of greatest risk.
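
One way to test the "a drive failure is remote" assumption is a quick probability estimate. The sketch below is a simplification: it assumes every drive fails independently with the same annual failure rate, and the 2% rate and 24-drive count are hypothetical numbers chosen only for illustration. Use the figures your vendor or your own failure history gives you.

```python
def chance_of_any_failure(drive_count: int, annual_failure_rate: float) -> float:
    """Probability that at least one of `drive_count` drives fails within a year.

    Assumes independent failures with an identical annual failure rate,
    which real arrays only approximate (shared batches, heat, and power
    events make failures cluster).
    """
    if drive_count < 0:
        raise ValueError("drive_count cannot be negative")
    if not 0.0 <= annual_failure_rate <= 1.0:
        raise ValueError("annual_failure_rate must be between 0 and 1")
    return 1.0 - (1.0 - annual_failure_rate) ** drive_count


if __name__ == "__main__":
    # Hypothetical example: 24 drives, each with a 2% chance of failing per year.
    probability = chance_of_any_failure(24, 0.02)
    print(f"Chance of at least one drive failure this year: {probability:.0%}")
```

A failure that looks remote for one drive becomes a likely yearly event once you multiply it across a shelf of drives, which is the kind of number that helps a team decide whether hot spares and fast replacement contracts are worth the money.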

Some might say this is incredibly easy. All you have to do is get cynical, look for faults, and imagine what could possibly go wrong. You can either spend money preventing a disaster you might never actually see, or you might save a few dollars today and wish you had spent a lot more money in hindsight. Hindsight is cheap because anyone can look back and tell you what you could and should have done differently to better prepare for the disaster… after the disaster has already happened.

Plan for the very worst, and hope for the very best.

Redundancy Is Your Friend

If your environment diagrams look like a whole bunch of production systems or databases sitting on top of a carefully sculpted set of expensive single points of failure – you might be doing something incredibly wrong. Putting everything on an expensive SAN won’t mean you have a system that can’t fail.

So here is where you and your entire IT organization need to come together and plan the best shape of your environment. You will end up taking known risks and making decisions based on risk and reward, but you will feel a lot better having that conversation and making informed decisions. You may not get to the level of fault tolerance that you want or that is ideal for your paranoia, but at least you presented the risks and rewards and helped define the cost and benefit analysis up front. The point is, if you are the database administrator and you see a lot of room for failures, you need to call those out and give suggested solutions, understand the business needs, discuss the pros and cons, and help architect the right solution for you and your workload.

The point is you need to understand the risk, configure your environment to minimize risk, and destroy all single points of failure.

A Realistic Approach To High Availability and Disaster Recovery

When designing a new system, or even reviewing an existing system, we are asked about expected uptime. Business users often have unrealistic expectations that drive our responses, one of which is around expected uptime and acceptable downtime. While it is acceptable to target 99.999% uptime (five nines), you have a responsibility to target the uptime your business requires by helping determine what your business really needs.
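
It helps to translate an uptime percentage into the downtime it actually allows. The short Python sketch below does that arithmetic for a 365-day year; the function name is just an illustrative choice.

```python
# Minutes in a 365-day year (leap years are ignored for simplicity).
MINUTES_PER_YEAR = 365 * 24 * 60


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for an uptime target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99, 99.999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% uptime allows about {minutes:.1f} minutes of downtime per year")
```

Five nines leaves a little over five minutes of downtime for the whole year, while 99.9% leaves almost nine hours. Showing stakeholders that difference next to the cost of each option usually makes the conversation about requirements much more concrete.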

Here are some guidelines on setting realistic goals for your High Availability (HA) and Disaster Recovery (DR) solutions:

  1. Business Impact – Begin your analysis with a business impact analysis. A business impact analysis will answer the questions around what actual impact a failed process will have on your business. The business stakeholders will be the people who can properly answer those questions, because they will have more information on the direct impact to sales or services an outage will cause. Using facts is the best way to document expected cost of outages, and that will drive your response to prevent those outages. If it is merely inconvenient when a system is down, it should be a lower priority (and therefore less money spent to keep the system online) than a system that will kill the business if it fails. The higher revenue generating items should naturally move to the top of the priority list.
  2. Define Uptime – After identifying the different business processes and systems that your company operations depend on every day, define what specific uptime requirements exist for those priority systems. This is driven by operating hours and revenue: whether the business systems are only used during normal local business hours, 24×7 in North America, or 24×7 internationally will drive different expectations. It is unrealistic to expect that a system will be available 24x7x365 without any allowance for maintenance. If that is a requirement you need to design a fully functional failover system, which will increase cost and complexity. Once your stakeholders see that cost they may scale their requirements back to a more realistic (and less expensive) 20x7x365, or some other less expensive and less complex solution. The key is to include the stakeholders in the entire decision, not just ask them what they want. Offer solutions and the associated cost models and allow them to decide what they can afford. The difference between what they asked for and what they are willing to finance is usually measured in stacks of cash.
  3. Service Level Agreements – You now understand the needs of the business and how to design a system to meet those needs, but can you meet those needs? You should document the service level agreement (SLA) that outlines the expectations around measuring uptime, failover procedures, system recovery windows, and corporate disaster recovery requirements. The definition of service level agreements should include what’s required during normal business hours and outside of business hours. How are you going to measure your success in meeting the business requirements without a disaster? The stakeholders have asked for specific system uptimes, but have you met the expectation 3 or even 6 months into the system usage? Have you tested the system failover or disaster recovery process to verify the system you designed, implemented and paid for is actually configured to do the things you told the stakeholders it would do for them?
  4. Disaster Recovery – You should perform a disaster recovery test at least once per calendar year, or more frequently if it is requested by the business system stakeholders. That should be pre-defined in the SLA, and evidence provided to prove you meet the expectations outlined in the SLA around downtime, failover expectations, recovery windows, etc.
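
For the measurement question raised under Service Level Agreements, a simple starting point is to compute achieved availability from your outage records and compare it to the SLA target. This is a minimal sketch: the function names, the 90-day window, the two outages, and the 99.9% target are all assumptions for the example, and a real SLA may exclude planned maintenance windows.

```python
from __future__ import annotations

from datetime import timedelta


def achieved_availability(period: timedelta, outages: list[timedelta]) -> float:
    """Percentage of the period during which the system was up."""
    if period <= timedelta(0):
        raise ValueError("period must be positive")
    downtime = sum(outages, timedelta(0))
    return 100 * (1 - downtime / period)


def meets_sla(period: timedelta, outages: list[timedelta], target_percent: float) -> bool:
    """True when measured availability is at or above the SLA target."""
    return achieved_availability(period, outages) >= target_percent


if __name__ == "__main__":
    # Hypothetical quarter: two outages of 45 and 20 minutes in 90 days.
    quarter = timedelta(days=90)
    recorded = [timedelta(minutes=45), timedelta(minutes=20)]
    print(f"Achieved: {achieved_availability(quarter, recorded):.3f}%")
    print("Meets 99.9% target:", meets_sla(quarter, recorded, 99.9))
    print("Meets 99.99% target:", meets_sla(quarter, recorded, 99.99))
```

Running a check like this every quarter gives you the evidence the SLA calls for, and it shows early whether the design is drifting away from what the stakeholders were promised.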

You must effectively communicate with business stakeholders to determine your specific business requirements, without using default (and often unrealistic) business continuity requirements. This should allow you to provide the uptime and failover services requested by the stakeholders while you provide cost-effective solutions that are measurable, robust, and repeatable.

Incident Recovery Planning

In your business, you are probably the only one tasked with understanding what types of disasters can strike your business and with planning to prevent those disasters from bringing down the business. As Alan Lakein said many years ago, “Failure to plan is planning to fail”. As an information technology professional, one of your many tasks is to understand the risks to your business systems and plan to prevent or overcome those risks from impacting your business.

About 40% of businesses do not re-open after a disaster and another 25% fail within one year according to the Federal Emergency Management Agency (FEMA). Similar statistics from the United States Small Business Administration indicate that over 90% of businesses fail within two years after a disaster.

History

Modern technology incident recovery planning was created in the mid 1970s because organizations started to build and use computer systems. In those days the systems were large mainframes, and they were fairly easy to document and to replicate for testing. By late 1978, Sun Information Systems (later renamed Sungard Availability Services) had been created in Philadelphia as the first commercial hot site vendor in the US.

While the market for companies that help businesses with disaster recovery planning grew through the 1980s, the growth of the internet caused many more companies to look at a robust solution to disaster planning. With the recent growth of cloud computing, it doesn’t matter as much where systems are located. It only matters that the systems are secure, stable, and reliable.

Understand The Risk

Do you understand the risks to your business? Have you looked at the systems your business uses and depends on each day and thought about what would happen if those systems were unavailable? Have you thought about the common risks for your area (tornadoes, earthquakes, hurricanes, blizzards, floods, wildfires, volcanic eruptions, etc.) and considered how you would deal with these issues?

Maybe there are risks unique to your location, like frequent power outages, danger of break-ins, poor building construction, etc. Each of these unique threats can be just as dangerous as natural disasters.

Written Plan

You need to think about each of the risk scenarios, and write down your plan for how you and your team would address those scenarios to keep the business up and running with minimal downtime. You may have to adjust the plan to address concerns about cost and time, and there may be periodic changes as systems and risks change.

  1. List of Employees (what they do, when they do it, why they do it, etc.)
  2. Inventory Systems (office equipment, servers, laptops, etc.)
  3. Office Space Requirements (can everything be done remotely, or will the users need office space to access restored systems)
  4. Insurance and Budget Concerns (who will provide money during an actual recovery)
  5. Share The Plan (make sure you aren’t the only one with a copy of the plan, and the plan can survive the incident)

This written plan is a “living document”; it will change as often as your business changes. The idea is to keep the business running even if everything stops working. You have to look at everything important to your company, and determine how you would keep it working if there were a catastrophic failure of one or more systems that are important to your company. You don’t want to write this plan by yourself, as everyone in the business has a stake in keeping the business operational.

What would you do if your data center was struck by a tornado, hurricane, or earthquake? Would those systems be protected from damage? What if there was a major failure of your systems, the power infrastructure, the telecommunications network, etc.? Do you have adequate data and system backups? How long would those systems be down before you could purchase new hardware, configure the new hardware for your network, restore your data from backups, test the system integration, and implement those new systems?

You have to think of the range of disasters, from the failure of a single piece of hardware to a massive failure because of flooding or other major natural disasters. What will your response be to a data breach? Do you have any contracts or agreements that will allow you to borrow or rent any required hardware or software that will get you through the first 30 days of a disaster? Do you expect to download your backups or installation media from the internet? What if there isn’t any internet access, your backup site is down, or the access is too slow to be useful?

Begin small by making a plan that addresses the most likely disasters. Then work your way up from there, adding new scenarios as you uncover new possible issues or the scope of your environment changes.
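
One simple way to decide which scenarios to plan for first is a likelihood-times-impact score. The sketch below is only an illustration: the 1 to 5 scales, the example risks, and the ratings assigned to them are hypothetical, and your team should supply its own judgments.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    name: str
    likelihood: int  # 1 (rare) to 5 (almost certain); a team judgment call
    impact: int      # 1 (nuisance) to 5 (business-ending)

    @property
    def score(self) -> int:
        # Higher scores deserve attention in the plan sooner.
        return self.likelihood * self.impact


def prioritize(risks):
    """Return the risks ordered from highest to lowest score."""
    return sorted(risks, key=lambda risk: risk.score, reverse=True)


if __name__ == "__main__":
    # Hypothetical ratings for illustration only.
    register = [
        Risk("Flood", likelihood=1, impact=5),
        Risk("Hardware failure", likelihood=4, impact=3),
        Risk("Extended power outage", likelihood=3, impact=3),
        Risk("Data breach", likelihood=2, impact=5),
    ]
    for risk in prioritize(register):
        print(f"{risk.score:>2}  {risk.name}")
```

Start by writing the plan for the top one or two entries, then revisit the ratings whenever the environment changes.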

Testing

Just like database backups aren’t useful if you can’t restore them, a Disaster Recovery Plan is worthless if you can’t implement the plan. You should conduct a formal test at least once each calendar year, testing if the plan will work for one or more of the scenarios you are planning against. The test should be as realistic as possible, and you should have a method of measuring the level of success.

There will be issues, like a system that wasn’t included in the written plan, or a technical issue that you didn’t know existed. It could be something as simple as unknown system passwords or missing software installation keys. But that is what a test is all about. You have to test to find those little things that were forgotten or unknown, and then update the written plan to make sure they aren’t an issue during the next test. Eventually you will have everything you need addressed in the plan, and the next test will go smoothly. That means in the event of an actual disaster, when you are confused and under an elevated level of stress, you are more likely to get these core production systems up and running quickly.

If the most likely disaster in your environment is hardware failure, then that should definitely be something you evaluate and test at least once per year. Call your vendors and ask them to verify your service level agreement (SLA) to make sure your expectations match their support agreements. You should also disperse your hardware spare pool to a second location.

If you are at great risk of tornado or hurricane, then you have to analyze how well you have protected your environment from the negative impacts of severe weather. Look at the backup power supply fuel sources and verify the methods of dealing with rising flood waters.

You should be testing those backup systems, verifying your backup tapes, testing your ability to replace a physical server or network switch, and reviewing the plan so that you know the process is adequately documented.
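
As one small piece of backup verification, you can confirm that a backup file still matches the checksum recorded when it was written. The sketch below assumes you stored a SHA-256 digest at backup time; the function names are illustrative. A matching checksum only shows the file is intact, so it complements a real restore test rather than replacing one.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large backup files do not fill memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_backup(path: str, expected_digest: str) -> bool:
    """Return True when the file's digest matches the one recorded at backup time."""
    return sha256_of_file(path) == expected_digest.strip().lower()
```

A check like this can run on a schedule against both the primary and the off-site copy, with any mismatch treated as a finding to record and correct in the written plan.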
