Effective Disaster Recovery Planning

Server Stack - @SeniorDBA

In your business, you might be the only one tasked with understanding what types of disasters can strike and with planning to keep those disasters from bringing down the business. As Alan Lakein said many years ago, “Failure to plan is planning to fail.” As an information technology professional, one of your many tasks is to understand the risks to your business systems and plan to prevent those risks from impacting your business, or to recover when they do.

About 40% of businesses do not re-open after a disaster and another 25% fail within one year according to the Federal Emergency Management Agency (FEMA). Similar statistics from the United States Small Business Administration indicate that over 90% of businesses fail within two years after a disaster.

Understand The Risk

Do you fully understand the risks to your business? Have you looked at the systems your business uses and depends on each day and thought about what would happen if those systems were unavailable? Have you thought about the common risks for the area? These risks could include tornadoes, earthquakes, hurricanes, floods, etc.

Maybe there are man-made risks unique to your location, like frequent power outages, dangerous break-ins, poor building construction, etc. Each of these unique threats can be just as dangerous as natural disasters. You don’t want someone stealing your servers or hard drives in the middle of the night, or cracks in the walls leading to mice chewing through your network or power cables.

Written Plan

You need to think about each of the risk scenarios, and write down your plan for how you and your team would address each scenario to keep the business up and running with minimal downtime. You may have to adjust the plan to address concerns about cost and time, and there may be periodic changes as systems and risks change. The plan should cover at least the following:

  1. List of Employees (what they do, when they do it, why they do it, etc.)
  2. Inventory Systems (office equipment, servers, laptops, etc.)
  3. Office Space Requirements (you will need space to restore your systems, but can everything be done remotely, or will the users need office space to access restored systems)
  4. Insurance and Budget Concerns (who will provide money during an actual recovery)
  5. Share The Plan (make sure you aren’t the only one with a copy of the plan, and make sure the plan can survive the disaster)

Testing

Just like database backups aren’t useful if you can’t restore them, a Disaster Recovery Plan is worthless if you can’t implement the plan. You should conduct a formal test at least once each calendar year, testing if the plan will work for one or more of the scenarios you are planning against. The test should be as realistic as possible, and you should make sure you have a method of measuring the level of success.

There will be issues, like a system that wasn’t included in the written plan or a technical issue that you didn’t know existed. An issue could be something as simple as unknown system passwords or a missing software installation key. But that is what a test is all about. You have to test to find those little things that were forgotten or unknown, and then update the written plan to make sure they aren’t an issue during the next test. Eventually you will have everything you need addressed in the plan, and the next test will go smoothly. That means in the event of an actual disaster, when your team is confused and under an elevated level of stress, you are more likely to get these core production systems up and running quickly.
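One way to make the “level of success” measurable is to record, for each system in the test, whether it was recovered and how long it took, then summarize the results. The sketch below is a minimal illustration in Python; the `DrillResult` fields, the system names, and the times are hypothetical and not part of any real plan.

```python
from dataclasses import dataclass


@dataclass
class DrillResult:
    """Outcome for one system during a recovery test (hypothetical fields)."""
    system: str
    recovered: bool
    minutes_to_recover: float
    issue: str = ""  # e.g. "unknown system password"


def summarize(results):
    """Return the recovered fraction and the issues to fold back into the plan."""
    if not results:
        raise ValueError("no results recorded")
    recovered = sum(1 for r in results if r.recovered)
    issues = [f"{r.system}: {r.issue}" for r in results if r.issue]
    return recovered / len(results), issues


# Made-up results from a single test, for illustration only.
results = [
    DrillResult("database server", True, 45),
    DrillResult("file server", True, 30, "missing software installation key"),
    DrillResult("phone system", False, 0, "not included in the written plan"),
]
rate, issues = summarize(results)
print(f"recovered {rate:.0%} of systems")
for item in issues:
    print("update plan:", item)
```

Comparing the recovered fraction and the issue list from one yearly test to the next gives a rough indication of whether the written plan is actually improving.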

Don’t allow your business to fail because of an interruption you could have resolved with the proper planning and some simple testing.

Physical Security and SQL Server

Make sure your SQL Server database host is located in a locked room with controlled access, your host server uses redundant power, and the data center has fire protection systems unique to a computer environment. Don’t just assume that this is the case at your organization. Verify these requirements are in place yourself or confirm it in writing with your infrastructure administrator responsible for your database server.

Also take a look at the hardware strategy involving the database server. Security is about protecting your systems and databases from hackers, but also about the availability of the database. You should verify the system is using RAID or some other disk-mirroring solution. Also investigate the disaster recovery plan and determine what would happen if your database server crashed. What is the plan for getting your system recovered and available to the users? If you aren’t involved in the planning and documentation of the Incident Recovery Plan, talk to your supervisor and get involved today.

One line of “Bad Code” to Destroy Entire Company

If you have been in the technology field for more than a few weeks, you know that people make mistakes. Someone with administrator-level access can make big mistakes. One wrong line of code can make a program behave in strange and unexpected ways, and that program might also do some serious damage if the user account running it has elevated privileges.

There is a story that details how a system administrator incorrectly ran a line of code to delete all the files on his server. While the story is sad (and kind of funny at the same time) it can also be a lesson for the IT crowd reading about this now. This person did some basic things that led to this mistake being so very serious:

  • Making untested changes on production servers
  • Storing backups on production servers
  • Not having a written Incident Recovery or Business Continuity plan

This individual has now probably destroyed his business with a simple mistake. He might be able to retain his business partners and rebuild his business from scratch, but most of these businesses will probably not trust him enough to continue to pay him for professional services.

As you reflect on this person’s pain, remember these important lessons:

  • Create a written Incident Recovery Plan
  • Test your Incident Recovery Plan once per calendar year
  • Create frequent backups and store them off-site, separate from any production servers
  • Always use a test environment to create and test changes
  • Only run tested and verified code on production servers
  • Always plan for the worst and hope for the best
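The “only run tested and verified code” lesson can be partly enforced in the code itself. The story does not say exactly what the command was, so the following is an illustration rather than a reconstruction: a hypothetical Python helper that refuses to hand a path to any delete step unless the path is non-empty and sits strictly inside a directory you have explicitly allowed.

```python
from pathlib import Path


def checked_delete_target(raw_path: str, allowed_root: str) -> Path:
    """Validate a path before any delete step uses it (hypothetical helper).

    Raises ValueError for an empty path, for the allowed root itself, and for
    anything outside the allowed root. It never deletes anything.
    """
    if not raw_path or not raw_path.strip():
        raise ValueError("refusing to delete: empty path")
    root = Path(allowed_root).resolve()
    target = Path(raw_path).resolve()
    if target == root or root not in target.parents:
        raise ValueError(f"refusing to delete: {target} is not inside {root}")
    return target
```

A script would pass only the returned path to its real delete step, and only after the same script has been exercised in a test environment.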

Business Continuity Planning with Cloud Services

Business Continuity Planning is the process of creating systems of prevention and recovery to deal with potential threats to a company. Doing that has historically meant building a solution specific to your company, possibly at great expense. With recent improvements to cloud services, it may be possible to reduce the cost of the required services with the proper planning.

With statistics from FEMA telling us that 40 percent of all small businesses never reopen after a disaster, you need to ask yourself how you can help protect the business in the event of a natural or man-made disaster. It is also estimated that 3 out of 4 small to mid-sized businesses don’t have a written disaster recovery plan, and most never purchase disaster insurance.

Most small businesses don’t have insurance to protect them financially. They don’t even have a written step-by-step plan for dealing with disasters, not even one that leaves out the step to “call the insurance agent”. Those businesses probably don’t think they can afford to insure against something that will probably never happen, or even to write a plan for what to do if it does happen.

I have written on this subject before, but you need to have a written plan. Sit down with your team and discuss the various scenarios that could happen (fire, flood, earthquake, hurricane, tornado, tsunami, landslide, industrial accident, power failure, terrorist attack, etc.), write down how you would deal with each issue, and how it could impact the company. The idea is to understand the strengths and weaknesses of your current infrastructure and begin to plan and train for the day something horrible happens. If you have a written plan that helps your company get through a disaster with minimal impact, you will continue to have a job. If you don’t have a written plan, your company will probably fail to recover from the impact even a minor disaster could have on the company revenue stream.

Disaster Example

Scenario

There is an electrical fire in your data center at 3 am. The automated suppression system turns off power and handles the flames while the fire department is dispatched to investigate. The local fire chief identifies an electrical issue by 9 am and orders the facility closed, with everything powered off, for 3 days starting immediately. You have just lost your data center for at least 72 hours. What do you do now?

No Plan

What would happen if you didn’t have a plan? By the time you got the entire team together to formulate one, you would already be 2-4 hours into a network outage. If your business relies on employees and customers reaching those servers to generate revenue, nobody can do anything until you determine a solution and get the pieces into place. How long would it take to rebuild the network infrastructure required to support your business? Redirecting internet traffic, configuring network security, helping employees and customers get connected, and so on could literally take days.

Disaster Plan

You talked about this possibility several months ago and rented space in a failover facility several miles away. The data has been replicating to secondary servers for many months, and you have even tested the failover process a few times. When the team was notified of the fire, the team automatically failed the entire server system over to the failover site by 3:30 am. The outage was measured in minutes, and the employees and customers didn’t even notice it.

Do you see what impact a plan has on even such a small disaster? The cost of this solution could be as much as $5,000 a month, but the savings during a disaster could be measured in millions.
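The trade-off can be put in rough numbers. The sketch below uses the $5,000 monthly figure and the 72-hour closure from the example; the hourly revenue loss is a made-up value chosen only to show the arithmetic, so substitute your own estimate.

```python
def annual_failover_cost(monthly_cost: float) -> float:
    """Yearly cost of keeping the failover site available."""
    return monthly_cost * 12


def outage_loss(hours_down: float, loss_per_hour: float) -> float:
    """Estimated revenue lost during an outage."""
    return hours_down * loss_per_hour


def break_even_hours(monthly_cost: float, loss_per_hour: float) -> float:
    """Hours of avoided downtime per year that pay for the failover site."""
    return annual_failover_cost(monthly_cost) / loss_per_hour


monthly_cost = 5_000    # figure from the example
loss_per_hour = 20_000  # assumption: illustrative value only
print(annual_failover_cost(monthly_cost))             # 60000
print(outage_loss(72, loss_per_hour))                 # 1440000
print(break_even_hours(monthly_cost, loss_per_hour))  # 3.0
```

Under these assumed numbers, avoiding three hours of downtime in a year would cover the cost of the failover site.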

Cloud Services

The value of cloud services to provide cheaper and faster solutions to disaster recovery and business continuity is something you need to investigate. If you could push your mission-critical applications into a cloud solution, like Amazon or Microsoft hosting, you might be able to continue your business even if your data center were destroyed in a major fire.

What do you do to determine if you can solve your issue with cloud services?

  1. Identify mission-critical applications and data by performing a risk assessment. A risk assessment will tell you what types of incidents (natural or man-made) are most dangerous to your environment. You should also perform an asset inventory, making sure you understand what your company owns and what it is all worth. This simple analysis will allow the business to calculate the potential impact of the most likely threats and prioritize its response accordingly.
  2. Determine when operations should resume by measuring success against established Recovery Time Objectives (RTO) and Recovery Point Objectives (RPO). These can be established through stakeholder interviews and frequent testing, which show what is possible while you verify that the disaster recovery solution the business selects is flexible enough. You don’t want to build a solution that allows for complete failover in 4 hours if the business stakeholders measure failure in minutes.
  3. Identify a backup worksite in the event the business location becomes unsafe. Just like our example above, a simple disaster could cause major issues. I know of one company that arranged to meet at the local library for the three days it took to finalize a lease on temporary office space. You need to have some type of plan and be able to lead the other employees to keep them safe and productive during an unscheduled office closure.
  4. Design and publish a written business continuity plan (BCP) or disaster recovery plan (DRP), making sure it is accessible from anywhere by everyone. It is great having a document on the network, but in a true disaster will the people who need to see it even be capable of opening the file?
  5. Make sure the employees who need to know about the plan are familiar with it, and know who to talk to if they have questions or concerns. When disaster strikes, a business with a written plan and clear communication will quickly cut through the chaos and get back on track.
  6. Businesses should review the plan quarterly, and put the plan to an actual test at least once per calendar year. With the maturity of cloud solutions, you also have to take into account migrating to and living with a more diverse private/public cloud portfolio. Ensure you have a plan for expanding into new capabilities, like high availability and archiving within clouds.
  7. Regularly review existing systems and services to verify they are protected by the plan. Consider which systems are good targets for moving from local servers to cloud-based systems.
  8. Discuss any proposed system to make sure cloud services are properly considered. While not all systems or services are ideal candidates for cloud-based solutions, many are perfect for moving into the cloud.
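Step 2 can be checked mechanically once a test has produced real numbers. This is a minimal sketch, assuming the worst-case data loss is roughly the interval between backups or replication points; the objectives and measurements below are hypothetical.

```python
from datetime import timedelta


def meets_objectives(rto, rpo, measured_recovery, backup_interval):
    """Compare test measurements with the agreed objectives.

    rto: longest acceptable time to restore service
    rpo: largest acceptable window of lost data
    measured_recovery: how long the last test actually took
    backup_interval: time between backups or replication points; used here
        as an approximation of the worst-case data loss (an assumption)
    """
    return {
        "rto_met": measured_recovery <= rto,
        "rpo_met": backup_interval <= rpo,
    }


# Hypothetical numbers: stakeholders want service back within 30 minutes and
# at most 15 minutes of lost data, but the last test took 4 hours.
report = meets_objectives(
    rto=timedelta(minutes=30),
    rpo=timedelta(minutes=15),
    measured_recovery=timedelta(hours=4),
    backup_interval=timedelta(minutes=10),
)
print(report)  # {'rto_met': False, 'rpo_met': True}
```

A result like this one is the mismatch step 2 warns about: a 4-hour failover built for stakeholders who measure failure in minutes.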

Cloud-based services won’t prevent a disaster, and they aren’t even 100% free from outages themselves. What they can do is provide inexpensive failover services that would be too expensive or too complicated for a small business to build on its own.

Incident Recovery Planning

In your business, you are probably the only one tasked with understanding what types of disasters can strike and with planning to keep those disasters from bringing down the business. As Alan Lakein said many years ago, “Failure to plan is planning to fail.” As an information technology professional, one of your many tasks is to understand the risks to your business systems and plan to prevent those risks from impacting your business, or to recover when they do.

About 40% of businesses do not re-open after a disaster and another 25% fail within one year according to the Federal Emergency Management Agency (FEMA). Similar statistics from the United States Small Business Administration indicate that over 90% of businesses fail within two years after a disaster.

History

Modern technology incident recovery planning began in the mid-1970s, as organizations started to build and depend on computer systems. In those days the systems were large mainframes, and they were fairly easy to document and to replicate for testing. By late 1978, Sun Information Systems (later renamed Sungard Availability Services) had been created in Philadelphia as the first commercial hot site vendor in the US.

While the market for companies that help businesses with disaster recovery planning grew through the 1980s, the growth of the internet caused many more companies to look for a robust solution to disaster planning. With the recent growth of cloud computing, it doesn’t matter as much where systems are located. It only matters that the systems are secure, stable, and reliable.

Understand The Risk

Do you understand the risks to your business? Have you looked at the systems your business uses and depends on each day and thought about what would happen if those systems were unavailable? Have you thought about the common risks for your area (tornadoes, earthquakes, hurricanes, blizzards, floods, wildfires, volcanic eruptions, etc.) and considered how you would deal with these issues?

Maybe there are risks unique to your location, like frequent power outages, danger of break-ins, poor building construction, etc. Each of these unique threats can be just as dangerous as natural disasters.

Written Plan

You need to think about each of the risk scenarios, and write down your plan for how you and your team would address those scenarios to keep the business up and running with minimal downtime. You may have to adjust the plan to address concerns about cost and time, and there may be periodic changes as systems and risks change. The plan should cover at least the following:

  1. List of Employees (what they do, when they do it, why they do it, etc.)
  2. Inventory Systems (office equipment, servers, laptops, etc.)
  3. Office Space Requirements (can everything be done remotely, or will the users need office space to access restored systems)
  4. Insurance and Budget Concerns (who will provide money during an actual recovery)
  5. Share The Plan (make sure you aren’t the only one with a copy of the plan, and the plan can survive the incident)

This written plan is a “living document”; it will change as often as your business changes. The idea is to keep the business running even if everything stops working. You have to look at everything important to your company, and determine how you would keep it working if there were a catastrophic failure of one or more of those systems. You don’t want to write this plan by yourself, as everyone in the business has a stake in keeping the business operational.

What would you do if your data center was struck by a tornado, hurricane, or earthquake? Would those systems be protected from damage? What if there was a major failure of your systems, the power infrastructure, the telecommunications network, etc.? Do you have adequate data and system backups? How long would those systems be down before you could purchase new hardware, configure the new hardware for your network, restore your data from backups, test the system integration, and implement those new systems?

You have to think about the full range of disasters, from the failure of a single piece of hardware to a massive failure caused by flooding or another major natural disaster. What will your response be to a data breach? Do you have any contracts or agreements that will allow you to borrow or rent any required hardware or software that will get you through the first 30 days of a disaster? Do you expect to download your backups or installation media from the internet? What if there isn’t any internet access, your backup site is down, or the access is too slow to be useful?

Begin small by making a plan that addresses the most likely disasters. Then work your way up from there, adding new scenarios as you uncover new possible issues or the scope of your environment changes.
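A simple way to decide which disasters are “most likely” and worth planning for first is to score each scenario by likelihood and impact and sort by the product. This is a common risk-matrix convention rather than something a plan requires; the scenarios and the 1 to 5 scores below are placeholders.

```python
def rank_scenarios(scenarios):
    """Sort (name, likelihood, impact) tuples by likelihood * impact, highest first."""
    return sorted(scenarios, key=lambda s: s[1] * s[2], reverse=True)


# Placeholder scores on a 1 (low) to 5 (high) scale.
scenarios = [
    ("hardware failure", 5, 2),
    ("power outage", 4, 3),
    ("flood", 1, 5),
    ("data breach", 2, 5),
]
for name, likelihood, impact in rank_scenarios(scenarios):
    print(f"{name}: score {likelihood * impact}")
```

Re-scoring the list whenever the environment changes is one way to notice when a new scenario deserves a place in the plan.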

Testing

Just like database backups aren’t useful if you can’t restore them, a Disaster Recovery Plan is worthless if you can’t implement the plan. You should conduct a formal test at least once each calendar year, testing if the plan will work for one or more of the scenarios you are planning against. The test should be as realistic as possible, and you should make sure you have a method of measuring the level of success.

There will be issues, like a system that wasn’t included in the written plan, or a technical issue that you didn’t know existed. It could be something as simple as unknown system passwords or missing software installation keys. But that is what a test is all about. You have to test to find those little things that were forgotten or unknown, and then update the written plan to make sure they aren’t an issue during the next test. Eventually you will have everything you need addressed in the plan, and the next test will go smoothly. That means in the event of an actual disaster, when you are confused and under an elevated level of stress, you are more likely to get these core production systems up and running quickly.

If the most likely disaster in your environment is hardware failure, then that should definitely be something you evaluate and test at least once per year. Call your vendors and ask them to verify your service level agreement (SLA) to make sure your expectations match their support agreements. You should also disperse your hardware spare pool to a second location.

If you are at great risk of tornado or hurricane, then you have to analyze how well you have protected your environment from the negative impacts of severe weather. Look at the backup power supply fuel sources and verify the methods of dealing with rising flood waters.

You should be testing those backup systems, verifying your backup tapes, testing your ability to replace a physical server or network switch, and reviewing the plan so that you know the process is adequately documented.
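Part of verifying a backup is confirming that the stored copy is byte-for-byte what was originally written. Here is a minimal sketch using SHA-256 checksums from the Python standard library; the function names are hypothetical, and a matching checksum only shows the copy is intact, so an actual restore test is still needed.

```python
import hashlib


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_copy_matches(original_path, copy_path):
    """True when the stored copy has the same checksum as the original backup."""
    return sha256_of(original_path) == sha256_of(copy_path)
```

Recording the checksum at backup time and comparing it again before each yearly test would catch a copy that was silently corrupted or truncated in storage.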
