In your business, you are probably the only person tasked with understanding what types of disasters can strike and with planning to keep those disasters from bringing down the business. As Alan Lakein said many years ago, “Failure to plan is planning to fail.” As an information technology professional, one of your many tasks is to understand the risks to your business systems and plan to prevent those risks from impacting your business, or to recover quickly when they do.
About 40% of businesses do not reopen after a disaster, and another 25% fail within one year, according to the Federal Emergency Management Agency (FEMA). Similar statistics from the United States Small Business Administration indicate that over 90% of businesses fail within two years after a disaster.
Modern technology incident recovery planning emerged in the mid-1970s as organizations began to build and rely on computer systems. In those days the systems were large mainframes, and they were fairly easy to document and to replicate for testing. By late 1978, Sun Information System (later renamed Sungard Availability Services) had been established in Philadelphia as the first commercial hot site vendor in the US.
While the market for companies that help businesses with disaster recovery planning grew through the 1980s, the growth of the internet caused many more companies to look for a robust solution to disaster planning. With the recent growth of cloud computing, it doesn’t matter as much where systems are located. It only matters that the systems are secure, stable, and reliable.
Understand The Risk
Do you understand the risks to your business? Have you looked at the systems your business uses and depends on each day and thought about what would happen if those systems were unavailable? Have you thought about the common risks for your area (tornadoes, earthquakes, hurricanes, blizzards, floods, wildfires, volcanic eruptions, etc.) and considered how you would deal with these issues?
Maybe there are risks unique to your location, like frequent power outages, danger of break-ins, poor building construction, etc. Each of these unique threats can be just as dangerous as natural disasters.
You need to think about each of the risk scenarios and write down your plan for how you and your team would address them to keep the business up and running with minimal downtime. You may have to adjust the plan to address concerns about cost and time, and you should expect periodic changes as systems and risks change. At a minimum, the written plan should cover the following:
- List of Employees (what they do, when they do it, why they do it, etc.)
- Inventory Systems (office equipment, servers, laptops, etc.)
- Office Space Requirements (can everything be done remotely, or will the users need office space to access restored systems)
- Insurance and Budget Concerns (who will provide money during an actual recovery)
- Share The Plan (make sure you aren’t the only one with a copy of the plan, and the plan can survive the incident)
This written plan is a “living document”; it will change as often as your business changes. The idea is to keep the business running even if everything stops working. You have to look at everything important to your company and determine how you would keep it working if there were a catastrophic failure of one or more of those systems. You don’t want to write this plan by yourself, as everyone in the business has a stake in keeping the business operational.
What would you do if your data center was struck by a tornado, hurricane, or earthquake? Would those systems be protected from damage? What if there was a major failure of your systems, the power infrastructure, the telecommunications network, etc.? Do you have adequate data and system backups? How long would those systems be down before you could purchase new hardware, configure the new hardware for your network, restore your data from backups, test the system integration, and implement those new systems?
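One way to make the "how long would those systems be down" question concrete is to add up an estimate for each recovery step and compare the total to what the business can tolerate. The sketch below is only an illustration: the step names follow the question above, but every duration and the 48-hour limit are made-up placeholders, and it assumes the steps happen one after another.

```python
from dataclasses import dataclass


@dataclass
class RecoveryStep:
    name: str
    hours: float  # estimated duration; placeholder value, not a measurement


def total_downtime_hours(steps):
    """Sum step durations, assuming the steps run one after another."""
    return sum(step.hours for step in steps)


# Hypothetical estimates for replacing a single failed server.
steps = [
    RecoveryStep("purchase and receive hardware", 72),
    RecoveryStep("configure hardware for the network", 8),
    RecoveryStep("restore data from backups", 12),
    RecoveryStep("test system integration", 6),
    RecoveryStep("put the new system into production", 2),
]

tolerable_hours = 48  # assumed limit the business says it can tolerate

estimate = total_downtime_hours(steps)
print(f"Estimated downtime: {estimate} hours")
if estimate > tolerable_hours:
    print("Estimate exceeds the tolerable outage; the plan needs a faster path.")
```

If the total is larger than the business can accept, the longest step (here, waiting on new hardware) is usually the first place to look for a workaround such as spares or a rental agreement.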
You have to think of the full range of disasters, from the failure of a single piece of hardware to a massive failure caused by flooding or another major natural disaster. What will your response be to a data breach? Do you have any contracts or agreements that will allow you to borrow or rent any required hardware or software that will get you through the first 30 days of a disaster? Do you expect to download your backups or installation media from the internet? What if there isn’t any internet access, your backup site is down, or the access is too slow to be useful?
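To judge whether a connection is "too slow to be useful", it helps to do the arithmetic before the disaster. The following is a rough sketch, not a benchmark: the function name, the 80% efficiency factor, and the example sizes are all assumptions you should replace with numbers from your own environment.

```python
def transfer_hours(backup_gb, link_mbps, efficiency=0.8):
    """Estimate the hours needed to download a backup.

    backup_gb:  backup size in gigabytes (decimal; 1 GB = 8000 megabits)
    link_mbps:  nominal link speed in megabits per second
    efficiency: assumed fraction of the nominal speed actually achieved
    """
    if link_mbps <= 0 or efficiency <= 0:
        raise ValueError("link speed and efficiency must be positive")
    megabits = backup_gb * 8000
    seconds = megabits / (link_mbps * efficiency)
    return seconds / 3600


# Hypothetical example: 2 TB of backups over a 100 Mbps connection.
hours = transfer_hours(backup_gb=2000, link_mbps=100)
print(f"Estimated download time: {hours:.1f} hours")
```

With these example numbers the download alone takes more than two days, which is the kind of result that may justify keeping a local copy of critical backups and installation media.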
Begin small by making a plan that addresses the most likely disasters. Then work your way up from there, adding new scenarios as you uncover new possible issues or the scope of your environment changes.
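A simple way to decide where to begin is to write down a rough likelihood for each scenario and sort the list. This is a minimal sketch under stated assumptions: the scenarios and the yearly likelihood values are invented for illustration, and a fuller analysis would normally weigh impact as well as likelihood.

```python
# Hypothetical scenarios with rough yearly likelihood estimates (0.0 to 1.0).
scenarios = {
    "single server hardware failure": 0.30,
    "extended power outage": 0.15,
    "data breach": 0.10,
    "flood at primary site": 0.02,
}


def plan_order(likelihoods):
    """Return scenario names sorted from most likely to least likely."""
    return sorted(likelihoods, key=likelihoods.get, reverse=True)


for rank, name in enumerate(plan_order(scenarios), start=1):
    print(f"{rank}. {name}")
```

As new scenarios come up, add them to the list with an estimate and re-sort, so the plan always covers the most likely problems first.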
Just like database backups aren’t useful if you can’t restore them, a Disaster Recovery Plan is worthless if you can’t implement the plan. You should conduct a formal test at least once each calendar year, testing whether the plan will work for one or more of the scenarios you are planning against. The test should be as realistic as possible, and you should make sure you have a method of measuring the level of success.
There will be issues, like a system that wasn’t included in the written plan, or a technical issue that you didn’t know existed. It could be something as simple as unknown system passwords or missing software installation keys. But that is what a test is all about. You have to test to find those little things that were forgotten or unknown, and then update the written plan to make sure they aren’t an issue during the next test. Eventually you will have everything you need addressed in the plan, and the next test will go smoothly. That means in the event of an actual disaster, when you are confused and under an elevated level of stress, you are more likely to get these core production systems up and running quickly.
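One possible way to measure the level of success is to record each test step as passed or failed, with a note for anything that went wrong. The sketch below is an assumption about how you might track this, not a standard method; the step names and notes are hypothetical examples based on the kinds of issues described above.

```python
from dataclasses import dataclass


@dataclass
class StepResult:
    step: str
    passed: bool
    note: str = ""  # what went wrong, if anything


def success_rate(results):
    """Fraction of test steps that passed (0.0 if nothing was tested)."""
    if not results:
        return 0.0
    return sum(r.passed for r in results) / len(results)


def plan_updates(results):
    """Notes from failed steps, to be folded back into the written plan."""
    return [f"{r.step}: {r.note}" for r in results if not r.passed]


# Hypothetical results from one annual test.
results = [
    StepResult("restore database backup", True),
    StepResult("log in to restored server", False, "admin password not documented"),
    StepResult("reinstall accounting software", False, "installation key missing"),
    StepResult("fail over network switch", True),
]

print(f"Success rate: {success_rate(results):.0%}")
for item in plan_updates(results):
    print("Update plan:", item)
```

Keeping the results from each test lets you compare one year to the next and confirm that the issues you found were actually fixed in the plan.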
If the most likely disaster in your environment is hardware failure, then that should definitely be something you evaluate and test at least once per year. Call your vendors and ask them to verify your service level agreement (SLA) to make sure your expectations match their support agreements. You should also disperse your hardware spare pool to a second location.
If you are at great risk of tornadoes or hurricanes, then you have to analyze how well you have protected your environment from the negative impacts of severe weather. Look at the backup power supply fuel sources and verify the methods of dealing with rising floodwaters.
You should be testing those backup systems, verifying your backup tapes, testing your ability to replace a physical server or network switch, and reviewing the plan so that you know the process is adequately documented.
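For file-based backups, one small piece of verification you can automate is checking that a backup file still matches the checksum recorded when it was created. This is only a sketch and rests on an assumption: that you save a SHA-256 checksum alongside each backup. The function names are illustrative, and a matching checksum shows the file is unchanged, not that it can be restored, so it complements a real restore test rather than replacing it.

```python
import hashlib


def file_sha256(path, chunk_size=1024 * 1024):
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_backup(path, expected_sha256):
    """Return True if the file's checksum matches the recorded value."""
    return file_sha256(path) == expected_sha256.lower()
```

A typical use would be to loop over your backup files, call `verify_backup` with each recorded checksum, and flag any mismatch for investigation before you need that backup in an emergency.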