When designing a new system, or even reviewing an existing one, we are asked about expected uptime. Business users often have unrealistic expectations about uptime and acceptable downtime, and those expectations drive our responses. While it is acceptable to target 99.999% uptime (five nines), you have a responsibility to target the uptime your business requires by helping determine what it really needs.
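To make a target such as five nines concrete, it can help to convert the percentage into the downtime it allows. The following is a minimal sketch of that arithmetic; the function name `allowed_downtime_minutes` and the 365-day year are assumptions made for illustration, not part of any standard.

```python
# Convert an availability target into the downtime it permits.
# Assumption: a 365-day year (leap years ignored).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def allowed_downtime_minutes(availability_percent, period_minutes=MINUTES_PER_YEAR):
    """Return the minutes of downtime a target allows over the period."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    return period_minutes * (100 - availability_percent) / 100


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99, 99.999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% uptime allows {minutes:.1f} minutes of downtime per year")
```

Under these assumptions, five nines leaves a little over five minutes of downtime in a whole year, which is often a useful number to put in front of stakeholders before discussing cost.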
Here are some guidelines on setting realistic goals for your High Availability (HA) and Disaster Recovery (DR) solutions:
- Business Impact – Begin with a business impact analysis. It answers the question of what actual impact a failed process will have on your business. The business stakeholders are the people who can properly answer that question, because they have more information on the direct impact an outage will have on sales or services. Facts are the best basis for documenting the expected cost of outages, and that cost will drive your response to prevent them. A system whose outage is merely inconvenient should be a lower priority (with less money spent to keep it online) than a system that will kill the business if it fails. The higher revenue-generating items should naturally move to the top of the priority list.
- Define Uptime – After identifying the business processes and systems that your company's operations depend on every day, define the specific uptime requirements for those priority systems. These are driven by revenue and by operating hours: a system used only during normal local business hours, one used 24x7 in North America, and one used 24x7 internationally each carry different expectations. It is unrealistic to expect that a system will be available 24x7x365 without any allowance for maintenance. If that is a requirement, you need to design a fully functional failover system, which will increase cost and complexity. Once your stakeholders see that cost, they may scale their requirements back to a more realistic (and less expensive) 20x7x365, or some other less expensive and less complex solution. The key is to include the stakeholders in the entire decision, not just to ask them what they want. Offer solutions with their associated cost models and allow the stakeholders to decide what they can afford. The difference between what they asked for and what they are willing to finance is usually measured in stacks of cash.
- Service Level Agreements – You now understand the needs of the business and how to design a system to meet them, but can you actually meet them? You should document a service level agreement (SLA) that outlines the expectations around measuring uptime, failover procedures, system recovery windows, and corporate disaster recovery requirements. The SLA should define what’s required both during and outside normal business hours. How will you measure your success in meeting the business requirements when no disaster has occurred? The stakeholders have asked for specific system uptimes, but have you met that expectation 3 or even 6 months into the system's usage? Have you tested the failover or disaster recovery process to verify that the system you designed, implemented, and paid for is actually configured to do what you told the stakeholders it would do?
- Disaster Recovery – You should perform a disaster recovery test at least once per calendar year, or more frequently if the business system stakeholders request it. The test frequency should be pre-defined in the SLA, and you should provide evidence that you meet the SLA's expectations around downtime, failover, recovery windows, and so on.
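One way to check an uptime commitment a few months into a system's life is to compare logged downtime against the agreed service hours. The sketch below assumes a hypothetical `Outage` record and made-up outage figures; the 20-hours-per-day service window mirrors the 20x7x365 example above, and the calculation counts only unplanned downtime that falls inside service hours.

```python
from dataclasses import dataclass


@dataclass
class Outage:
    """Hypothetical record of one unplanned outage inside service hours."""
    description: str
    minutes: float


def achieved_availability(service_hours_per_day, days, outages):
    """Return achieved availability (percent) over the measurement period."""
    scheduled_minutes = service_hours_per_day * 60 * days
    if scheduled_minutes <= 0:
        raise ValueError("the measurement period must contain service time")
    down_minutes = sum(outage.minutes for outage in outages)
    if down_minutes > scheduled_minutes:
        raise ValueError("logged downtime exceeds scheduled service time")
    return 100 * (scheduled_minutes - down_minutes) / scheduled_minutes


def meets_sla(target_percent, service_hours_per_day, days, outages):
    """True if achieved availability is at or above the SLA target."""
    return achieved_availability(service_hours_per_day, days, outages) >= target_percent


if __name__ == "__main__":
    # Illustrative data only: a 90-day review of a 20x7 system.
    log = [
        Outage("storage controller failover", 45),
        Outage("failed application deployment", 30),
    ]
    achieved = achieved_availability(20, 90, log)
    print(f"Achieved availability: {achieved:.3f}%")
    print("Meets 99.9% target:", meets_sla(99.9, 20, 90, log))
    print("Meets 99.95% target:", meets_sla(99.95, 20, 90, log))
```

Measuring against scheduled service hours rather than the full calendar is a design choice: it keeps agreed maintenance windows from counting against the SLA, so the SLA itself needs to state which hours are in scope.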
You must communicate effectively with business stakeholders to determine your specific business requirements, rather than falling back on default (and often unrealistic) business continuity requirements. This should allow you to provide the uptime and failover services the stakeholders request while delivering cost-effective solutions that are measurable, robust, and repeatable.