Have you planned for a failure of your SAN today or next week? Maybe you can’t prevent a SAN failure, but you might be able to prevent a failure from ruining your week. You can never fully prevent hardware failure, even in a SAN, even when your hardware vendor says it can never happen. So if you accept that it can happen, and that it might happen soon, you can make plans to minimize the impact.
Your SAN has a lot of complicated components and plenty of moving parts that can fail, just like any technology. Pilots practice system failures in a simulator so that they can react properly under the real stress of a major system failure during a real flight. In the same way, you want your IT team to practice its response to a major system failure in simulated situations and not during a real failure. You should document how to react to various scenarios, practice using that written documentation to verify it is complete and accurate for each scenario, and update the documentation on a regular basis as technology and business practices change.
Consolidate and Save
Vendors have told you for years to consolidate your server environment to save money on hardware, licensing, cooling, power, etc. by using virtual servers and a SAN for storage, and you have finally started listening. Now you have put all your eggs (very expensive eggs) into one basket. Do you treat that basket with the level of importance it deserves, knowing that all your data, all your emails, all your applications, everything important to your business is running on this small set of complicated components?
Complacency is the killer of a great incident recovery plan.
Complacency is a serious problem in IT environments across the globe. While no one is likely to die if a disaster strikes your server room, when disaster strikes people expect smart people to have a smart solution to the problem. When you are in the planning meetings, are you the one arguing that you can buy enough redundant parts and pieces to eliminate failure, or are you the one who accepts that you can’t buy your way out of a potential system failure and insists on planning for an inevitable system failure?
But We Never Had A Failure Before
It can be difficult being the person who fights for a proper written plan for scenarios that everyone is telling you can’t possibly happen. It will be hard getting the team to test a plan when they insist those components can’t possibly fail, that one broken part can’t cause the system to stop responding, that the vendor won’t allow the system to freeze during a system update, etc.
When was the last time you had a system failure that caused a major disaster? Most people haven’t had to deal with a true disaster, and will often ask why we spend so much time and money planning for something that has never happened and probably never will. I’ve had people ask me why I even worry about some of the scenarios I propose during my technical team meetings if the chances of those disasters are so remote and the cost of planning for those scenarios is so high. I usually answer that understanding what could happen is the first step toward understanding how much it could cost to prevent the issue. If you believe that the chance of a hard drive failure is remote, you might not plan for the failure, and so you save money by not getting the level of support required for fast drive replacement, mounting hot spares, keeping cold spares on hand, etc. If you understand that the chance of a hard drive failure is actually quite high, you might make different decisions. I’ll do what research I can to guide my team to an understanding of the different risk scenarios, and allow the team to judge each risk and spend our money on the items of greatest risk.
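One way to make that conversation concrete is to put rough numbers on each scenario and rank them. The sketch below is only an illustration: the `Risk` class, the scenarios, and every probability and dollar figure are made-up assumptions, and multiplying probability by impact is a common simplification rather than a complete risk model.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    """One failure scenario with rough, hypothetical estimates."""
    name: str
    annual_probability: float  # estimated chance of occurring in a year (0 to 1)
    impact_cost: float         # estimated cost of one unmitigated occurrence
    mitigation_cost: float     # estimated yearly cost of preparing for it

    @property
    def expected_annual_loss(self) -> float:
        # Simplified estimate: probability times impact.
        return self.annual_probability * self.impact_cost


def rank_risks(risks):
    """Return the risks ordered from highest to lowest expected annual loss."""
    return sorted(risks, key=lambda r: r.expected_annual_loss, reverse=True)


# Hypothetical figures for illustration only; substitute your own research.
risks = [
    Risk("Single drive failure", 0.50, 20_000, 3_000),
    Risk("SAN controller failure", 0.05, 500_000, 40_000),
    Risk("Failed firmware update", 0.10, 150_000, 5_000),
]

for risk in rank_risks(risks):
    # A mitigation looks worthwhile when it costs less than the loss it guards against.
    verdict = "worth funding" if risk.expected_annual_loss > risk.mitigation_cost else "discuss further"
    print(f"{risk.name}: expected loss ${risk.expected_annual_loss:,.0f}/yr, "
          f"mitigation ${risk.mitigation_cost:,.0f}/yr, {verdict}")
```

The ranking is only as good as the estimates behind it, which is why the research matters. A scenario marked "discuss further" is not one to ignore; it is one where the team should decide knowingly whether to accept the risk.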
This is actually incredibly easy. All you have to do is get cynical, look for faults, and imagine what could possibly go wrong. You can either spend money preventing a disaster you might never actually see, or you can save a few dollars today and wish in hindsight that you had spent a lot more. Hindsight is cheap because anyone can look back and tell you what you could and should have done differently to better prepare for the disaster… after the disaster has already happened.
Plan for the very worst, and hope for the very best.
Redundancy Is Your Friend
If your environment diagrams look like a whole bunch of production systems or databases sitting on top of a carefully sculpted set of expensive single points of failure – you might be doing something incredibly wrong. Putting everything on an expensive SAN won’t mean you have a system that can’t fail.
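A quick way to start that review is to list each role in your environment diagram next to the number of independent instances that can take over for each other. The sketch below is a minimal illustration; the inventory and the `single_points_of_failure` helper are hypothetical, and a real review would also consider shared dependencies such as power, firmware, and cabling paths.

```python
def single_points_of_failure(inventory):
    """Return the roles that have fewer than two independent instances."""
    return sorted(role for role, count in inventory.items() if count < 2)


# Hypothetical inventory: role -> number of independent, redundant instances.
inventory = {
    "SAN controller": 1,
    "Fibre Channel switch": 2,
    "Power feed": 1,
    "Hypervisor host": 3,
}

for role in single_points_of_failure(inventory):
    print(f"Single point of failure: {role}")
```

Every role this prints is something the whole environment sits on top of with no fallback, and it deserves either redundancy or a written, tested recovery plan.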
So here is where you and your entire IT organization need to come together and plan the best shape of your environment. You will end up taking known risks and making decisions based on risk and reward, but you will feel a lot better having had that conversation and made informed decisions. You may not get to the level of fault tolerance that you want or that would satisfy your paranoia, but at least you laid out the risks and rewards and helped define the cost and benefit analysis up front. If you are the database administrator and you see a lot of room for failures, you need to call those out and suggest solutions, understand the business needs, discuss the pros and cons, and help architect the right solution for you and your workload.
The point is that you need to understand the risk, configure your environment to minimize risk, and destroy all single points of failure.