Since the inception of virtual servers, people have been looking for the perfect way to back up and restore those systems. The current solution is basically to snapshot the image of the virtual server, identify any changes since the last snapshot, then copy this data to an alternate site for storage. In the event of a disaster, the snapshots can be copied back to the host server and restored to return the virtual servers to their state at the time the last snapshot was taken. This system works fairly well because you can increase or decrease the snapshot frequency until the schedule meets your corporate requirements.
Another discussion that has been going on just as long is the best way to back up and restore databases on virtual servers. Some people, mostly SAN vendors, have been selling SAN snapshots as the solution to all backup needs, including database backups. Most people, including myself, have been skeptical of the success rate when relying on snapshots, mostly because it depends too much on the selected vendor solution working 100% of the time.
In this article from
First rule of backups: Backup can’t depend on the production data.
Second rule of backups: Backups must be moved to another device.
With SAN snapshots the snapshot lives on the same device as the production data. If the production array fails (which happens), or gets decommissioned by accident (it’s happened), or tips over because the raised floor collapsed (it’s happened), or someone pulls the wrong disk from the array (it’s happened), or someone is showing off how good the RAID protection in the array is and pulls the wrong two disks (it’s happened), or two disks in the same RAID set fail at the same time (or close enough to each other that the volume rebuild doesn’t finish between them failing) (yep, that’s happened as well), it’s game over. You’ve just lost the production system and the backups.
I’ve seen two of those happen in my career. The others that I’ve listed are all things which I’ve heard about happening at sites. Anything can happen. If it can happen it will (see item above about the GOD DAMN RAISED FLOOR collapsing under the array), so we hope for the best, but we plan for the worst.
Third rule of backups: OLTP systems need to be able to be restored to any point in time.
See my entire post on the SAN vendor’s version of “point in time” vs. the DBA’s version of “point in time”.
If I can’t restore the database to whatever point in time I need to, and my SLA with the business says that I need to, then it’s game over.
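To make the gap between the two meanings of “point in time” concrete, here is a minimal Python sketch. It assumes snapshots are the only restore points available, so the best you can do is fall back to the latest snapshot taken at or before the moment you wanted. The function name `snapshot_restore_gap` and the timestamps are hypothetical, made up purely for illustration.

```python
from bisect import bisect_right
from datetime import datetime, timedelta


def snapshot_restore_gap(snapshot_times, target):
    """Return (snapshot, gap) for a snapshot-only restore.

    snapshot is the latest snapshot taken at or before target, and gap is
    the amount of work between that snapshot and target that would be lost.
    Returns (None, None) when no snapshot exists at or before target.
    """
    ordered = sorted(snapshot_times)
    # Index of the first snapshot strictly after the target time.
    idx = bisect_right(ordered, target)
    if idx == 0:
        return None, None
    chosen = ordered[idx - 1]
    return chosen, target - chosen


# Hypothetical schedule: one snapshot every 15 minutes.
snapshots = [
    datetime(2024, 1, 1, 10, 0),
    datetime(2024, 1, 1, 10, 15),
    datetime(2024, 1, 1, 10, 30),
]

# The business asks for the database as it was at 10:29.
chosen, gap = snapshot_restore_gap(snapshots, datetime(2024, 1, 1, 10, 29))
print(chosen, gap)
```

With a 15-minute schedule, a request for 10:29 lands on the 10:15 snapshot and loses 14 minutes of transactions. If the SLA promises restores to an arbitrary moment, that gap is exactly the part the snapshot schedule cannot cover.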
So what is a SQL Server DBA supposed to do if he is forced to rely on SAN snapshots for his database backups? Test everything. Don’t believe it when your system administrator tells you he is backing up your database instance every 15 minutes. Don’t believe him when he tells you the restoration of the server will only take about 30 minutes. Don’t believe him when he tells you he will know when the SAN fails or a drive has gone bad, or even that he knows what to do if the LUN fails and takes an entire group of virtual servers down.
Make sure you have practiced the steps required to restore a single server, to restore a group of servers, and even to restore your server to a different LUN. Write everything down and include the steps in your written Incident Recovery Plan. Also, never give up on the idea of separate, full database backups to a different device. You may not have the budget this year, but you might get the money next year or the year after. Explaining the limitations of SAN snapshots can encourage reluctant managers to change their backup expectations.
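One way to test the “every 15 minutes” claim is to collect the snapshot timestamps yourself and look at the largest gap between them. The Python sketch below assumes you can export those timestamps from whatever tool manages the snapshots; the function names, the one-minute tolerance, and the sample times are hypothetical.

```python
from datetime import datetime, timedelta


def largest_gap(snapshot_times):
    """Return the longest interval between consecutive snapshots, or None
    when there are fewer than two snapshots to compare."""
    ordered = sorted(snapshot_times)
    if len(ordered) < 2:
        return None
    return max(later - earlier for earlier, later in zip(ordered, ordered[1:]))


def violates_claim(snapshot_times, claimed_interval,
                   tolerance=timedelta(minutes=1)):
    """Return True when the observed schedule does not support the claim.

    Fewer than two snapshots counts as a violation, because the claimed
    interval cannot be verified at all.
    """
    gap = largest_gap(snapshot_times)
    return gap is None or gap > claimed_interval + tolerance


# Hypothetical export: the 10:30 snapshot silently failed.
observed = [
    datetime(2024, 1, 1, 10, 0),
    datetime(2024, 1, 1, 10, 15),
    datetime(2024, 1, 1, 10, 45),
]

print(largest_gap(observed))
print(violates_claim(observed, timedelta(minutes=15)))
```

Here the missed snapshot shows up as a 30-minute gap, so the 15-minute claim fails the check. Running something like this on a schedule gives you evidence to bring to the conversation instead of a verbal assurance.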