I’ll admit, Disaster Recovery and High Availability are almost an obsession with me. Having been a production DBA for a number of years, in shops big and small with mission-critical 24/7 applications to support, can you blame me?
And while this topic has been written about and pontificated upon many times over, I’ve yet to see many companies do both Disaster Recovery and High Availability well. Not that they haven’t tried, but sometimes it is seen as good enough to have one or the other, forgetting that they are two sides of the same coin, each important in its own way. I assume it is because many people lump the two into the same category: if you have one you have both, right? Well, I think that is a dangerous way to think.
Let’s start out with my working definition of the two, first up, High Availability:
High Availability, or HA, is simply having enough redundancy built into a given datacenter to sustain massive hardware failure within that datacenter. Be it a server’s power supply, a network switch or router, a disk drive, or even an external internet connection.
Sounds like an oversimplification, but when distilled down, that is all it is: having a ‘fault tolerant’ infrastructure that can absorb the hit of equipment breaking, be it an entire server or a sub-system of one. The specifics vary by application and by company.
For example, if you have some important applications that can tolerate some downtime, you may not want to take on the expense of an MSCS cluster or a VMware ESX cluster; rather, you could employ redundant power, RAID, and enough spares and scripts to move the functionality to another piece of equipment.
What if there is little to no tolerance for downtime? My approach is simple: purchase the most industrial-grade hardware you can (redundant power, networking, storage and controllers), then duplicate it. VMware ESX clustering is pretty cool if you make sure to be N+1 with your nodes. Microsoft Cluster Services is also pretty interesting. And contrary to VMware’s insistence, I have used both on the same servers.
If you think about it, even the two kinds of clustering have different uses. With VMware, sure, you can move the virtual server in the case of hardware failure, but what of OS failure? What about patching? Sure, you can take a snapshot, and yes, VMware reboots are fast. But if the patching doesn’t go well and you have to revert to a snapshot, reboot, and so on, the server is down that entire time. With MSCS clustering you can simply make sure all services are moved off of one node of the cluster, patch it, and then test it. If everything went horribly wrong, you still have a fully functioning node and were only down for about two minutes while the cluster decided not to allow the service to move.
So would I take Microsoft clustering over ESX clustering? In a word? No. ESX clustering gives you so many advantages, allowing seamless upgrades to hardware (memory, CPU, etc.) by simply vMotioning the VM off of one host to another, and using VMware’s tools to up the amount of RAM or the number of CPUs.
Would I rely solely upon VM clustering and ditch MSCS (or other OS-level clustering)? No. My most robust SQL installs used both. MSCS clustering allowed me to patch an entire OS while the cluster was still up, then test by moving the service over to the patched node and seeing if it worked. I would typically let it run a week to decide if I liked it, then I would patch the other node and we’d be 100% up to date.
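As a quick sanity check during that kind of rolling patch, I like to confirm from inside SQL Server which node the clustered instance actually landed on. A minimal sketch, assuming SQL Server 2005 or later (where these server properties and the cluster DMV exist):

-- Which physical node is this clustered instance running on right now?
SELECT SERVERPROPERTY('ComputerNamePhysicalNetBIOS') AS current_node,
       SERVERPROPERTY('IsClustered') AS is_clustered;

-- And which nodes does the instance know about?
SELECT NodeName
FROM sys.dm_os_cluster_nodes;

Run it before and after the move and you know, without guessing from the connection string, that the service really did come up on the freshly patched node.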
So this is all great stuff; how is it different from Disaster Recovery? Isn’t it a disaster to have an entire server blow up? Yes and no. The scope of a disaster is much broader. A disaster in this context is where the entire datacenter goes away. Be it a Russian space station or a huge rock falling from the sky. Or, more typically, some sort of storm: a hurricane, a tornado, or an earthquake. If a hurricane comes through and floods your datacenter, or a tornado blows it off the face of the map, all the high availability technology won’t do you any good.
Disaster Recovery, in essence, is having a complete datacenter in an area that is far away. Now, I’ve read studies that state you can have the datacenters as close as 10 miles apart, but let’s be safe and make it 100 or 200 miles. You don’t need one in California and the other in Chicago. Where I live, in North Carolina, it is okay to have a datacenter in Charlotte and one in Raleigh. Look at some of the most widespread damage: hurricanes. Hugo hit in 1989; Raleigh was fine, Charlotte was in the line of fire. In 1996 it was Fran, and Raleigh was in the crosshairs while Charlotte was a-okay.
DR doesn’t have to be an exact copy of production. That would be very expensive, essentially doubling your hardware costs. Yes, you do need to have the same amount of storage, for example, but the speed of that storage doesn’t need to be as good as production’s. Your servers also don’t need to be as powerful. The idea is that you have something that will work while you rebuild your production environment.
Here is where the two are similar: neither HA nor DR solutions are any good unless they are tested, and tested regularly. I have a saying when it comes to backups: your backups are only as good as your last restore. What that means is that, sure, you may have a backup from last night, but you don’t really know it will work. It might, it probably will, but are you 100% sure? No! The only backup you KNOW works is the last one that you successfully restored.
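In SQL Server terms, a bare-bones restore test might look something like the sketch below. The database name, logical file names, and paths here are made up for illustration; VERIFYONLY alone is not proof, the actual restore plus a CHECKDB is what tells you the backup is good:

-- Sanity-check that the backup file is at least readable
-- (this alone does NOT prove the backup will restore)
RESTORE VERIFYONLY
FROM DISK = N'\\backupserver\sql\SalesDB_full.bak';

-- The real proof: restore it as a throwaway copy...
RESTORE DATABASE SalesDB_RestoreTest
FROM DISK = N'\\backupserver\sql\SalesDB_full.bak'
WITH MOVE N'SalesDB_Data' TO N'E:\RestoreTest\SalesDB_RestoreTest.mdf',
     MOVE N'SalesDB_Log'  TO N'E:\RestoreTest\SalesDB_RestoreTest_log.ldf',
     STATS = 10;

-- ...and make sure what came back is actually consistent
DBCC CHECKDB ('SalesDB_RestoreTest') WITH NO_INFOMSGS;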
The same is true for both High Availability and Disaster Recovery. Unless you go and pull plugs, you don’t know if your teamed NICs or redundant power supplies are any good. Unless you fail over to the other node in your cluster, you won’t know if it will run the services. I’ve seen this before: one node went unused for a long period of time, then an event occurred and the clustered service attempted to fail over, only to have a dependency not come online. Oops. iSCSI decided to drop the disk and bam, your HA cluster is now useless, both nodes dead, and you are getting a frantic call in the middle of the night.
Same goes for DR. Fire it up, test it out. Make sure it is current with logins/passwords, patches, and application code, and that the most current data is there. What good is a DR site if the data is months old or nobody can log in?
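Here are a couple of quick checks I might run on the DR SQL Server during that kind of test. This assumes restores (log shipping or manual) are being recorded in msdb as usual; the login types you care about will depend on your own applications:

-- When was each database on this server last restored to?
SELECT d.name AS database_name,
       MAX(rh.restore_date) AS last_restore
FROM sys.databases AS d
LEFT JOIN msdb.dbo.restorehistory AS rh
       ON rh.destination_database_name = d.name
GROUP BY d.name
ORDER BY last_restore;

-- Do the logins the applications need actually exist here?
SELECT name, type_desc, is_disabled
FROM sys.server_principals
WHERE type IN ('S', 'U', 'G')   -- SQL logins, Windows logins, Windows groups
ORDER BY name;

If the last_restore column is days or weeks old, or half the application logins are missing, your DR site is not really a DR site yet.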
When building HA and DR plans there are a few go-to questions I use:
1) How important is your data?
2) How much does an hour of total downtime cost?
3) What does the cost per hour look like broken down by sub-system?
(Many companies have many smaller applications; some may be mission critical, while some are nice to have, but the company can live without them for a day or two without losing any money.)
4) Will we go out of business if we are 100% down for 14 days? 7 days? 2 days?
5) What sort of realistic budget do we have?
I tend to ask about the value of data and time first, because when you approach a business with the prospect of lost money, they get serious, and they realize that you want them to succeed.
Once you have the basic questions answered, you can figure out what the realistic options are. I once had a manager who wanted me to scope out the pie in the sky, soup to nuts. So I did. The plan duplicated 100% of what we had in production, then added all of the bits to fail over and fail back, as well as the bandwidth needed to keep the DR site up in very near real time.
The cost was staggering. I think it made the business people stop and decide that they would rather risk being down, because there wasn’t enough money for that project.
And while it is really nice to have all the bells and whistles, it is MORE important to have DR at all. Without it you are really one bad storm away from not having a company to work for... and worse yet, all of the people who work for the company would also be out of a job, and any customers that relied on your goods or services would be harmed as well.
There are plenty of sites that talk about the details of both, and I would be happy to talk to you about them as well. I am a DBA, so I can tell you from a data-centric point of view what all is needed, but the networking, storage and virtualization specifics are really black magic to me!
Labels: Disaster Recovery, DR, HA, High Availability, Information Technology, SQL, SQL Server