Thursday, June 14, 2018

The mysteries of HA and DR

In this day and age, when nearly every company in the US (perhaps the world) relies upon information technology and may even have its own servers, be they on premises, in a data center or in the cloud, the topics of HA (High Availability) and DR (Disaster Recovery) are more important than ever.
And yet, as important as these topics are, there is still much confusion about both.  I had a conversation today with somebody who was very technical and still had the concepts confused.  With the advent of Cloud Computing they have become even more confusing.

The infusions of ads from companies like Google, Microsoft, IBM and Amazon Web Services paint a rosy picture but do little to actually lift the fog.

Let me start at the most basic conceptual level.

High Availability
The ability of a system to recover quickly from the failure of a required service, be it hardware, operating system, networking or power.

Disaster Recovery
The ability to recover from the complete loss of your primary datacenter and all the hardware contained within.

These two concepts can exist completely independently of each other or can be intertwined.  To have one, you do not have to have the other, and in fact many companies have neither or have only parts of each implemented.

HA would require absolute redundancy at every level within the data center where the hardware is located.
Redundant power.  At least 2 sources of power into the data center, with backup generators and batteries (UPS).  2 sources of power into the cabinet, on different legs from within the data center.  2 power supplies in each device within the cabinet (servers, switches, routers, SAN, etc.).

Redundant network.  Multiple sources of connectivity outside of the data center (internet), plus multiple switches, routers, etc., all hooked into all servers, which will themselves have multiple network interface cards.

Redundant storage.  RAIDed drives on a SAN with redundant transmission technology, be it Fibre Channel, SCSI or iSCSI.  The SAN must have redundant controllers, etc.

Redundant servers.  Yes, after all of this, each server is still a single point of failure.  Therefore redundant servers are a must.  And no, it isn’t overkill to have redundant servers in a virtualized environment, as the OS can become corrupt as well.

All of this redundancy, if configured properly, can protect against a multitude of failures from within the data center.  And if you have to choose whether to prioritize HA or DR, I would start here with HA, as you are far more likely to need an HA solution.
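To see why every layer needs a second copy, it helps to run the arithmetic.  The sketch below is only an illustration: it assumes failures are independent, and the 99.9% availability figure for each layer is a made-up number, not a measurement from any real data center.

```python
from math import prod

HOURS_PER_YEAR = 24 * 365


def series_availability(parts):
    """Availability of a chain in which every part must be up."""
    return prod(parts)


def redundant_availability(availability, copies):
    """Availability of `copies` independent units when one survivor is enough."""
    return 1 - (1 - availability) ** copies


# Hypothetical figures: power, network, storage and server at 99.9% each.
layers = [0.999, 0.999, 0.999, 0.999]

single_path = series_availability(layers)
doubled = series_availability([redundant_availability(a, 2) for a in layers])

for label, value in (("single path", single_path), ("every layer doubled", doubled)):
    downtime = (1 - value) * HOURS_PER_YEAR
    print(f"{label}: {value:.6f} available, about {downtime:.2f} hours down per year")
```

Real failures are rarely fully independent (a bad firmware update can hit both controllers at once), so treat the doubled figure as a best case rather than a promise.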

All of this does not mean a good backup (and restore) plan is no longer needed.  HA does not protect against deleted files or corrupt data, nor does it protect against user mistakes or deliberate data vandalism.  As a matter of fact, a good backup plan can also be the start of the most basic DR plan.

What HA is for the datacenter, DR is for the entire organization.
Disaster Recovery takes into account all of the enterprise's mission critical operations, both technical and non-technical.  This is an all encompassing plan that should be well documented, well practiced and frequently updated.

DR is life insurance for your company.  You really don’t want to use it, but without it your company can fail in the event of a datacenter failure.  The good news is that there are many different paths to DR, and that depending on your RPO/RTO (Recovery Point Objective, how much recent data you can afford to lose, and Recovery Time Objective, how long you can afford to be down) your DR plan can be anything from very simple to very elaborate.  The important thing is to have a plan and have it implemented.
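A quick way to make RPO and RTO concrete is to check a proposed plan against them.  The sketch below is a simplification under stated assumptions: the function name and the sample numbers are hypothetical, and the worst-case data loss is taken to be one full backup interval (a failure just before the next backup runs).

```python
from datetime import timedelta


def meets_objectives(backup_interval, restore_time, rpo, rto):
    """Compare a plan's worst-case data loss and downtime with RPO and RTO."""
    # Assumption: the worst case loses everything written since the last backup.
    worst_case_loss = backup_interval
    return {
        "rpo_met": worst_case_loss <= rpo,
        "rto_met": restore_time <= rto,
    }


# Hypothetical plan: nightly backups and a three-day rebuild.
result = meets_objectives(
    backup_interval=timedelta(hours=24),
    restore_time=timedelta(days=3),
    rpo=timedelta(hours=4),
    rto=timedelta(days=7),
)
print(result)
```

In this made-up case the plan recovers fast enough but loses too much data, which points toward more frequent backups rather than more hardware.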

Your DR plan could simply be to keep backups offsite: backups of data, files, install media and license keys, along with documentation on how everything is configured and contact information for the vendors you purchase equipment from.  This could take a month or more to implement, but at least it is a plan, and it is inexpensive.  At a bare minimum, I would implement this in any organization this very moment.  Recovery may take weeks or maybe even months using this method, depending on how long it takes to procure hardware, find a new data center, negotiate and get contracts in place, etc.  If your company cannot survive without its IT systems this long, this is only a start and you will need to go further.
If your DR plan is to have some older servers stashed away, powered off, in a building on the other side of town (more than 10 miles away) with its own power, networking gear and the like, along with recent good backups (good = tested) that get periodically moved to the DR site, that is indeed a DR plan.  It may take weeks to get right, but your company can keep moving.

If your company cannot survive a week without its IT systems, you may need something better than cold standby.  At that point you’d want warm standby, where all the network gear and servers are up, running and configured, and where your data is incrementally moved over from production to DR.
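The three options above can be summed up as a rough decision rule keyed on how long the business can be down.  This is only a sketch: the one-week cutoff comes from the paragraph above, while the four-week cutoff and the function name are assumptions chosen for illustration.

```python
from datetime import timedelta


def suggest_dr_tier(max_tolerable_outage):
    """Map the longest survivable outage to the least elaborate DR approach."""
    if max_tolerable_outage < timedelta(weeks=1):
        # Gear already running, data moved over incrementally.
        return "warm standby"
    if max_tolerable_outage < timedelta(weeks=4):  # assumed cutoff
        # Powered-off servers across town plus tested backups.
        return "cold standby"
    # Offsite backups, install media, license keys and documentation.
    return "offsite backups"


for outage in (timedelta(days=2), timedelta(weeks=2), timedelta(weeks=12)):
    print(outage, "->", suggest_dr_tier(outage))
```

Each tier includes the ones below it: a warm standby site still needs tested offsite backups behind it.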

Where does the Cloud enter into all of this?

The Cloud, if you believe all the hype, will protect your organization from all of this.  

The cloud is, in reality, simply a bunch of equipment in a datacenter that many different companies can use and that may or may not be open to global connectivity.  In essence it is simply virtualized hardware that has some aspects of HA and perhaps even DR built in. However just leaning on the cloud and trusting it to protect you would not be the wisest course of action.  There still needs to be a plan for both HA and DR in place, something that is tested and updated regularly.

Most good Cloud providers are, at a minimum, highly available to an extent.  This means that if you build out a few servers in the cloud, they can be hosted by many different physical hosts with redundant hardware, power and networking.  The VM may not have a redundant OS, but if the server can be down for a few hours you could rely on VM backups to recover it.  If not, consider some form of clustering or load balancing to cover you when good OSes go bad.

Where the cloud gets more confusing is with DR.  Many top tier cloud providers do indeed provide DRaaS, or Disaster Recovery as a Service.  This is a great start.  It is a good way to get much of your environment into a redundant datacenter.  However, there are some odd systems, such as Exchange, SharePoint and SQL Server, that may not take nicely to many DRaaS solutions.  This is where your organization needs to both trust and verify what your vendor says.  Your cloud vendor should be more than happy to help you build and test DR if you have purchased their DRaaS package, and that includes making sure every service you require is up and running in the DR environment.  I cannot recommend enough that this be tested prior to migration.  Even if the test covers just a few simple servers in a configuration that basically matches what your organization has, it is worth doing.

If you are already in with the vendor it is still worth building a test DRaaS environment to make sure that everything works before going live with it.  

Once live, it is imperative to test both HA and DR solutions frequently.  Yes, there are indeed risks with the tests.  I can tell you that, as a Data Professional, I always feel a bit nervous when I have clients testing DR.  Take your time, go step by step, make sure everything done can be undone, and you will mitigate most of the risk.  Remember that you are planning for the survival of your business should a disaster arise.

If you’ve stuck with me this long let me leave you with this thought:
How important are the information systems your company has to daily operations?
How important is the data stored in these various systems?
How long can your company stay in business with a 100% outage?
What systems are most critical to your daily business, and which systems can wait?

Planning HA and DR is all about compromise and priority.  It can be very expensive. Going out of business can be far more costly.  Think of your customers, coworkers and business partners and decide how to best serve them by making sure your systems can withstand failures both within and without the data center.

You may notice that I did not mention any specific HA or DR technology in this post.  That was quite on purpose.  There are many different technologies that work on many different levels.  In my experience, however, DR requires a few different technologies working in concert, as many solutions have shortcomings, and finding the right mix and match is imperative.