Thursday, June 14, 2018

The mysteries of HA and DR


In this day and age, when nearly every company in the US (perhaps the world) relies on information technology and may even have its own servers, be they on premises, in a data center, or in the cloud, the topics of HA (High Availability) and DR (Disaster Recovery) are more important than ever.
And yet, as important as these topics are, there is still much confusion about both.  I had a conversation today with somebody who was very technical and still had the two concepts confused.  With the advent of cloud computing they are even more confusing.

The flood of ads from companies like Google, Microsoft, IBM and Amazon Web Services paints a rosy picture but does little to actually lift the fog.

Let me start at the most basic conceptual level.

High Availability
The ability of a system to quickly recover from the failure of a required service, be it hardware, operating system, networking, or power.

Disaster Recovery
The ability to recover from the complete loss of your primary datacenter and all the hardware contained within.

These two concepts can exist completely independently of each other or can be intertwined.  To have one, you do not have to have the other, and in fact many companies may actually have neither, or only parts of each, implemented.

HA requires redundancy at every level within the data center where the hardware is located.
Redundant power.  At least two sources of power into the data center with backup generators and batteries (UPS), two sources of power into the cabinet on different legs from within the data center, and two power supplies for each device within the cabinet (servers, switches, routers, SAN, etc.).

Redundant network.  Multiple sources of connectivity outside of the data center (internet), multiple switches, routers, etc., all hooked into all servers, which in turn have multiple network interface cards.

Redundant storage.  RAIDed drives on a SAN with a redundant transport technology, be it Fibre Channel, SCSI, or iSCSI.  The SAN must have redundant controllers as well.

Redundant servers.  Yes, after all of this, each server is still a single point of failure, so redundant servers are a must. And no, it isn’t overkill to have redundant servers in a virtualized environment, as the OS can become corrupt as well.

All of this redundancy, if configured properly, can protect against a multitude of failures within the data center.  And if you have to prioritize between HA and DR, I would start here with HA, as you are far more likely to need an HA solution.

All of this does not mean a good backup (and restore) plan is no longer needed.  HA does not protect against deleted files or corrupt data, nor does it protect against user mistakes or deliberate data vandalism. As a matter of fact, a good backup plan can also be the start of the most basic form of DR.

What HA is for the data center, DR is for the entire organization.
Disaster Recovery takes into account all of the enterprise's mission-critical operations, both technical and non-technical.  It is an all-encompassing plan that should be well documented, well practiced, and frequently updated.

DR is life insurance for your company.  You really don’t want to use it, but without it your company can fail in the event of a data center loss.  The good news is that there are many different paths to DR, and depending on your RPO/RTO (Recovery Point Objective and Recovery Time Objective) you can have anything from the most simple of DR plans to something very elaborate. The important thing is to have a plan and have it implemented.

Your DR plan could simply be to keep backups offsite: backups of data, files, install media, and license keys, along with documentation of all the configurations and contact information for the vendors you would purchase replacement equipment from.  This could take a month or more to implement, but at least it is a plan, it is inexpensive, and at a bare minimum I would implement it for any organization this very moment. Recovery may take weeks or even months using this method, depending on how long it takes to procure hardware, find a new data center, negotiate and get contracts in place, and so on. If your company cannot survive without its IT systems that long, this is only a start and you will need to go further.
If your DR plan is to have some older servers stashed away, powered off, in a building on the other side of town (more than 10 miles away) with its own power, networking gear, and the like, along with recent good backups (good = tested) that are periodically moved to the DR site, that is indeed a DR plan, commonly called cold standby.  It may take weeks to get right, but your company can keep moving.

If your company cannot survive a week without its IT systems, you may need something better than cold standby.  At that point you’d want warm standby, where all the network gear and servers are up, running, and configured, and where your data is incrementally moved from production to DR.
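For a SQL Server shop, the crudest form of that incremental movement is simply shipping transaction log backups to the warm site on a schedule. A bare-bones sketch follows; the database name, paths, and frequency are placeholders, and the DR copy must have been restored WITH NORECOVERY or STANDBY beforehand.

/*On production: back the log up to a share the DR site can reach*/
BACKUP LOG [MyDb]
TO DISK = N'\\drshare\MyDb_20180614_1200.trn'
WITH CHECKSUM;

/*On the DR server: roll the warm-standby copy forward, leaving it ready for more*/
RESTORE LOG [MyDb]
FROM DISK = N'\\drshare\MyDb_20180614_1200.trn'
WITH NORECOVERY;

Built-in log shipping automates exactly this loop; mirroring and Availability Groups stream the log continuously instead of shipping files.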

Where does the Cloud enter into all of this?

The Cloud, if you believe all the hype, will protect your organization from all of this.  

The cloud is, in reality, simply a bunch of equipment in a datacenter that many different companies can use and that may or may not be open to global connectivity.  In essence it is simply virtualized hardware that has some aspects of HA and perhaps even DR built in. However just leaning on the cloud and trusting it to protect you would not be the wisest course of action.  There still needs to be a plan for both HA and DR in place, something that is tested and updated regularly.

Most good cloud providers are, at a minimum, highly available to an extent, meaning that if you build out a few servers in the cloud, they can be hosted by many different physical hosts with redundant hardware, power, and networking.  The VM may not have a redundant OS, but if the server can be down for a few hours you could rely on VM backups to recover it. If not, consider some form of clustering or load balancing to cover you when good OSes go bad.

Where the cloud gets more confusing is with DR.  Many top-tier cloud providers do indeed provide DRaaS, or Disaster Recovery as a Service.  This is a great start; it is a good way to get much of your environment into a redundant datacenter.  However, there are some odd systems such as Exchange, SharePoint, and SQL Server that may not take nicely to many DRaaS solutions.  This is where your organization needs to both trust and verify what your vendor says. Your cloud vendor should be more than happy to help you build and test DR if you have purchased their DRaaS package; that includes making sure every service you require is up and running in the DR environment.  I cannot recommend enough that this be tested prior to migration. Even if it is just a few simple servers in a configuration that basically matches what your organization has, it is worth testing the plan prior to migration.

If you are already in with the vendor it is still worth building a test DRaaS environment to make sure that everything works before going live with it.  

Once live, it is imperative to test both HA and DR solutions frequently.  Yes, there are indeed risks with the tests. I can tell you, as a data professional, that when I have clients testing DR I always feel a bit nervous.  Take your time, go step by step, make sure everything done can be undone, and you will mitigate most of the risk. Remember that you are planning for the survival of your business should a disaster arise.


If you’ve stuck with me this long, let me leave you with a few questions:
How important are the information systems your company has to daily operations?
How important is the data stored in these various systems?
How long can your company stay in business with a 100% outage?
What systems are most critical to your daily business, and which systems can wait?

Planning HA and DR is all about compromise and priority.  It can be very expensive. Going out of business can be far more costly.  Think of your customers, coworkers and business partners and decide how to best serve them by making sure your systems can withstand failures both within and without the data center.

You may notice that I did not mention any specific HA or DR technology in this post. That was quite on purpose. There are many different technologies that work at many different levels; however, in my experience DR requires a few different technologies working in concert, as many solutions have shortcomings, and finding the right mix and match is imperative.

Wednesday, May 02, 2018

(Mis) adventures in Always On Availability groups

As almost a part 2 to my last post I have done more AOAG testing.  To say that I’m a little shocked at the results may well be the understatement of the hour.

As I wrote before, I had something strange happen with a customer’s AOAG setup that left a small subset of their databases unable to sync up with the primary replica.  Since this was in production, I had precious little I could do to rectify the situation.  I did find a somewhat satisfactory fix for the issue, but the question lingered: how delicate is Always On, really?

I set up a fresh AOAG cluster on Windows 2016 and SQL 2016.  I built the server to my company's standard install and got the AOAG set up, but deliberately failed to set up quorum, as that is how I inherited the customer cluster where I first saw the problem.

First I disabled the NIC on the secondary replica.  The primary kept on trucking while I did some inserts, updates, and even schema changes for good measure.  Then I re-enabled the NIC.  The secondary replica struggled, sputtered, and then came to life without any real intervention.  That was good.

I then went for the gusto. A former colleague of mine informed me that when he did this test it killed the AOAG, so I tried it myself… and disabled the primary replica’s NIC.

I wish there were fireworks, or explosions, or something more notable to mark what happened… but alas, all that happened is that the primary kept chugging and the secondary replica… died.  The database entered a “Not Synchronizing / Recovery Pending” state and stayed there for the better part of an hour before I intervened.

I decided to just delete the AG.  Big fat mistake.  That took the primary database to a “Restoring” state.  Oops.

Then I tried the entire test all over again.  Prior to setting up the AG, I set the quorum up properly.  I repeated the test with both replicas in synchronous mode (as they were before) and this time… bam, it simply failed over as expected, then failed back when I ran the test again, and again, and again, and again (I think you get the point here).

AOAG is okay.  That said, I’ve never had major issues with mirroring or log shipping either.  The one thing I wish they had out of the box is easy failover/failback… but that is easily enough scripted out.
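For what it’s worth, once quorum is set up properly and the replicas are in synchronous commit, a scripted planned failover is close to a one-liner. A minimal sketch, run on the secondary you want to promote (the AG name is a placeholder):

/*Sanity check: is this replica a healthy candidate to take over?*/
SELECT ars.role_desc,
       ars.synchronization_health_desc
FROM sys.dm_hadr_availability_replica_states AS ars
WHERE ars.is_local = 1;

/*Planned failover, no data loss when the secondary is synchronized*/
ALTER AVAILABILITY GROUP [MyAG] FAILOVER;

/*"Failback" is just the same command run later on the original primary*/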

Would I jump ship and upgrade just to get AOAG?  No.  That said, Mirroring is deprecated so when the normal upgrade cycle for your environment comes up that will be the time to move.

Thanks Microsoft...


Wednesday, April 25, 2018

He's Back.... and he has an Always On post

I had an AOAG issue yesterday; the 'fix' was to remove and re-add the DBs to the AG. I tried simply removing and rejoining the third replica, but alas, that failed.  My best guess is that something in the metadata on the primary was hosed and not reading right.

Cause
A network outage, due to changes made by another engineer, caused a 'blip' in AOAG.  There was a slew of errors where the SQL Servers complained about not being able to talk to each other, and then the AGs were back online and syncing.

Symptoms
All four AGs were online, with #4 having 12 of 15 databases syncing just fine; 3 would not.  There were no errors in the log when everything came back.

There were no errors in the log when I attempted steps 1 and 2 down below.

When I did step 4 I got the dreaded error 35250... BUT, as I said, 12 of 15 DBs were working in this AG, as well as three other AGs on the same primary and secondary replicas.

So it wasn't the usual suspects of a port not being open, permissions issues, or anything of the like.

Things I tried
1) HADR Resume from secondary and primary
ALTER DATABASE database_name SET HADR RESUME

2) Suspend and Resume on Primary
ALTER DATABASE database_name SET HADR SUSPEND

ALTER DATABASE database_name SET HADR RESUME

3) Restarting Secondary Replica (and endpoints as well)
4) Removing and re-adding Replica
5) Removing DB from AG and re-adding DB to AG (which worked — a rough sketch of those commands is below)
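For reference, here is roughly what step 5 boils down to in T-SQL. This is not verbatim what I ran; the AG, database, and backup file names are placeholders.

/*On the primary replica: pull the stuck database out of the AG, then add it back*/
ALTER AVAILABILITY GROUP [MyAG] REMOVE DATABASE [MyDb];
ALTER AVAILABILITY GROUP [MyAG] ADD DATABASE [MyDb];

/*On the secondary replica: catch the restoring copy up with any log backups
  taken since it stopped syncing, then join it back to the AG*/
RESTORE LOG [MyDb]
FROM DISK = N'\\backupshare\MyDb.trn'
WITH NORECOVERY;

ALTER DATABASE [MyDb] SET HADR AVAILABILITY GROUP = [MyAG];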

/*Check health of AOAG*/

SELECT
       ag.name
     , sb.name
     , ar.replica_server_name
     , ar.availability_mode_desc
     , drs.is_local
     , drs.is_primary_replica
     , drs.synchronization_state_desc
     , drs.synchronization_health_desc
FROM sys.dm_hadr_database_replica_states drs
     INNER JOIN sys.databases sb
          ON drs.database_id = sb.database_id
     INNER JOIN sys.availability_groups ag
          ON ag.group_id = drs.group_id
     INNER JOIN sys.availability_replicas ar
          ON ar.replica_id = drs.replica_id
ORDER BY
         ag.name
       , sb.name
       , ar.replica_server_name;


Wednesday, June 05, 2013

High Availability vs Disaster Recovery. What are they, and do you really need them?

I’ll admit, Disaster Recovery and High Availability are almost an obsession with me.  Having been a production DBA for a number of years and being in shops big and small with mission critical 24/7 applications to support can you blame me?


And while this topic has been written on and pontificated upon many times over, I’ve yet to see many companies do both Disaster Recovery and High Availability well.  Not that they haven’t tried, but sometimes it is seen as good enough to have one or the other, forgetting that they are two sides to the same coin, and important in their own way.  I assume it is because many people lump the two into the same category, and if you have one you have both, right?  Well I think that is a dangerous way to think.


Let’s start out with my working definitions of the two, first up High Availability:
High Availability, or HA, is simply having enough redundancy built into a given datacenter to sustain massive hardware failure within that datacenter, be it a server’s power supply, a network switch or router, a disk drive, or even an external internet connection.


Sounds like an oversimplification, but when distilled down that is all it is: having a ‘fault tolerant’ infrastructure that can absorb the hit of having equipment break, be it an entire server or a sub-system of a server. This varies by application and by company.


For example, if you have some important applications that can tolerate some downtime, you may not want to take on the expense of an MSCS cluster or a VMware ESX cluster; rather, you could employ redundant power, RAID, and enough spares and scripts to move the functionality to another piece of equipment.


What if there is little to no tolerance for downtime?  My approach is simple: purchase the most industrial-grade hardware you can, with redundant power, networking, storage, and controllers, then duplicate it.  VMware ESX clustering is pretty cool if you make sure to be N+1 with your nodes.  Microsoft Cluster Services is also pretty interesting.  And contrary to VMware’s insistence, I have used both on the same servers.


If you think about it, even the clustering has different usages.  With VMware, sure, you can move the virtual server in the case of hardware failure, but what of OS failure? What about patching?  Sure, you can take a snapshot, and yes, VMware reboots are fast.  But if the patching doesn’t work well and you have to revert to a snapshot and reboot and whatnot, the server is down that entire time.  With MSCS clustering you can simply make sure all services are moved off of one node of the cluster, patch it, and then test it.  If everything went horribly wrong, you still have a fully functioning node and were only down for about 2 minutes while the cluster decided whether to allow the service move.


So would I take Microsoft clustering over ESX clustering?  In a word?  No.  ESX clustering gives you so many advantages, allowing seamless upgrades to hardware (memory, CPU, etc.) by simply vMotioning the VM off of one host to another, and using VMware’s tools to up the amount of RAM or number of CPUs.


Would I rely solely upon VM Clustering and ditch MSCS (or other OS level clustering)?  No.  My most robust SQL installs used both.   MSCS clustering allowed me to patch an entire OS while the cluster was still up, then do a test by moving the service over to the patched node and see if it worked.  I would typically let it run a week to decide if I liked it, then I would patch the other node and we’d be 100% up to date.


So this is all great stuff, how is this different from Disaster Recovery?  Isn’t it a disaster to have an entire server blow up?  Yes and no.  The scope of Disaster is much broader.  A Disaster in this context is where the entire datacenter goes away.  Be it a Russian space station or a huge rock falling from the sky.  Or, more typically some sort of storm, hurricane, tornado or an earthquake.    In the case that a Hurricane comes through and floods your datacenter, or a tornado blows it off the face of the map all the high availability technology won’t do you any good.


Disaster Recovery, in essence, is having a complete datacenter in an area that is far away.  Now, I’ve read studies that state that you can have the datacenters as close as 10 miles apart, but let’s be safe and make it 100 or 200 miles.  You don’t need one in California and the other in Chicago.  Where I live, in North Carolina, it is okay to have a data center in Charlotte and one in Raleigh. Look at some of the most widespread damage: hurricanes.  When Hugo hit in 1989, Raleigh was fine while Charlotte was in the line of fire.  In 1996 it was Fran, and Raleigh was in the crosshairs while Charlotte was a-okay.


DR doesn’t have to be an exact copy of production.  That would be very expensive, essentially doubling your hardware costs.  Yes, you do need to have the same amount of Storage, for example, but the speed of the storage doesn’t need to be as good as production.  Your servers also don’t need to be as powerful.  The idea is that you have something that will work while you rebuild your production environment.


Here is where they are similar.  Neither HA nor DR solutions are any good unless they are tested, and tested regularly.  I have a saying when it comes to backups: your backups are only as good as your last restore.  What that means is that, sure, you may have a backup from last night, but you don’t really know it will work.  It might, it probably will, but are you 100% sure?  No!  The only backup you KNOW works is the last one you successfully restored.
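If I had to sketch what “the last one you successfully restored” looks like in practice, it would be something like this; the database, path, and logical file names are placeholders you would swap for your own.

/*Quick check that the backup file is readable and its checksums are intact*/
RESTORE VERIFYONLY
FROM DISK = N'\\backupshare\MyDb_Full.bak'
WITH CHECKSUM;

/*The real test: restore it somewhere harmless and run an integrity check*/
RESTORE DATABASE [MyDb_RestoreTest]
FROM DISK = N'\\backupshare\MyDb_Full.bak'
WITH MOVE 'MyDb'     TO N'D:\RestoreTest\MyDb_RestoreTest.mdf',
     MOVE 'MyDb_log' TO N'D:\RestoreTest\MyDb_RestoreTest_log.ldf',
     RECOVERY,
     STATS = 10;

DBCC CHECKDB ([MyDb_RestoreTest]) WITH NO_INFOMSGS;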


The same is true for both High Availability and Disaster Recovery.  Unless you go and pull plugs, you don’t know if your teamed NICs or redundant power supplies are any good.  Unless you fail over to the other node in your cluster, you won’t know if it will run the services.  I’ve seen this before: one node went unused for a long period of time, then an event occurred and the clustered service attempted to fail over, only to have a dependency not come online.  Oops.  iSCSI decided to drop the disk and bam, your HA cluster is now useless, both nodes dead, and you are getting a frantic call in the middle of the night.


The same goes for DR.  Fire it up, test it out.  Make sure it is current with logins and passwords, patches, and code for your applications, and that the most current data is there.  What good is a DR site if the data is months old or nobody can log in?
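On the SQL Server side, one quick sanity check I might run against a freshly stood-up DR copy is a hunt for orphaned users, database users whose matching server login never made it over. A minimal sketch; the usual fix is scripting the logins across from production.

/*Database users with no matching server login on this instance*/
SELECT dp.name AS orphaned_user
FROM sys.database_principals AS dp
     LEFT JOIN sys.server_principals AS sp
          ON dp.sid = sp.sid
WHERE sp.sid IS NULL
  AND dp.type = 'S'          -- SQL-authenticated database users
  AND dp.principal_id > 4;   -- skip dbo, guest, sys, INFORMATION_SCHEMA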


When building HA and DR plans there are a few go-to questions I use.


1) How important is your data?
2) How much does an hour of total downtime cost?
3) What does the cost per hour look like broken down by sub-system?
(Many companies have many smaller applications; some may be mission critical, while others are nice to have, but the company can live without them for a day or two without losing any money.)
4) Will we go out of business if we are 100% down for 14 days?  7 days?  2 days?
5) What sort of realistic budget do we have?


I tend to ask about the value of data and time first because when you approach a business with the prospect of lost money, they get serious, and they realize that you want them to succeed.


Once you have the basic questions answered, you can figure out what the realistic options are.  I once had a manager who wanted me to scope out the pie in the sky, soup to nuts.  So I did.  We duplicated 100% of what we had in production, and then added all of the bits to fail over and fail back, as well as the bandwidth needed to keep the DR site up in very near real time.


The cost was staggering.  I think it made the business people stop and think that they would rather risk being down, because there wasn’t enough money for that project.


And while it is really nice to have all the bells and whistles, it is MORE important to have DR.  Without it you are really one bad storm away from not having a company to work for... and worse yet, all of the people who work for the company will also be out of a job, and any customers that relied on your goods or services would also be harmed.


There are plenty of sites that talk about the details of both, and I would be happy to talk to you about them as well.  I am a DBA, so I can tell you from a data-centric point of view what all is needed, but the networking, storage and virtualization specifics are really black magic to me!


Tuesday, April 30, 2013

Part 5 DBAs distribute data to the 4 corners...


Another role a DBA plays is distributor of data.  Keeping the data safe, well fed and maintained, accessible, and highly available is all well and good, BUT what if a giant meteor were to fall from the sky and destroy the datacenter where your server lives (lived?)?

Well, many companies would simply go out of business.  Sure, they would scramble to scrape together hardware that can keep things moving, and the poor DBA would hunt down the latest possible backups and desperately try to remember all of the settings and gotchas in building up the environment.  But in the end, it will take days, if not weeks, to get back up, and by that time your customers have moved on.  Your company may survive, but many people will be out of work while the company rebuilds.

A huge part of a DBAs life is to make sure that this doesn’t happen.  No, I don’t mean building some sort of laser or rocket defense against falling objects from space.  But, the DBA needs to drive and work with other departments to create a plan, document the plan, and implement it.

Sometimes budgetary constraints mean that a duplicate datacenter is simply not feasible.  That is no excuse for not having a plan.  This plan should be documented in soft and hard copies, and kept someplace other than the office and the data center.  It needs to be updated at least twice a year, if not quarterly.

The DBA needs to have it clearly spelled out what the minimum requirements are for being up and afloat.  DR doesn’t have to fully support all functionality, nor does it have to be as fast as production, but it needs to support all mission critical functions, and do so with only a slight degradation in performance.  

Your DBA SHOULD know the business well enough to make many decisions on how to accomplish this; however, you, as a business person, can help them by answering any and all questions about what is mission critical, AND by being honest.  You can’t have everything.  Pretend you can only have one thing: what would that one thing be?  Then work down the list.  DR is all about minimizing exposure and minimizing risk.  But reality dictates that most organizations simply won’t be able to spend enough to duplicate the total environment.

Now, you may ask, why should a DBA spend so much time on something that you don’t want to use?  Why do you have homeowners insurance?  Why do you have life insurance?  Why do you have major medical?  Of course you do.  Perhaps not enough, but you have some coverage.  Having a DBA spend the proper time on this task is crucial to the survival of your business in the event of catastrophe.

Again, this is why your DBA frequently has a negative outlook.  DBAs constantly have to look at all of the possibilities of what can go wrong, how it can go wrong, and what the fallout is when it goes wrong.  In a world that wants to see the positive in everything, I hope you can appreciate your DBA and his negative outlook.  Yes, he can rain on any parade, but you need somebody who sees what is broken and can come up with inventive and constructive ways to fix it.

It is important to realize that DBAs have to have a long-term view, and sometimes they cannot (should not) allow little fires here and there to distract them from the big picture, or a literal little fire under your server cabinet could really ruin the day of many people working for you.

It is important that business people allow the DBA to work on this aspect of their job, and even have them schedule time to focus on it.  Ask your DBA about their plans, ask them to show you what they are thinking of, ask them to help the company live on and people keep their jobs in the face of disaster.



Monday, April 29, 2013

DBA Part 4 Architect and Builder


Another metaphor, perhaps even a labored one!  


DBAs are designers and builders.  Even if you have a system in place, it needs renovation or perhaps is close to end of life.  A DBA is always looking to improve things: tweaks to make things faster, safer, more reliable, more available.


There are two key ways DBAs look at the database world.  One is very application-centric and deals with tables, columns, rows, and relationships.  The other is very infrastructure-centered (servers, disks, networking), revolving around GHz, Gbps, GB, and so forth.

Again, the DBA must know the business and the application.  Working with a Business Analyst (or being one himself, as I am a BA too), the DBA needs to learn as much as possible about how the business works in order to know how best to store the data in tables.  Now, in many shops programmers do this task, and some even do a good job.  But in most shops that have a DBA, the DBA is brought in to bless the data design in ER diagram form before being tasked to actually build it.

In an earlier installment of this series I mentioned the term Normalization.  This is the time in the development life cycle (SDLC) that the DBA will employ normalization the most: looking at the data that the company wishes to store, element by element, seeing how it all relates, and then weeding out redundancy and dependency until all that is left is a lean, mean data storage machine.  Or database.  This takes time, thought, printouts, many hours of staring at printouts, a few key moments of head hitting wall, and then more thought.
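To make that “weeding out redundancy” a bit more concrete, here is a toy sketch; all table and column names are invented for illustration. The same order data is shown first as one wide, repetitive table, then split so each customer and product fact lives in exactly one place.

/*Before: one wide table, repeating customer and product details on every row*/
CREATE TABLE dbo.OrdersFlat
(
    OrderID       int           NOT NULL,
    CustomerName  nvarchar(100) NOT NULL,
    CustomerPhone varchar(20)   NULL,
    ProductName   nvarchar(100) NOT NULL,
    Quantity      int           NOT NULL
);

/*After: normalized, with redundancy and dependencies weeded out*/
CREATE TABLE dbo.Customer
(
    CustomerID    int IDENTITY(1,1) PRIMARY KEY,
    CustomerName  nvarchar(100) NOT NULL,
    CustomerPhone varchar(20)   NULL
);

CREATE TABLE dbo.Product
(
    ProductID   int IDENTITY(1,1) PRIMARY KEY,
    ProductName nvarchar(100) NOT NULL
);

CREATE TABLE dbo.CustomerOrder
(
    OrderID    int IDENTITY(1,1) PRIMARY KEY,
    CustomerID int NOT NULL REFERENCES dbo.Customer (CustomerID),
    ProductID  int NOT NULL REFERENCES dbo.Product (ProductID),
    Quantity   int NOT NULL
);

Change a customer’s phone number in the flat version and you touch every one of their order rows; in the normalized version you touch exactly one.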

Once the design is done (assuming no emergencies are currently under way, no new software is being deployed, all maintenance tasks have been done properly, and there are no end users with pending query requests lurking about), the DBA needs to meet with the development staff and explain the design.

I once had the misfortune of designing a very clever schema that was very powerful, flexible and fit the need just right, but didn’t have the time to explain it.  My developer (who happened to be, in this case, my boss’ boss) claimed to understand the design.  I assumed he was right and moved on to other DBA like tasks (napping perhaps or donut eating) and left said developer to their own devices.  Fast Forward to 3 days before go live.  There was no code review, no testing, no nothing.  The developer was in my office hot under the collar because he couldn’t get something to work right.  I pointed out that it was clear he misunderstood the design, and had put in place many hacks and workarounds to my elegant design, and that was causing unneeded complexity and was the cause of his bug.

This revelation went over like a ton of bricks to which he told me
1) that he “Hated me in the marital way” and
2) would NOT ever fix it the right way because he had invested too much time in the wrong way.

Lesson?  Always take the time to talk to your developers, even if they claim to understand.  To the developers in the audience, please make sure you request a walkthrough and understand it fully.  This database went into production and caused so many issues that I re-redesigned it, handed it off to proper developers, EXPLAINED it fully, and now there is a much better application sitting on top of this awesome database.  This bug was my fault because I did not make sure that the user of my design understood it clearly before setting off.  When your DBA explains things to you and tries many ways to make it clear, please realize he is just trying to avoid any missteps.

DBAs also must make sure that everything that the database is sitting on meets the needs of the applications that rely on the database.

This means that the DBA has to walk down to the guys who deal with nuts and bolts.  In this unsavory world people do actual physical labor including, but not limited to, lifting heavy metal boxes, and plugging cables into them.

Most end users aren’t brave enough to even talk to these people, and for good reason, they are scary!  The DBA has to make sure to not only talk to them, but *gasp* develop a great working relationship with them.

While wearing this hat, the DBA must use his knowledge of the applications, the company and how databases work, coupled with his knowledge of server hardware, operating system, networking and storage to help to design a solution that meets today’s needs AND will cover future growth.  

Yup, DBAs must also look into the crystal ball and predict the future.  I did that with a few key metrics on data growth and transactions per time period, by looking at the sales pipeline, and by talking to the execs in Sales as well as the CEO to make sure I understood where the company was going.  If your DBA wants to talk business, indulge them, educate them, help them to understand what is important to your business.  They are just trying to be as educated as possible so that they can make good decisions.
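If you are curious what a “data growth metric” even looks like, one low-tech way I might pull it is straight from the backup history SQL Server already keeps. A minimal sketch; it assumes regular full backups and reports raw bytes.

/*Rough data-growth trend: largest full-backup size per database per month*/
SELECT bs.database_name,
       DATEFROMPARTS(YEAR(bs.backup_finish_date), MONTH(bs.backup_finish_date), 1) AS backup_month,
       MAX(bs.backup_size) AS max_full_backup_bytes
FROM msdb.dbo.backupset AS bs
WHERE bs.type = 'D'   -- 'D' = full database backup
GROUP BY bs.database_name,
         DATEFROMPARTS(YEAR(bs.backup_finish_date), MONTH(bs.backup_finish_date), 1)
ORDER BY bs.database_name, backup_month;

Chart that by month and you have a growth curve you can hold up next to the sales pipeline.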

Armed with this, a plan (or three) is developed.  First there is the “money is no object” plan.  This one is fun because you can look at all sorts of exotic technologies and put together the killer system.  Then there is the middle-of-the-road plan that gives you nice features but costs less, and finally the bare bones: what can you get away with and make do with.

Once the plans are in place, the DBA will then have to defend decisions and assumptions, and therefore must be armed with data (ironic?).  Being armed with data is one thing; another is being able to make cogent arguments that hit at the core.  Always hit them where it counts: in the money belt.  Yes, you are asking to spend a lot of money, but what happens if the data isn’t available?  What happens if it isn’t retrieved as quickly as possible?  What happens if the company grows faster than expected and they have to spend EVEN MORE money on a solution that they could have had in place to start with?

Business people speak in terms of money. DBAs don’t.  DBAs need to, but they simply don’t think that way.  Remember, when speaking to a DBA about this whiz-bang infrastructure he has planned, that he does indeed want the company to succeed, and wants to do so without spending too much money and without working too hard.  We are people too, and I do realize every penny I spend on a new SAN means one less penny for my paycheck or bonus!

This is, in my opinion, the most fun a DBA can have.  The work is almost all theoretical and has no current impact.  Nothing breaks when pencil hits paper; there is no downtime, no risk.  Just research into what is, and what should be, both in the technology realm and in the business realm.  If the DBA seems carefree and smiles, chances are there are no looming emergencies or maintenance windows AND he is busy designing the next generation for the company.  That is a good time for a DBA.  At least until, for a brief moment, he thinks of a possible issue that is not currently handled, and that will cause another sleepless night!


Friday, April 26, 2013

DBA Part 3 Pit Crew of the company


I am a die-hard Formula 1 and LeMans style endurance racing fan.  I watch every F1 race and qualifying session broadcast in the US every year, and have for over 10 years.  I watch ALMS and Le Mans style racing too.

They are vastly different forms of racing that require a different approach and have different rules, yet one thing they have in common is that the less time spent in the pits with the car being serviced the better chance they have at victory.

In many ways a DBA is like this as well.  More often than not these days, a company’s data needs to be accessible 24 hours a day, 7 days a week, 365.25 days a year.  

Not only must it be accessible, but retrieval must be quick, and the data must be right.

What this means for your friendly local DBA is that every move must be planned out with precision, and there are tasks that must happen to keep everything smooth.  Granted, any good (read lazy) DBA will automate as many of these as possible, and then keep tabs on how the automation is running with logs and reports.  A big portion of the DBA’s time is spent on making sure that THEY get the data they need on how things are working.

In this (labored) metaphor, the database server (or data) is like the race car, the driver would be the end user and the DBA is the pit crew.  DBA’s do not drive the car, we are not the car, we build, maintain and fix the car as need arises.

When the server (car) enters the race (goes into production) it is costly to bring the server in for repair.  Be that the tasks mentioned before about index building or maintenance, or checking for corruption.

There are a few tools your DBA has in his tool bag for this.  First and foremost is knowledge of your business cycle; the daily, weekly, monthly, and quarterly cycles of your business are crucial.  You may hear your DBA ask all sorts of questions about your business.  He isn’t nosy; he is trying to do what is best for the company.  There are times in endurance racing when the team will drive to a pace, and times the team will go flat out.  In many ways the daily and weekly maintenance windows are those times.

In most businesses based in the US, nothing can happen until after 9 PM.  Us folks on the east coast have to take the pesky west-coasters into account and keep things running in ‘production mode’ for them until their business day is over.  Some businesses cater to end users who may not use their services until after the workday is over, in which case it is midnight or later when the window opens.

The DBA will work hard at getting the backups, defrags, reorgs, and any nightly roll-ups for reporting done in this time.  The data must still be available, but it is okay for it to be a bit… slow.
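To give a flavor of that window, here is a minimal sketch of the kind of nightly housekeeping that runs in it; database, table, and path names are placeholders, and real jobs are usually driven by fragmentation thresholds inside a SQL Agent job.

/*Nightly full backup, with checksums so corruption is caught early*/
BACKUP DATABASE [MyDb]
TO DISK = N'\\backupshare\MyDb_Full.bak'
WITH CHECKSUM, COMPRESSION, INIT;

/*Find indexes in the current database worth defragmenting tonight*/
SELECT OBJECT_NAME(ips.object_id) AS table_name,
       i.name                     AS index_name,
       ips.avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS ips
     INNER JOIN sys.indexes AS i
          ON i.object_id = ips.object_id
         AND i.index_id  = ips.index_id
WHERE ips.avg_fragmentation_in_percent > 30;

/*Reorganize the worst offenders and refresh statistics*/
ALTER INDEX ALL ON dbo.BigBusyTable REORGANIZE;
UPDATE STATISTICS dbo.BigBusyTable;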

The DBA also makes sure the databases are in tip-top shape and checks for all sorts of issues using automated tasks and reports based upon data collection.  There will be charts about locks, blocks, CPU utilization, and disk utilization, and you may hear him utter things about page splits, buffer cache hit ratio, or any number of other things in a language you may not understand.  This is normal!
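For the curious, a couple of those mysterious numbers can be pulled straight from SQL Server’s own counters. A small sketch; the object names assume a default instance (named instances use a different prefix), and the hit ratio is only meaningful when divided by its matching 'base' counter.

/*Raw counter values for page splits and buffer cache hit ratio*/
SELECT [object_name],
       counter_name,
       cntr_value
FROM sys.dm_os_performance_counters
WHERE (counter_name = 'Page Splits/sec'
       AND [object_name] LIKE '%Access Methods%')
   OR (counter_name LIKE 'Buffer cache hit ratio%'
       AND [object_name] LIKE '%Buffer Manager%');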

This is the normal routine.  What happens when the car breaks, or even worse, the driver crashes it?

Professional pit crews plan for just about every possible scenario and practice each and every move many times until it is second nature.  Pit crews also know their car inside and out and can diagnose issues fairly quickly, and in instances where something is truly broken beyond repair, they will have spares on hand to get the car back into action as quickly as possible.

The same is true for a DBA.  We must know our data, our infrastructure, our applications, and our business so well that when something goes wrong (it always does) we can respond properly and quickly.

This brings me to one key point of a DBA’s design and development work: High Availability.  High Availability, or HA, is simply a way to make sure that if one set of hardware breaks, there is a second nearby to take over without much fuss.  I have built this many times over in my travels, and it works well for all sorts of issues, both planned (patching) and unplanned (memory going bad in a server).

Having the HA plan built and well documented is essential to the smooth operation of an IT shop.  The DBA’s first responsibility here is to make it as seamless as possible so that the fewest number of people need to be involved in any failure.  What good is HA if the dev staff has to be dragged out of bed kicking and screaming to fix an issue?  Budget is also a huge part of this.  The DBA wants to do the best possible job, and may have the inclination to spend copious amounts of cash on this issue.  Trust me, he has the best interest of the company in mind.  Left to their own devices, most good DBAs don’t want to do work that isn’t needed.

After the HA plan is documented, approved, and built, the DBA must test it and schedule routine tests of the plan to make sure it works.  This is nerve-wracking, because if it doesn’t work the DBA is on the spot to fix the issue.  This is the company’s data, and when the test fails, the company is most likely down.
The DBA must also practice for all sorts of other issues.  Practice restoring data, practice debugging or performance tuning.  Learn about all of the bits and pieces the database depends on (the database engine, of course, but also the OS, hardware, networking, and storage).  The DBA cannot possibly know all of this, but must keep a working knowledge of all of these subjects and have a close and good relationship with the people who do know. There has to be trust there on both sides, a mutual respect, and if that is there, then some healthy joking.  I mean, hardware people are only good for plugging stuff in, after all ;)

A good DBA will have all of this going on in the back of his mind at all times.  Always keeping an eye on how things are going, always thinking about how things can go wrong and planning and practicing for every possibility.

Many have key scripts written, and instructions for most errors.  Many use Google to find things ( I know I do) that they don’t know.  DBAs like the boring life, and strive and work hard to have one.  When they are surprised the stress level builds.

When an emergency does happen, please remember this: the DBA will have a Director, VP, and possibly a C-level person in their office until things are fixed.  This is very stressful.  DBAs are often asked questions to which they do not have the answer.  I mean, if I KNEW right away why it was broken, it would be fixed, or better yet would never have gotten to this state.

DBAs are stressed and are paid to be negative.  We are paid to think about every single way data can be made unavailable, corrupted, or destroyed, and then to formulate a plan to keep that from happening and to fix it when (not if) it does happen.  I know I’m slipping into DR here, and I promise I will touch on that more deeply later, but always remember this: the DBA has lots of important tasks to do, and if they do not jump on your query or answer your question straight away, it isn’t because they don’t want to; it is because they are trying to keep the race car running, keep the valuable data from breaking down and crashing, and fix it when it does.

Stick with me; next time I’ll talk about how a DBA architects the data inside of applications for those of you who have home-grown apps, and how the DBA works with the infrastructure manager to build a high-performance system using bits of technology that many don’t see.

Until then, bring your DBA a donut and thank them for all the hard work they do.  Most DBAs respond well to donuts and praise, and who knows, maybe your query request will be done more quickly.  Not that I’m advocating bribery or flattery......
