Wednesday, May 02, 2018

(Mis)adventures in Always On Availability Groups

As a sort of part 2 to my last post, I have done more AOAG testing. To say that I’m a little shocked at the results may well be the understatement of the hour.

As I wrote before, I had something strange happen with a customer’s AOAG setup that left a small subset of their databases unable to sync up with the primary replica. Since this was in production, there was precious little I could do to rectify the situation. I did find a somewhat satisfactory fix for the issue, but the question lingered: how delicate is Always On, really?

I set up a fresh AOAG cluster on Windows Server 2016 and SQL Server 2016. I built the servers to my company’s standard install and got the AOAG set up, but deliberately did not configure quorum, as that is how the customer cluster where I first saw the problem was configured when I inherited it.

First I disabled the NIC on the secondary replica. The primary kept on trucking while I did some inserts, updates and even schema changes for good measure. Then I enabled the NIC. The secondary replica struggled, sputtered and then came back to life without any real intervention. That was good.

I then went for the gusto. A former colleague of mine told me that when he did this test it killed the AOAG, so I tried it... and disabled the primary replica’s NIC.

I wish there were fireworks, or explosions, or something more notable to mark what happened... but alas, all that happened is that the primary kept chugging, and the secondary replica... died. The database entered a “Not Synchronizing / Recovery Pending” state and stayed there for the better part of an hour before I intervened.

I decided to just delete the AG. Big fat mistake. That took the primary database to a “Restoring” state. Oops.

Then I ran the entire test all over again, but this time I set the quorum up properly before setting up the AG. I repeated the test with both replicas in synchronous mode (as they were before), and this time... bam, it simply failed over as expected. Then it failed back when I ran the test again, and again, and again, and again (I think you get the point here).
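
As a rough illustration of why the witness makes such a difference, here is a minimal Python sketch of plain majority voting. It is a simplified model only: the function names are made up for this example, and it ignores the dynamic quorum and dynamic witness adjustments that Windows Server failover clustering applies, so it will not reproduce the exact behavior of a real cluster.

```python
def votes_needed(total_votes: int) -> int:
    """Smallest number of votes that forms a strict majority."""
    return total_votes // 2 + 1


def has_quorum(votes_visible: int, total_votes: int) -> bool:
    """True if one side of a network split can still see a majority of votes."""
    return votes_visible >= votes_needed(total_votes)


# Two replicas, no witness: after a split each side sees only its own vote.
print(has_quorum(1, 2))  # False

# Two replicas plus a witness: the side that still reaches the witness wins.
print(has_quorum(2, 3))  # True

# The isolated replica on the other side of that split loses.
print(has_quorum(1, 3))  # False
```

In this simplified view, a two-node cluster without a witness has no way to break the tie when the nodes lose sight of each other, while a third vote lets exactly one side carry on.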

AOAG is okay. That said, I’ve never had major issues with Mirroring or log shipping. The one thing I wished they had out of the box was easy failover/failback... but that is easily enough scripted out.
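
To give an idea of what "scripted out" could look like for Mirroring, here is a small Python sketch that generates the T-SQL for a manual failover of several mirrored databases. The helper names and the database names are hypothetical. It assumes the databases are in synchronous (high-safety) mode and synchronized, and that the resulting script is run on the current principal. It only builds the text; running it is left to whatever tool you prefer.

```python
def quote_name(name: str) -> str:
    """Bracket-quote a SQL Server identifier, doubling any closing bracket."""
    return "[" + name.replace("]", "]]") + "]"


def mirroring_failover_script(databases) -> str:
    """Build a T-SQL script that manually fails over each mirrored database.

    ALTER DATABASE ... SET PARTNER FAILOVER is issued from the master
    database on the principal server.
    """
    lines = ["USE master;"]
    for db in databases:
        lines.append(f"ALTER DATABASE {quote_name(db)} SET PARTNER FAILOVER;")
    return "\n".join(lines)


# Hypothetical database names, for illustration only.
print(mirroring_failover_script(["Sales", "Inventory"]))
```

Failing back is the same script run on the other server once it has become the principal.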

Would I jump ship and upgrade just to get AOAG? No. That said, Mirroring is deprecated, so when the normal upgrade cycle for your environment comes up, that will be the time to move.

Thanks Microsoft...
