Wednesday, May 06, 2009

More SQL 2008 Clustering

So it has been a while since I've posted because, well, I had little to report. I have discovered a couple of odd little things that the Upgrade Advisor missed when upgrading from a SQL 2000 server to a SQL 2008 server, but nothing to write home about.

Then we had the oddest intermittent problem. Every so often (about every 2 weeks) first one node of the cluster, then the second node, would stop accepting Remote Desktop (RDC) logins. It was odd because the file system was still serving up files, and SQL Server was still serving up data.

I uncovered this because I was seeing odd behavior in SSMS while editing linked servers. (Yeah, ick, linked servers. I inherited them and have a couple thousand stored procs to remove them from...)

So I decided to RDC into the cluster to see whether I would get the same error using the tools on the cluster itself. Instead, I was unable to log in. I had the network admin, Mordac, try to log in with his account, then we both tried the local admin, and finally I had him use the IP KVM and attach to the console. No dice.

I figured it was a fluke, so we rebooted and took a look at the logs. There was nothing in there. Zip, zilch, nada. So I did some googling and MSDN searches. Nothing. OK, so we let it slide, tried to monitor the box better, and kept on trucking.

2 weeks later, after RDCing into the boxes every morning, same issue. This time when we brought the server back up the logs were corrupted. There was nothing to see and we had lots of work to do, so we waited.

2 weeks later it happened again. No logs. So we called Microsoft. They hadn't a clue. They looked at our configuration, our network, our clustering and our logs. Nothing. I was instructed to capture perfmon data, so I set up perfmon with a 250 MB limit per their instructions, and we waited.

We know it isn't load induced, because we put huge loads on the active node via SQL and via a 'burn in' load tool, and it was the passive node that bombed, but only after a weekend of no activity.

Good news: we got perfmon data. Bad news: the circular log limit failed and I got 12.8 GB of data. Too much, not gonna work.

So, we are back at square one. I've burned the servers down and we are reinstalling. I have a single node up, in clustered mode, with DTC and SQL Server.

I still had to repair the stinkin SQL install... I need to know what I did wrong. Perhaps I should add DTC as my SQL instance, then install SQL into the DTC. I'll try that on my next cluster which is just down the road and let you all know how poorly that went.

Oh, and there was that one minor issue we had that was interesting. From a SQL 2000 server, when running a query via a linked server to the SQL 2008 cluster, I would get this error:

OLE DB provider 'MSSQL' returned an unexpected data length for the fixed-length column '[DATABASE]..[NAME].[TABLE].FIELD'

It took a while, but I found three fixes. The first was a hotfix for the SQL 2000 server, which wasn't an option because I couldn't get it to install on my test server. The second was a query hint, which also failed.

HOWEVER, the third one worked: by running DBCC TRACEON(8765) on the 2000 server that is running the query against the 2008 box, all is well. No code changes needed and we are set to fly.
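
For anyone who wants to try the same thing, here is a rough sketch of what that looks like on the SQL 2000 side. The linked server, database and table names below are made up for illustration. I am also assuming you want the flag on for every connection, which is what the -1 argument is for; without it the flag should only apply to the connection that ran the command.

```sql
-- Run on the SQL 2000 server that issues the linked server query.
-- Assumption: -1 turns the flag on for all connections until the service restarts.
DBCC TRACEON (8765, -1)

-- List the trace flags that are currently enabled globally.
DBCC TRACESTATUS (-1)

-- Hypothetical linked server query to re-test with.
-- SQL2008CLUSTER, SomeDatabase and SomeTable are placeholder names.
SELECT TOP 10 *
FROM SQL2008CLUSTER.SomeDatabase.dbo.SomeTable
```

A flag set this way does not survive a restart of the SQL Server service. If you need it to stick, adding -T8765 to the startup parameters of the SQL 2000 instance should do it, but test that on something other than production first.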
