Friday, September 18, 2009

SQL Server, Parallelism and Simple Queries

We all marveled when Martin wrote a seemingly simple query that brought our OLTP cluster to its knees. How on earth did he manage to spawn so many threads and peg all 8 cores of our massive new cluster server? We were equally amazed when JB created a near-perfect square-wave pattern on all 8 cores using a much more complex query. And we have seen queries that run quickly in Dev yet grind to a halt on our supposedly fast cluster.

What could possibly cause all of these problems?

One word, Parallelism.

What, pray tell, is Parallelism?
Parallelism is when SQL Server splits a single query across multiple worker threads and runs those threads on different CPUs.

What causes SQL Server to decide to use Parallelism?
The Query Optimizer looks at the statistics, the server's configuration, and (to a degree) the current server load to decide whether to split a query across many threads.
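For reference, the server-level settings the optimizer works against can be checked with a quick query; this assumes SQL Server 2005 or later, where sys.configurations is available:

```sql
-- Current parallelism-related configuration values
SELECT name, value_in_use
FROM sys.configurations
WHERE name IN ('cost threshold for parallelism',
               'max degree of parallelism',
               'max worker threads');
```

The value_in_use column shows what the server is actually running with, as opposed to a configured value awaiting RECONFIGURE.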

What causes this to be so slow?
The act of splitting up a query that was written to execute serially, spawning threads, marshalling those threads, and then unifying the results takes time. Sometimes more time than it would take to just let the query run on one thread. I know the first sub-question that comes to mind is “Well, if one CPU and one thread are good, aren’t 8 CPUs and 255 worker threads that much better?” The answer is… it depends.

Does Parallelism cause any other issues?
In a word, yes. Parallelism can allow even “simple” queries to run away with the processors and leave too few cycles for the server’s primary function. On an OLTP system like ours, it is imperative that some processing power is left over for end-user requests, even in the midst of big queries and bulk activities. When all 8 cores are plugged up with one query, it can be quite the challenge to get any other code to run on that server at the same time.
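One quick way to catch this happening live is to look for requests holding more than one worker thread. A sketch, assuming SQL Server 2005 or later:

```sql
-- Requests currently using more than one worker thread,
-- i.e. queries that are executing in parallel right now
SELECT r.session_id,
       COUNT(t.task_address) AS task_count,
       r.cpu_time,
       r.status
FROM sys.dm_exec_requests AS r
JOIN sys.dm_os_tasks AS t
    ON r.session_id = t.session_id
GROUP BY r.session_id, r.cpu_time, r.status
HAVING COUNT(t.task_address) > 1;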

What can be done about this?
In my travels I have seen several opinions on what to do about this problem. Of course opinions are like noses and everybody’s smells but I figured I would distill down the 3 major camps.

Camp 1: Your code is bad, Parallelism is good. Rework your code
While this camp makes a valid point, and yes, we have code that is not optimal, this is a rather unrealistic and flawed position. What IT shop has spare development cycles to go back and fix all of its code in bulk? And since this issue doesn’t always show up on the Dev server, regardless of configuration, attempting a bulk fix of all queries is an exercise in futility and time burning. Besides, fixing our code on the servers doesn’t solve a major issue on OLTP servers: ad-hoc queries that run away with the processors.

Camp 2: Parallelism is evil, turn it off
This camp has an equally valid point when its views are aimed at an OLTP environment that sees no bulk activities, has high transaction volume, and does few large data reads. But this position is equally flawed. Why? Because on a general-purpose server, even of the OLTP type, there are cases where Parallelism can prove beneficial and actually improve performance.

Camp 3: Parallelism is like all tools and can be used for good and evil
This camp takes a pragmatic approach to the problem: while in some cases Parallelism is the bees’ knees, the best thing since sliced bread and all that, it can also be the bane of the server’s existence. This camp seeks to utilize Parallelism, limit it to the queries that actually need it, and mitigate the risk of an ad-hoc query running away. This DBA also knows it is too much to ask developers to be on their best behavior and write every ad-hoc query with hints in it to limit the damage.

So, to be transparent: it is clear I belong to Camp 3. But how does one allow the potentially damaging and evil Parallelism to co-exist with queries that shouldn’t utilize it AND with OLTP functions that shouldn’t be interrupted?

How to manage Parallelism
There are 3 tools in the toolbox that can be deployed to try to bring balance to the Server.

First, tweak the cost threshold for parallelism. Microsoft’s default is 5. Despite the folklore, this is not wall-clock seconds: it is the optimizer’s estimated cost for a serial plan, a unit loosely derived from seconds of execution on an old reference machine. Any query whose estimated cost exceeds the threshold is fair game for a parallel plan. I have already upped this number to 10 based upon a suggestion on MSDN, and am looking into boosting it further to 15. Raising this value lets me keep the smaller queries from using Parallelism without disabling it for the genuinely expensive ones.
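To help calibrate the threshold, it is worth seeing which cached plans actually went parallel. This is a sketch against the plan cache, assuming SQL Server 2005 or later; the XML namespace is the standard showplan namespace:

```sql
-- Top cached plans containing at least one parallel operator,
-- ordered by total CPU time consumed
WITH XMLNAMESPACES (DEFAULT 'http://schemas.microsoft.com/sqlserver/2004/07/showplan')
SELECT TOP (20)
       qs.execution_count,
       qs.total_worker_time,
       qp.query_plan
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_query_plan(qs.plan_handle) AS qp
WHERE qp.query_plan.exist('//RelOp[@Parallel="1"]') = 1
ORDER BY qs.total_worker_time DESC;
```

Opening the query_plan XML shows each plan's estimated subtree cost, which is the number the threshold is compared against.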

Second, tweak the max degree of parallelism. Microsoft’s default for this option is 0, which means any single query is allowed to spawn threads on every core in the server. As we have an 8-core server, that means a single query can consume all 8 cores.
There is one more tweak that could be made, however it is not recommended: limiting max worker threads. The default was 255 in SQL Server 2000; in SQL Server 2005 and later the default is 0, which lets the server size the thread pool automatically. There have been rare instances where lowering this number has proved beneficial, but it is not recommended.
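If you are curious what the automatically sized thread pool actually came out to, a DMV exposes it (SQL Server 2005 and later):

```sql
-- Size of the worker thread pool the server computed at startup
SELECT max_workers_count
FROM sys.dm_os_sys_info;
```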

Course of action

Monitor the cost threshold and tweak where needed. This can be done with system tools in SQL Server.

Limit the number of cores a single query can use. I suggest allowing 25%-50% (2 or 4 cores) so that OLTP transactions can always get through. However, even with the limit set to 2, it only takes 4 ‘simple queries’ running at once to bring down the server. Today it only takes 1.

So a limit of 2 cores means 4 simple queries can saturate the server; a limit of 4 cores means we can only absorb 2 nasty queries at once.

Our server nominally uses between 10-20% of its total processing power at any given time during the day. There are bursts of batch, bulk, and other traffic that see spikes up to 50-70% and the odd 100% flood that comes from a few sources.

Each core is 12.5% of the total processing power of the server. Based upon that, I would say max degree of parallelism should be set to 4, meaning that at most a query can consume 4 CPU cores at 100% and leave the other 4 for normal processing. This isn’t the most protected position, and leaves the server vulnerable to not being able to handle 2 big queries at once. Conversely, this change could cause some very complex queries to take longer due to less processing power being assigned to them.

Side note: the server MAXDOP setting can be overridden in individual queries using a MAXDOP hint. Think of this as the “nice” feature in SQL Server. It works both ways: you can force a query to use more cores than the server limit if you are sure it will not cause major issues, but this should be the exception, not the rule.

I also thought it would be nice to include a little code....
Adjust cost threshold

-- Enable advanced options so the parallelism settings are visible
EXEC sp_configure 'show advanced options', 1;
GO
RECONFIGURE;
GO

-- Raise the threshold so only expensive queries are eligible for parallelism
EXEC sp_configure 'cost threshold for parallelism', 10;
GO
RECONFIGURE;
GO


Adjust Max degree of parallelism

-- Enable advanced options so the parallelism settings are visible
EXEC sp_configure 'show advanced options', 1;
GO
RECONFIGURE WITH OVERRIDE;
GO

-- Cap any single query at 4 cores
EXEC sp_configure 'max degree of parallelism', 4;
GO
RECONFIGURE WITH OVERRIDE;
GO

Manually override via a query hint:

SELECT stuff
FROM table
GROUP BY columns
ORDER BY column
OPTION (MAXDOP 1);


Tuesday, September 15, 2009

I've lost my memory.......SQL Server VAS memory fragmentation issue

The SQL Server cluster has been having regular backup failures, with Red Gate Backup reporting a VDI error. After extensive research, the cause turns out to be memory fragmentation.
Long story short: VDI (Microsoft’s Virtual Device Interface) is used by pretty much all 3rd-party backup utilities to back up databases. VDI is only allowed to use a narrow portion of memory outside the buffer pool, a region of the Virtual Address Space (VAS) also known as MemToLeave. The default amount of VAS space reserved for this in SQL Server is between 128 MB and 256 MB, depending… I’ve found that number varies from source to source.
There are 4 major contributors to VAS memory issues in SQL Server:
1) Extended Stored Procedures with memory leaks
2) Use of linked servers
3) COM integration via Stored Procedures
4) Complex Queries that generate Execution plans of greater than 8K and cause paging
How does this impact us?
Glad you asked. What is happening is that our VAS memory is being fragmented. Why is fragmentation a problem? VDI requires several contiguous memory blocks of specific sizes in order to buffer the data and perform its task — in this case backing up, compressing, and encrypting data.
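One rough way to gauge the fragmentation is to look at the largest contiguous free block left in the VAS. This sketch uses the undocumented sys.dm_os_virtual_address_dump DMV (SQL Server 2005/2008); the column names and the free-region test are assumptions based on how that DMV is commonly queried, so verify on your own build before trusting the numbers:

```sql
-- Largest contiguous free region left in the virtual address space.
-- Regions whose allocation base address is 0x0 are treated as free.
SELECT MAX(region_size_in_bytes) / 1024 AS largest_free_block_kb
FROM sys.dm_os_virtual_address_dump
WHERE region_allocation_base_address = 0x0;
```

If that number drops below what VDI needs for its buffers, the backups start failing.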

Steps to solve:
1) Use the –g startup parameter for SQL Server to up the VAS (MemToLeave) allocation to 512 MB. This will give us a bigger area to play in and, hopefully, more contiguous memory blocks. This change is recommended by Microsoft for database servers that have over 4 GB of RAM and 75+ databases. We have 32 GB of RAM and over 50 databases, so it is a needed change soon anyway.
2) Investigate ways to eliminate the one extended stored procedure (XP) I know we are using. This is an SMTP XP used for alerting. While it has been in use for years with no negative impact, I have no doubt my friends at Microsoft would ask me to remove it… just to rule it out as a problem.
3) Verify that the only place I can find that uses COM in a stored procedure has been commented out. I had to write a handy-dandy script to ferret that sucker out; I may share it in the future.
4) Reduce the number of queries that attach to the cluster via linked server. These come from two places:
    a) Reports on the report server that simply call an SP on the cluster
    b) Old SP code that references linked servers on the cluster itself
5) Code-review any stored procedures released to the cluster within the past couple of weeks to see if we can simplify any of the execution plans. Why only the last couple of weeks? Because this is a change in how the cluster behaves. We never had any of these issues before, so what changed? I’m guessing our intern released some nasty code on the sly…
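For steps 2 through 4, a few inventory queries can shortcut the hunting. This is not the script mentioned above, just a sketch assuming SQL Server 2005 or later; the sp_OACreate pattern is only one example of the sp_OA* family to search for:

```sql
-- 1) Extended stored procedures registered on the instance
SELECT name
FROM master.sys.all_objects
WHERE type = 'X';

-- 2) Stored procedure text that calls the sp_OA* COM automation procs
SELECT OBJECT_NAME(object_id) AS proc_name
FROM sys.sql_modules
WHERE definition LIKE '%sp_OACreate%';

-- 3) Linked servers defined on the instance
SELECT name
FROM sys.servers
WHERE is_linked = 1;
```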

This is a fairly recent issue as the cluster worked for a number of weeks without any issues at all. This has really crept in during the last 2 weeks.
Now, I could go into the depth of detail that some grumpy DBAs and other MVPs have gone into, but really, that data is out there. This is just my experience with this issue, and because all of that material is so readily available and complex (and I stand a good chance of getting it wrong in my translation), I’ll just let Google do the walking for you!
