Friday, September 18, 2009

SQL Server, Parallelism and Simple Queries

We all marveled when Martin wrote a seemingly simple query that brought our OLTP cluster to its knees. How on earth did he manage to spawn so many threads and peg all 8 cores of our massive new cluster server? We were equally amazed when JB produced a near-perfect square wave pattern across all 8 cores with a much more complex query. And we have all seen queries that run quickly in Dev, yet grind to a halt on our supposedly fast cluster.

What could possibly cause all of these problems?

One word: Parallelism.

What, pray tell, is Parallelism?
Parallelism is when SQL Server allows a single query to spawn multiple worker threads and run those threads on different CPUs at the same time.

What causes SQL Server to decide to use Parallelism?
The Query Optimizer looks at the statistics, the configuration of the server, and (to a degree) the server load to decide whether to split a query across many threads.
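The two server-level inputs to that decision are easy to inspect. A quick sketch, assuming SQL Server 2005 or later (which exposes the sys.configurations catalog view):

-- Show the two parallelism-related server settings and whether the
-- running value differs from the configured value.
SELECT name, value, value_in_use
FROM sys.configurations
WHERE name IN ('cost threshold for parallelism',
               'max degree of parallelism');

We will come back to what both of these settings mean below.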

What causes this to be so slow?
The act of splitting up a query written to be executed serially, spawning threads, marshalling those threads, and then unifying the results takes time. Sometimes more time than it would take to just let the query run on one thread. I know the first sub-question that comes to mind is "Well, if one CPU and one thread is good, aren't 8 CPUs and 255 worker threads that much better?" The answer is… it depends.

Does Parallelism cause any other issues?
In a word, yes. Parallelism can allow even "simple" queries to run away with the processors and leave too few cycles for the server's primary function. On an OLTP system like ours, it is imperative that some processing power is left over for end-user requests, even in the midst of big queries and bulk activities. When all 8 cores are plugged up with one query, it can be quite the challenge to get any other code to run on that server at the same time.
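One way to catch this in the act is to count the worker tasks each active request is using; more than one task per session means a parallel plan is running right now. A sketch, assuming SQL Server 2005 or later and VIEW SERVER STATE permission:

-- Active requests using more than one worker task (i.e. running in parallel).
SELECT r.session_id,
       COUNT(*)             AS task_count,
       r.cpu_time,
       r.total_elapsed_time
FROM sys.dm_exec_requests AS r
JOIN sys.dm_os_tasks      AS t
  ON t.session_id = r.session_id
GROUP BY r.session_id, r.cpu_time, r.total_elapsed_time
HAVING COUNT(*) > 1
ORDER BY task_count DESC;

Run this while the server is struggling and the offender tends to jump right out at you.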

What can be done about this?
In my travels I have seen several opinions on what to do about this problem. Of course, opinions are like noses and everybody's smells, but I figured I would distill them down to the 3 major camps.

Camp 1: Your code is bad, Parallelism is good. Rework your code
While this camp makes a valid point, and yes, we have code that is not optimal, this is a rather unrealistic and flawed position. What IT shop has spare development cycles to go back and fix all of its code in bulk? AND since this issue sometimes doesn't even show up on the Dev server, regardless of configuration, attempting a bulk fix of all queries is an exercise in futility and burned time. Besides, fixing our own code doesn't solve a major issue on OLTP servers: ad-hoc queries that run away with the processors.

Camp 2: Parallelism is evil, turn it off
This camp has an equally valid point when its views are aimed at an OLTP environment that sees no bulk activity, whose transaction volume is high, and whose data reads are low. This position is equally flawed, however. Why? Because on a general-purpose server, even of the OLTP type, there are cases where Parallelism can prove beneficial and actually improve performance.

Camp 3: Parallelism is like all tools and can be used for good and evil
This camp takes a pragmatic approach to the problem: while in some cases Parallelism is the bee's knees, the best thing since sliced bread and all that, it can also be the bane of the server's existence. This camp seeks to utilize Parallelism, limit it to only the queries that actually need it, and mitigate the risk of an ad-hoc query running away. This DBA also knows it is too much to ask developers to be on their best behavior and write their ad-hoc queries with hints that limit the damage.

I'll be transparent: it should be clear by now that I belong to Camp 3. But how does one allow the potentially damaging and evil Parallelism to co-exist with queries that shouldn't utilize it, AND with OLTP functions that shouldn't be interrupted?

How to manage Parallelism
There are 3 tools in the toolbox that can be deployed to try to bring balance to the Server.

First, tweak the cost threshold for parallelism. Microsoft's default is 5, and despite the folklore this is not 5 seconds of actual run time: it is the optimizer's estimated cost for a serial plan, a unitless number historically based on seconds on a reference machine. Any query whose estimated serial cost exceeds the threshold is fair game for parallelism. I have already upped this number to 10 based upon a suggestion on MSDN, and am looking into boosting it further to 15. What this value lets me do is keep the smaller queries from using Parallelism without disabling it for genuinely expensive queries.

Second, tweak the max degree of parallelism. Microsoft's default for this option is 0, which means any single query is allowed to spawn threads on, and use, all cores on the server. Since we have an 8-core server, that means a single query can occupy all 8 cores.
There is one more tweak that could be made, though it is not recommended: limiting Max Worker Threads. The historical default is 255 (SQL Server 2005 and later default this to 0, which sizes the thread pool automatically), and while there have been rare instances where lowering this number proved beneficial, it is generally best left alone.

Course of action

Monitor the cost threshold and tweak where needed. This can be done with the system tools in SQL Server.

Limit the number of cores a single query can use. I suggest allowing 25%-50% (2 or 4 cores) so that OLTP transactions can always get through. However, even with the limit set to 2, it only takes 4 ‘simple queries’ running at once to plug up the server. At this point it only takes 1.
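As for the monitoring itself, one rough heuristic from the DMVs: a cached query whose total CPU (worker) time exceeds its total elapsed time must have been burning more than one core at once, i.e. it ran with a parallel plan. A sketch, assuming SQL Server 2005 or later:

-- Cached queries that likely ran with parallel plans: worker (CPU) time
-- greater than elapsed time implies more than one core at a time.
SELECT TOP 10
       qs.total_worker_time,
       qs.total_elapsed_time,
       qs.execution_count,
       st.text AS query_text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
WHERE qs.total_worker_time > qs.total_elapsed_time
ORDER BY qs.total_worker_time DESC;

If harmless little queries keep showing up in this list, the cost threshold is probably still too low.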

So a 2-core limit means 4 simple queries can saturate the server; a 4-core limit means just 2 nasty queries can.

Our server nominally uses between 10% and 20% of its total processing power at any given time during the day. There are bursts of batch, bulk, and other traffic that spike up to 50-70%, and the odd 100% flood that comes from a few sources.

Each core is 12.5% of the total processing power of the server. Based on that, I would set max degree of parallelism to 4, meaning that at most a query can consume 4 CPU cores at 100% and leave the other 4 for normal processing. This isn't the most protected position, and it leaves the server vulnerable to being unable to absorb 2 big queries at once. Conversely, this change could make some very complex queries take longer, since less processing power is assigned to them.

Side note: the server MAXDOP setting can be overridden in individual queries using a MAXDOP query hint. Think of this as the "nice" feature of SQL Server. It works both ways: you can restrict a query to fewer cores than the limit, or allow it more cores than the limit if you are sure it will not cause major issues. The latter should be the exception, not the rule.

I also thought it would be nice to include a little code....
Adjust cost threshold

sp_configure 'show advanced options', 1;
GO
RECONFIGURE;
GO

sp_configure 'cost threshold for parallelism', 10;
GO
RECONFIGURE;
GO


Adjust Max degree of parallelism

sp_configure 'show advanced options', 1;
GO
RECONFIGURE;
GO

sp_configure 'max degree of parallelism', 4;
GO
RECONFIGURE;
GO

Manually override via query hint:

SELECT stuff
FROM table
GROUP BY columns
ORDER BY column
OPTION (MAXDOP 1);
