Thursday, September 14, 2017

Database Programming: Game, Set, Match — Why Writing Code In Transact-SQL Is Different Than Writing In A Compiled Language

More tidbits from an internal discussion.  It'll take us a while to get to the answer to the titular question; hopefully you'll find this a worthy journey.

Mike asked, in a nutshell, why this batch is taking so long to run:


SET NOCOUNT ON
DECLARE @Rows AS INTEGER
SET @Rows = 1
WHILE @Rows < 1000000
BEGIN
    INSERT INTO SomeTable VALUES (NEWID())
    SET @Rows = @Rows + 1
END

Mike saw this batch taking over 15 minutes to run on his single-proc sandbox; while he was expecting it to be I/O bound, Profiler reported low I/O rates and high CPU utilization.
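(One note for anyone reproducing these timings: SomeTable's definition never made it into the thread.  Any single character column wide enough to hold a GUID string will run every batch in this post, including the literal-value test further down; a minimal, assumed definition might look like this:)


-- Assumed definition of SomeTable; the original thread never showed it.
-- A character column wide enough for a GUID string accepts both NEWID()
-- and the literal '1' used in one of the tests below.
CREATE TABLE SomeTable
(
    uuid varchar(36) NOT NULL
)
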

Several folks on the alias ran some benchmarks; I'll reproduce mine here.

The first question was: is this a flow-of-control issue?  Mike's syntax above ran in 5:08 (five minutes, eight seconds) on my sandbox (it appears, at minimum, that Mike needs a new sandbox).  To isolate flow of control, I simply commented out the INSERT statement:


SET NOCOUNT ON
DECLARE @Rows AS INTEGER
SET @Rows = 1
WHILE @Rows < 1000000
BEGIN
    --  INSERT INTO SomeTable VALUES (NEWID())
    SET @Rows = @Rows + 1
END

This syntax ran in 0:02 on my sandbox (single 3.2 GHz CPU, no HT; 1 GB RAM).  So, flow of control isn't the issue.
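
If you want to time these batches yourself, a minimal wall-clock wrapper (a sketch; use whatever harness you like, this isn't necessarily how my numbers were taken) looks like this:


-- Minimal wall-clock timer around a batch under test (an assumed harness).
DECLARE @start datetime
SET @start = GETDATE()

-- ... paste the batch under test here ...

SELECT DATEDIFF(second, @start, GETDATE()) AS ElapsedSeconds
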

My next test was to wrap Mike's existing code in a transaction:


SET NOCOUNT ON
DECLARE @Rows AS INTEGER
SET @Rows = 1
BEGIN TRAN
WHILE @Rows < 1000000
BEGIN
    INSERT INTO SomeTable VALUES (NEWID())
    SET @Rows = @Rows + 1
END
COMMIT TRAN

This code ran in 0:44, which suggests to me that a good deal of the overhead here comes from transaction control; the lesson is that a single 1M-row transaction is cheaper than 1M single-row transactions.  This makes sense intuitively, as we're saving 999,999 COMMITs.
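
If holding a single million-row transaction open makes you nervous (log growth, blocking), a common middle ground is to commit every N rows; here's a sketch of that variation with an arbitrary batch size of 10,000:


-- Commit every @BatchSize rows: far fewer COMMITs than one per row,
-- without holding a single million-row transaction open the whole time.
-- The 10,000 batch size is arbitrary; tune to taste.
SET NOCOUNT ON
DECLARE @Rows AS INTEGER
DECLARE @BatchSize AS INTEGER
SET @Rows = 1
SET @BatchSize = 10000

BEGIN TRAN
WHILE @Rows < 1000000
BEGIN
    INSERT INTO SomeTable VALUES (NEWID())
    SET @Rows = @Rows + 1

    IF @Rows % @BatchSize = 0
    BEGIN
        COMMIT TRAN
        BEGIN TRAN
    END
END
COMMIT TRAN


I haven't timed that variant here; the point is only that transaction scope is a knob you control.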

I then wondered what impact the function call was having on our scenario, so I replaced the NEWID() call with a literal:


SET NOCOUNT ON
DECLARE @Rows AS INTEGER
SET @Rows = 1
BEGIN TRAN
WHILE @Rows < 1000000
BEGIN
    INSERT INTO SomeTable VALUES ('1')
    SET @Rows = @Rows + 1
END
COMMIT TRAN

This code ran in 0:36, which suggests that the NEWID() call carries some overhead of its own; that seems intuitively reasonable.
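
To sanity-check that reading, you can run the same loop with the NEWID() call but no INSERT at all (a diagnostic sketch, not something from the original thread):


-- Evaluate NEWID() on every pass but never touch the table, to isolate
-- the cost of the function call from the cost of the INSERT itself.
SET NOCOUNT ON
DECLARE @Rows AS INTEGER
DECLARE @g AS UNIQUEIDENTIFIER
SET @Rows = 1
WHILE @Rows < 1000000
BEGIN
    SET @g = NEWID()
    SET @Rows = @Rows + 1
END
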

At this point, along came Shaun with a technique discovered in Dr. SQL's blog (and originated by Joe Celko):


with digits(i) as (
select 1 as i 
union all 
select 2 
union all 
select 3 
union all
select 4 
union all 
select 5 
union all 
select 6 
union all
select 7 
union all 
select 8 
union all 
select 9 
union all
select 10
),
-- generate 1M rows each with a unique row number, i
sequence(i) as 
(
select d1.i + (10*d2.i) + (100*d3.i) + (1000*d4.i) + (10000*d5.i) + (100000*d6.i)
from digits as d1, digits as d2, digits as d3, digits as d4, digits as d5, digits as d6
)
insert into SomeTable
select newid() as uuid
from sequence


Well, folks, this little nugget ran in 0:03 on my sandbox.  Three seconds!  One-twentieth of a minute.

NOW we're finally ready to answer the titular question, because this insight provides a stark example of why programming in T-SQL is different than programming in a typical 3GL or 4GL.

nGLs are procedural animals.

T-SQL is a set-based animal.

This example shows that, in T-SQL, no matter what sort of gyrations you have to go through to generate a set of data, you'll get better performance working in a set-based metaphor than in a procedural one.

By building a set of one million unique integers rather than counting to one million one at a time procedurally, we gain an order-of-magnitude improvement over our best procedural alternative.  Shaun's CTE avoids roughly 2M statement executions (1M INSERTs and 1M SETs), replacing them with a single INSERT...SELECT.
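
Shaun's digits CTE is only one way to manufacture the set.  As a point of comparison (a sketch of the same set-based idea, not code from the thread), you can get the same effect by cross joining a small constant CTE against itself and numbering the rows:


-- Alternative set-based generator: a ten-row CTE cross joined against
-- itself six times yields 1,000,000 rows; ROW_NUMBER() gives each one a
-- unique integer, and a single INSERT...SELECT loads the GUIDs.
;WITH ten(n) AS
(
    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
    UNION ALL
    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
),
million(i) AS
(
    SELECT TOP (1000000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
    FROM ten AS a, ten AS b, ten AS c, ten AS d, ten AS e, ten AS f
)
INSERT INTO SomeTable
SELECT NEWID()
FROM million


Whether that beats Shaun's version is beside the point; both express the load as one set-based statement rather than a loop.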

To a T-SQL programmer who groks set-based thinking, Shaun's syntax has that special sparkle of brilliance about it.  To a practitioner of procedural coding, the approach might seem bizarre, or at the least contrived.

Well, last week, when we were talking about FOR XML EXPLICIT syntax, I wrote:


It seems to me that any syntax that produces the desired result with a close-to-optimal query plan is "proper", and that any further fine tuning devolves quickly into the relative merits of personal style.

Not only do I not find Shaun's code contrived in the least; when I filter it through the statement above, I find it to be the one true path in this scenario, in that it leverages T-SQL's strength in handling sets of data.  Its query plan is far closer to optimal than any of the plans we generated procedurally, and therein lies the rub.

For my money, this is why good nGL programmers (generally) make bad T-SQL programmers, and vice versa.  It's very difficult (painful at times, in my case) to veer between those two worlds.

So, since this is, after all, a SQL Server blog, I say (with a respectful tip of the hat to our procedural brethren), "Long live the set-based thinkers!"

Do you think you can get 1M distinct GUIDs into a table in under three seconds?  Drop me a line with your syntax, or any other comments you might have.
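
(If you do take that challenge up, a quick sanity check on your result, assuming the single uuid column from the assumed table definition near the top of this post, is:)


-- Confirm the row count and that every GUID landed distinct.
SELECT COUNT(*) AS TotalRows,
       COUNT(DISTINCT uuid) AS DistinctGuids
FROM SomeTable
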
