October 05, 2009 05:05 PM

Solutions to T-SQL Challenge - Efficient Partitioned TOP

Rating: (0)
SQL Server Magazine
InstantDoc ID #102909

Last week I provided a T-SQL challenge entitled “Efficient Partitioned TOP” involving finding an efficient solution to a given task when you cannot create new indexes on the queried table. You can find the puzzle details here. I gave code to populate the queried table T1 with 10,000,000 rows of sample data. I provided the following solution:

WITH C AS

(

  SELECT grp, col1, col2,

    ROW_NUMBER() OVER(PARTITION BY grp ORDER BY col1 DESC, col2) AS rownum

  FROM dbo.T1

)

SELECT grp, col1, col2

FROM C

WHERE rownum = 1;

 

When there’s no index on (grp, col1 DESC, col2) the execution plan for this solution involves sorting the 10,000,000 rows, which is very expensive. Here are the performance measures I got for this solution on my test machine:

Elapsed time: 83221, CPU: 120167, Logical Reads: 42472

The data is scanned once, but the sort operation contributes to the majority of the cost of the query. The challenge was to find a more efficient solution assuming creating a new index is not an option.

Thanks to all those who submitted solutions: Peter Larsson, Plamen Ratchev, Steve Kass, MuadDib, Rudolf, Paula North, Rob Farley, Regan Wick, Adriaan Simons, and cwestervelt.

Most people (Peter Larsson, Plamen Ratchev, MuadDib, Rudolf, Paula North, Regan Wick, Adriaan Simons, cwestervelt) submitted a pretty straightforward solution based on grouping, joining, and then grouping again:

SELECT A.grp, A.col1, MIN(B.col2) AS col2

FROM ( SELECT grp,

         MAX(col1) AS col1

       FROM  dbo.T1

       GROUP BY grp ) AS A

  JOIN dbo.T1 AS B

    ON B.grp = A.grp

   AND B.col1 = A.col1

GROUP BY A.grp, A.col1;

 

The performance measures I got for this solution are:

Elapsed time: 30798, CPU: 52961, Logical Reads: 84584

The plan for this solution shows that the data is scanned twice, but still it is substantially faster than the solution based on the ROW_NUMBER function because it doesn’t involve sorting. Both the aggregations and the join are handled with hashing. With such a large number of rows, handling the aggregations with hashing is more efficient than sorting, albeit the extra scan of the data.

Peter Larsson also provided a couple of other solutions based on similar logic. One using the APPLY operator:

SELECT A.grp, A.col1, B.col2

FROM ( SELECT grp,

         MAX(col1) AS col1

       FROM  dbo.T1

       GROUP BY grp ) AS A

  CROSS APPLY ( SELECT     MIN(C.col2) AS col2

                       FROM dbo.T1 AS C

                       WHERE    C.grp = A.grp

                            AND C.col1 = A.col1) AS B;

 

Elapsed time: 29311, CPU: 52619, Logical Reads: 84540

And another using correlated subqueries with aggregate functions:

SELECT DISTINCT A.grp, A.col1, A.col2

FROM dbo.T1 AS A

WHERE A.col1 = (SELECT MAX(col1)

                FROM dbo.T1 AS B

                WHERE B.grp = A.grp)

  AND A.col2 = (SELECT MIN(col2)

                FROM DBO.T1 AS C

                WHERE C.grp = A.grp AND c.col1 = a.col1);

 

Elapsed time: 32623, CPU: 60185, Logical Reads: 126864

Steve Kass, Rob Farley and I used a completely different approach to solving the problem. You use some method to concatenate the values of col1 and col2 after some manipulation that inverses the comparison/ordering behavior of col2. You then group the rows by grp and calculate the maximum of the concatenated values (call the result mx). Finally, you extract the two elements from mx, and apply manipulation than inverses the part representing col2 back.

Steve and Rob used numeric manipulation to handle the math, and character manipulation to handle the concatenation. Here’s Steve’s solution:

WITH C AS

(

  SELECT grp,

    MAX( CAST(5000000000.+col1 AS CHAR(10))

       + CAST(5000000000.-col2 AS CHAR(10)) ) AS mx

  FROM dbo.T1

  GROUP BY grp

)

SELECT grp,

  CAST(CAST(LEFT(mx,10) AS DECIMAL(10,0))-5000000000. AS INT) AS col1,

  CAST(5000000000.-CAST(RIGHT(mx,10) AS DECIMAL(10,0)) AS INT) AS col2

FROM C;

 

The plan for this solution involves only one scan of the data, and a hash aggregate. But due to the length and type of the aggregated element, the hash aggregate in this solution is more expensive than the hash aggregates used in the previous solutions. Here are the performance measures I got for this solution:

Elapsed time: 30906, CPU: 56597, Logical Reads: 42336

I used binary manipulation to produce a single binary string representing the concatenated col1 value and the inversed col2 value. I also added prefixes to the values to make sure that their comparison/ordering behavior will be in accord with the integer values. Here’s the solution:

WITH C AS

(

  SELECT grp,

    MAX( CASE WHEN col1 >= 0 THEN 0x01 ELSE 0x00 END + CAST(col1 AS BINARY(4))

       + CASE WHEN col2 >= 0 THEN 0x00 ELSE 0x01 END + CAST(~col2 AS BINARY(4)) ) AS mx

  FROM dbo.T1

  GROUP BY grp

)

SELECT grp,

  CAST(SUBSTRING(mx, 2, 4) AS INT) AS col1,

  ~CAST(SUBSTRING(mx, 7, 4) AS INT) AS col2

FROM C;

 

The concatenated string is shorter than in the previous solution, plus the aggregation of binary values is less expensive than aggregation of character values. Therefore, this solution is faster than the previous one. Here are the performance measures I got for it:

Elapsed time: 22183, CPU: 41089, Logical Reads: 42312

But this solution relies on the physical representation of the data—specifically the fact that SQL Server uses twos complement as the storage format for integers. The previous solution is not dependant on physical representation of the data.

Note that all performance measures I presented here are ones I got on my test machine, which has two CPU cores, 4 GB RAM, and a single hard drive. On my test machine, the last solution based on the binary concatenation technique was the fastest. Others who ran the solutions on their hardware reported different measures with different ratios. All got very good performance for the concatenation techniques, but for some, other solutions got pretty close, and in cases even slightly better performance.

Cheers,

Itzik

 

Add a Comment

My query had some unnecessary complexity. Please forgive the additional post with a cleaned-up version.

WITH bin AS (
SELECT
Grp,
D =
Max(
Convert(binary(4), CASE WHEN Col1 >= 0 THEN Col1 ^ 0x80000000 ELSE Col1 & 0x7FFFFFFF END)
+ Convert(binary(4), CASE WHEN Col2 >= 0 THEN (-Col2 - 1) & 0x7FFFFFFF ELSE -(Col2 + 1) ^ 0x80000000 END)
)
FROM dbo.T1
GROUP BY Grp
)
SELECT
Grp,
Col1 =
CASE
WHEN Substring(D, 1, 4) < 0x80000000 THEN Convert(int, Substring(D, 1, 8)) ^ 0x80000000
ELSE Convert(int, Substring(D, 1, 4)) & 0x7FFFFFFF
END,
Col2 =
CASE
WHEN Substring(D, 5, 4) < 0x80000000 THEN -(Convert(int, Substring(D, 5, 4)) ^ 0x80000000 + 1)
ELSE -(Convert(int, Substring(D, 5, 4)) & 0x7FFFFFFF) - 1
END
FROM bin

Erik11/12/2009 6:20:38 PM


sd2k01, that query totally bombed on my machine because it ends up doing 3 index scans and a very expensive Index Spool (Eager Spool)

william, nice solution. For me, it performs best of all the given queries.

Erik11/12/2009 6:20:34 PM


gosh... is code formatting possible in comments? What an ugly blob.

Erik11/12/2009 6:10:15 PM


Itzik,

I went down the binary path too (I didn't look at the answers before trying the puzzle) and got a much more cumbersome query that seems to outperform yours a little, at least on my machine. I have to say I admire your prepending of 0x01 and 0x00 to gain correct ordering. I was stuck in a "fully reversible" box. Your query is superior in my opinion because it's so much simpler, but I still have to show you this:

WITH bin AS (
SELECT
Grp,
D =
Max(
Convert(binary(8),
CASE WHEN Col1 >= 0 THEN Convert(binary(4), Col1 ^ 0x80000000) ELSE Convert(binary(4), Col1 & 0x7FFFFFFF) END
+ CASE WHEN Col2 >= 0 THEN Convert(binary(4), (-Col2 - 1) & 0x7FFFFFFF) ELSE Convert(binary(4), -(Col2 + 1) ^ 0x80000000) END
)
)
FROM dbo.T1
GROUP BY Grp
)
SELECT
Grp,
Col1 =
CASE
WHEN Convert(int, Substring(D, 1, 4)) >= 0 THEN Convert(int, Substring(D, 1, 8)) ^ 0x80000000
ELSE Convert(int, Substring(D, 1, 4)) & 0x7FFFFFFF
END,
Col2 =
CASE
WHEN Convert(int, Substring(D, 5, 4)) >= 0 THEN -(Convert(int, Substring(D, 5, 4)) ^ 0x80000000 + 1)
ELSE -(Convert(int, Substring(D, 5, 4)) & 0x7FFFFFFF) - 1
END
FROM bin

Erik11/12/2009 5:52:30 PM


The below behaves well, on my PC at least. Admittedly not _a_ query, and uses #temp tables which might be objectionable to some, but performs well.

SELECT grp, MAX(col1) AS MC1
INTO #temp
FROM dbo.T1
GROUP BY grp

SELECT TT.grp, TT.col1, MIN(TT.col2)
FROM dbo.T1 TT
INNER JOIN #temp T
ON T.grp
= TT.grp
AND T.MC1
= TT.col1
GROUP BY TT.grp, TT.col1
ORDER BY TT.grp, TT.col1

William10/31/2009 1:28:33 AM


You can cheat by relying on the randomness of the data inserted into the table an put "WHERE col1 > 95" at the end of the inner query - limiting the size of the sort. Not recommended for production systems...! :)

William10/31/2009 12:38:04 AM


Here's my solution which took 15 seconds to run on my Laptop...

;With MaxMin as
(
Select distinct grp
, (select MAX(col1) from dbo.T1 b where b.grp = a.grp group by b.grp) maxcol1
from dbo.T1 a
group by grp
)
select MaxMin.grp
, maxcol1 As MaxCol1
, MIN(col2) As MinCol2
from MaxMin, dbo.T1
where MaxMin.grp = T1.grp
and MaxMin.Maxcol1 = T1.col1
group by MaxMin.grp, Maxcol1

saurabh10/8/2009 6:19:08 AM


You must log on before posting a comment.

Are you a new visitor? Register Here
GOOGLE LINKS
SPONSORED LINKS
FEATURED LINKS