• subscribe
October 05, 2009 12:00 AM

Solutions to T-SQL Challenge - Efficient Partitioned TOP

SQL Server Pro
InstantDoc ID #102909

Last week I provided a T-SQL challenge entitled “Efficient Partitioned TOP” involving finding an efficient solution to a given task when you cannot create new indexes on the queried table. You can find the puzzle details here. I gave code to populate the queried table T1 with 10,000,000 rows of sample data. I provided the following solution:

WITH C AS

(

  SELECT grp, col1, col2,

    ROW_NUMBER() OVER(PARTITION BY grp ORDER BY col1 DESC, col2) AS rownum

  FROM dbo.T1

)

SELECT grp, col1, col2

FROM C

WHERE rownum = 1;

 

When there’s no index on (grp, col1 DESC, col2) the execution plan for this solution involves sorting the 10,000,000 rows, which is very expensive. Here are the performance measures I got for this solution on my test machine:

Elapsed time: 83221, CPU: 120167, Logical Reads: 42472

The data is scanned once, but the sort operation contributes to the majority of the cost of the query. The challenge was to find a more efficient solution assuming creating a new index is not an option.

Thanks to all those who submitted solutions: Peter Larsson, Plamen Ratchev, Steve Kass, MuadDib, Rudolf, Paula North, Rob Farley, Regan Wick, Adriaan Simons, and cwestervelt.

Most people (Peter Larsson, Plamen Ratchev, MuadDib, Rudolf, Paula North, Regan Wick, Adriaan Simons, cwestervelt) submitted a pretty straightforward solution based on grouping, joining, and then grouping again:

SELECT A.grp, A.col1, MIN(B.col2) AS col2

FROM ( SELECT grp,

         MAX(col1) AS col1

       FROM  dbo.T1

       GROUP BY grp ) AS A

  JOIN dbo.T1 AS B

    ON B.grp = A.grp

   AND B.col1 = A.col1

GROUP BY A.grp, A.col1;

 

The performance measures I got for this solution are:

Elapsed time: 30798, CPU: 52961, Logical Reads: 84584

The plan for this solution shows that the data is scanned twice, but still it is substantially faster than the solution based on the ROW_NUMBER function because it doesn’t involve sorting. Both the aggregations and the join are handled with hashing. With such a large number of rows, handling the aggregations with hashing is more efficient than sorting, albeit the extra scan of the data.

Peter Larsson also provided a couple of other solutions based on similar logic. One using the APPLY operator:

SELECT A.grp, A.col1, B.col2

FROM ( SELECT grp,

         MAX(col1) AS col1

       FROM  dbo.T1

       GROUP BY grp ) AS A

  CROSS APPLY ( SELECT     MIN(C.col2) AS col2

                       FROM dbo.T1 AS C

                       WHERE    C.grp = A.grp

                            AND C.col1 = A.col1) AS B;

 

Elapsed time: 29311, CPU: 52619, Logical Reads: 84540

And another using correlated subqueries with aggregate functions:

SELECT DISTINCT A.grp, A.col1, A.col2

FROM dbo.T1 AS A

WHERE A.col1 = (SELECT MAX(col1)

                FROM dbo.T1 AS B

                WHERE B.grp = A.grp)

  AND A.col2 = (SELECT MIN(col2)

                FROM DBO.T1 AS C

                WHERE C.grp = A.grp AND c.col1 = a.col1);

 

Elapsed time: 32623, CPU: 60185, Logical Reads: 126864

Steve Kass, Rob Farley and I used a completely different approach to solving the problem. You use some method to concatenate the values of col1 and col2 after some manipulation that inverses the comparison/ordering behavior of col2. You then group the rows by grp and calculate the maximum of the concatenated values (call the result mx). Finally, you extract the two elements from mx, and apply manipulation than inverses the part representing col2 back.

Steve and Rob used numeric manipulation to handle the math, and character manipulation to handle the concatenation. Here’s Steve’s solution:

WITH C AS

(

  SELECT grp,

    MAX( CAST(5000000000.+col1 AS CHAR(10))

       + CAST(5000000000.-col2 AS CHAR(10)) ) AS mx

  FROM dbo.T1

  GROUP BY grp

)

SELECT grp,

  CAST(CAST(LEFT(mx,10) AS DECIMAL(10,0))-5000000000. AS INT) AS col1,

  CAST(5000000000.-CAST(RIGHT(mx,10) AS DECIMAL(10,0)) AS INT) AS col2

FROM C;

 

The plan for this solution involves only one scan of the data, and a hash aggregate. But due to the length and type of the aggregated element, the hash aggregate in this solution is more expensive than the hash aggregates used in the previous solutions. Here are the performance measures I got for this solution:

Elapsed time: 30906, CPU: 56597, Logical Reads: 42336

I used binary manipulation to produce a single binary string representing the concatenated col1 value and the inversed col2 value. I also added prefixes to the values to make sure that their comparison/ordering behavior will be in accord with the integer values. Here’s the solution:

WITH C AS

(

  SELECT grp,

    MAX( CASE WHEN col1 >= 0 THEN 0x01 ELSE 0x00 END + CAST(col1 AS BINARY(4))

       + CASE WHEN col2 >= 0 THEN 0x00 ELSE 0x01 END + CAST(~col2 AS BINARY(4)) ) AS mx

  FROM dbo.T1

  GROUP BY grp

)

SELECT grp,

  CAST(SUBSTRING(mx, 2, 4) AS INT) AS col1,

  ~CAST(SUBSTRING(mx, 7, 4) AS INT) AS col2

FROM C;

 

The concatenated string is shorter than in the previous solution, plus the aggregation of binary values is less expensive than aggregation of character values. Therefore, this solution is faster than the previous one. Here are the performance measures I got for it:

Elapsed time: 22183, CPU: 41089, Logical Reads: 42312

But this solution relies on the physical representation of the data—specifically the fact that SQL Server uses twos complement as the storage format for integers. The previous solution is not dependant on physical representation of the data.

Note that all performance measures I presented here are ones I got on my test machine, which has two CPU cores, 4 GB RAM, and a single hard drive. On my test machine, the last solution based on the binary concatenation technique was the fastest. Others who ran the solutions on their hardware reported different measures with different ratios. All got very good performance for the concatenation techniques, but for some, other solutions got pretty close, and in cases even slightly better performance.

Cheers,

Itzik

 



ARTICLE TOOLS

Comments
  • Erik
    3 years ago
    Nov 12, 2009

    My query had some unnecessary complexity. Please forgive the additional post with a cleaned-up version.

    WITH bin AS (
    SELECT
    Grp,
    D =
    Max(
    Convert(binary(4), CASE WHEN Col1 >= 0 THEN Col1 ^ 0x80000000 ELSE Col1 & 0x7FFFFFFF END)
    + Convert(binary(4), CASE WHEN Col2 >= 0 THEN (-Col2 - 1) & 0x7FFFFFFF ELSE -(Col2 + 1) ^ 0x80000000 END)
    )
    FROM dbo.T1
    GROUP BY Grp
    )
    SELECT
    Grp,
    Col1 =
    CASE
    WHEN Substring(D, 1, 4) < 0x80000000 THEN Convert(int, Substring(D, 1, 8)) ^ 0x80000000
    ELSE Convert(int, Substring(D, 1, 4)) & 0x7FFFFFFF
    END,
    Col2 =
    CASE
    WHEN Substring(D, 5, 4) < 0x80000000 THEN -(Convert(int, Substring(D, 5, 4)) ^ 0x80000000 + 1)
    ELSE -(Convert(int, Substring(D, 5, 4)) & 0x7FFFFFFF) - 1
    END
    FROM bin

  • Erik
    3 years ago
    Nov 12, 2009

    sd2k01, that query totally bombed on my machine because it ends up doing 3 index scans and a very expensive Index Spool (Eager Spool)

    william, nice solution. For me, it performs best of all the given queries.

  • Erik
    3 years ago
    Nov 12, 2009

    gosh... is code formatting possible in comments? What an ugly blob.

  • Erik
    3 years ago
    Nov 12, 2009

    Itzik,

    I went down the binary path too (I didn't look at the answers before trying the puzzle) and got a much more cumbersome query that seems to outperform yours a little, at least on my machine. I have to say I admire your prepending of 0x01 and 0x00 to gain correct ordering. I was stuck in a "fully reversible" box. Your query is superior in my opinion because it's so much simpler, but I still have to show you this:

    WITH bin AS (
    SELECT
    Grp,
    D =
    Max(
    Convert(binary(8),
    CASE WHEN Col1 >= 0 THEN Convert(binary(4), Col1 ^ 0x80000000) ELSE Convert(binary(4), Col1 & 0x7FFFFFFF) END
    + CASE WHEN Col2 >= 0 THEN Convert(binary(4), (-Col2 - 1) & 0x7FFFFFFF) ELSE Convert(binary(4), -(Col2 + 1) ^ 0x80000000) END
    )
    )
    FROM dbo.T1
    GROUP BY Grp
    )
    SELECT
    Grp,
    Col1 =
    CASE
    WHEN Convert(int, Substring(D, 1, 4)) >= 0 THEN Convert(int, Substring(D, 1, 8)) ^ 0x80000000
    ELSE Convert(int, Substring(D, 1, 4)) & 0x7FFFFFFF
    END,
    Col2 =
    CASE
    WHEN Convert(int, Substring(D, 5, 4)) >= 0 THEN -(Convert(int, Substring(D, 5, 4)) ^ 0x80000000 + 1)
    ELSE -(Convert(int, Substring(D, 5, 4)) & 0x7FFFFFFF) - 1
    END
    FROM bin

  • William
    3 years ago
    Oct 31, 2009

    The below behaves well, on my PC at least. Admittedly not _a_ query, and uses #temp tables which might be objectionable to some, but performs well.

    SELECT grp, MAX(col1) AS MC1
    INTO #temp
    FROM dbo.T1
    GROUP BY grp

    SELECT TT.grp, TT.col1, MIN(TT.col2)
    FROM dbo.T1 TT
    INNER JOIN #temp T
    ON T.grp
    = TT.grp
    AND T.MC1
    = TT.col1
    GROUP BY TT.grp, TT.col1
    ORDER BY TT.grp, TT.col1

You must log on before posting a comment.

Are you a new visitor? Register Here
  • SP1?
    I know there is a SP1 for SQL 2008 R2 available....and there is a "feature pack" as well... ...
  • SQL database mirroring
    I have SQL Server 2008 R2 Enterprise 64bit on Windows 2008 R2 Enterprise 64bit.  Each SQL Server has...
  • Dell Compellent Disk Drive
    Does anybody has experience with Dell Compellent Disk Drive? Basically, this system manages all disk...
  • Sql server performance tuning
    I need to find a tool that help me to optimize sql server,queries,improve the performance and solve ...