• subscribe
September 29, 2009 12:00 AM

T-SQL Challenge – Efficient Partitioned TOP

SQL Server Pro
InstantDoc ID #102883

This challenge involves writing a query against a table called T1 which you create and populate by running the following code:

SET NOCOUNT ON;

USE tempdb;

 

IF OBJECT_ID('dbo.T1', 'U') IS NOT NULL DROP TABLE dbo.T1;

GO

CREATE TABLE dbo.T1

(

  id   INT         NOT NULL IDENTITY PRIMARY KEY,

  grp  VARCHAR(10) NOT NULL,

  col1 INT         NOT NULL,

  col2 INT         NOT NULL

);

GO

 

INSERT INTO dbo.T1(grp, col1, col2) VALUES('A',  10, -2147483648);

INSERT INTO dbo.T1(grp, col1, col2) VALUES('A',  10,          -1);

INSERT INTO dbo.T1(grp, col1, col2) VALUES('A',  10,           0);

INSERT INTO dbo.T1(grp, col1, col2) VALUES('A',  10,           1);

INSERT INTO dbo.T1(grp, col1, col2) VALUES('A',  10,  2147483647);

INSERT INTO dbo.T1(grp, col1, col2) VALUES('A',   0,           0);

INSERT INTO dbo.T1(grp, col1, col2) VALUES('A',   0,           1);

INSERT INTO dbo.T1(grp, col1, col2) VALUES('A',   0,  2147483647);

INSERT INTO dbo.T1(grp, col1, col2) VALUES('A', -10, -2147483648);

INSERT INTO dbo.T1(grp, col1, col2) VALUES('A', -10,          -1);

INSERT INTO dbo.T1(grp, col1, col2) VALUES('A', -10,           0);

INSERT INTO dbo.T1(grp, col1, col2) VALUES('B',  42,         100);

INSERT INTO dbo.T1(grp, col1, col2) VALUES('B',  42,           0);

INSERT INTO dbo.T1(grp, col1, col2) VALUES('B',  42,         200);

INSERT INTO dbo.T1(grp, col1, col2) VALUES('B',  42,          30);

INSERT INTO dbo.T1(grp, col1, col2) VALUES('B',  42,          50);

INSERT INTO dbo.T1(grp, col1, col2) VALUES('B', -42,        -100);

INSERT INTO dbo.T1(grp, col1, col2) VALUES('B', -42,           0);

INSERT INTO dbo.T1(grp, col1, col2) VALUES('B', -42,        -200);

INSERT INTO dbo.T1(grp, col1, col2) VALUES('B', -42,         -30);

INSERT INTO dbo.T1(grp, col1, col2) VALUES('B', -42,         -50);

GO

 

Your task is to write a query that returns for each unique grp value one row, with the highest col1 value, and lowest col2 value among the rows with the highest col1 value. The desired result for the given sample data is:

grp        col1        col2

---------- ----------- -----------

A          10          -2147483648

B          42          0

 

The TOP option doesn’t support an OVER clause, but if it did, the solution might have looked like this:

SELECT TOP (1) OVER(PARTITION BY grp ORDER BY col1 DESC, col2)

  grp, col1, col2

FROM dbo.T1;

 

The above code of course doesn’t work in SQL Server (unfortunately), and is given here as conceptual pseudo code. For now, you can apply similar logic by using the ROW_NUMBER function like so:

WITH C AS

(

  SELECT grp, col1, col2,

    ROW_NUMBER() OVER(PARTITION BY grp ORDER BY col1 DESC, col2) AS rownum

  FROM dbo.T1

)

SELECT grp, col1, col2

FROM C

WHERE rownum = 1;

 

This solution produces the desired result; however, it doesn’t perform well when T1 has a large number of rows and there’s no index on (grp, col1 DESC, col2). In such a case, the execution plan for this solution involves sorting which is very expensive. You can see for yourself by running this solution against a large set of rows. You can use the following code to populate T1 with 10,000,000 rows:

-- Large set of sample data

 

-- DDL for GetNums Function

IF OBJECT_ID('dbo.GetNums', 'IF') IS NOT NULL

  DROP FUNCTION dbo.GetNums;

GO

CREATE FUNCTION dbo.GetNums(@n AS BIGINT) RETURNS TABLE

AS

RETURN

  WITH

  L0   AS(SELECT 1 AS c UNION ALL SELECT 1),

  L1   AS(SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B),

  L2   AS(SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B),

  L3   AS(SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B),

  L4   AS(SELECT 1 AS c FROM L3 AS A CROSS JOIN L3 AS B),

  L5   AS(SELECT 1 AS c FROM L4 AS A CROSS JOIN L4 AS B),

  Nums AS(SELECT ROW_NUMBER() OVER(ORDER BY (SELECT 0)) AS n FROM L5)

  SELECT n FROM Nums WHERE n <= @n;

GO

 

-- Populate T1 with 10,000,000 rows in 10,000 groups

TRUNCATE TABLE dbo.T1;

 

INSERT INTO dbo.T1 WITH(TABLOCK) (grp, col1, col2)

  SELECT

    'Group' + CAST(ABS(CHECKSUM(NEWID()))%10000 + 1 AS VARCHAR(10)) AS grp,

    CHECKSUM(NEWID()) % 100 AS col1,

    CHECKSUM(NEWID()) AS col2

  FROM dbo.GetNums(10000000) AS Nums;

 

Remember that in reality you won’t always be able to create the ideal indexes to support every possible request in your system, and sometimes you will choose not to create new indexes taking into consideration the negative impact those will have on modifications.

So, the challenge is: can you think of a better performing solution than the one I presented earlier based on the ROW_NUMBER function, with the given large set of sample data, and assuming you cannot create any new indexes?

Note that you are not guaranteed that the values in col1 and col2 are positive; these columns can contain negative, zero, and positive values.

Please post your solution as a comment to this entry plus e-mail it to me (itzik@solidq.com).

I’ll post an entry with the solutions next week.

Cheers and good luck!

BG

 



ARTICLE TOOLS

Comments
  • bouyssou
    3 years ago
    Nov 02, 2009

    Hi all, i have test every solution without casting function.
    On my machine :
    MuadDib solution and reganwick's one takes 24 s.
    I think I find a better way that takes 14 s.
    Here is the code :

    SELECT T1.GRP,MAX(T1.COL1),MIN(T1.COL2) FROM T1
    INNER JOIN (
    SELECT GRP,MAX(COL1)AS COL1 FROM T1
    GROUP BY GRP
    ) AS TMAX ON TMAX.COL1 = T1.COL1 AND TMAX.GRP = T1.GRP
    GROUP BY T1.GRP

    regards,
    Laurent.





    SELECT T1.GRP,MAX(T1.COL1),MIN(T1.COL2) FROM T1
    INNER JOIN (
    SELECT GRP,MAX(COL1)AS COL1 FROM T1
    --WHERE GRP = 'B'
    GROUP BY GRP
    ) AS TMAX ON TMAX.COL1 = T1.COL1 AND TMAX.GRP = T1.GRP
    GROUP BY T1.GRP

  • Calvin
    3 years ago
    Oct 03, 2009

    Stevekass’s solution might not be as good as ntt123 thinks. I compared that solution with one I first tried that was similar to that of Muadib, just using a CTE rather than a nested select. Muadib’s and mine were similar enough in structure that I didn’t bother posting the solution when I saw his. Out of curiosity I compared the results to that of stevekass’s. At the time I was running against an older 32-bit single core machine. Stevekass’s consistently took over 1 minute, approximately 20 seconds longer than mine and Muadib’s. I retested the queries this morning, this time against a 64-bit dual core machine and both queries performed in approximately 18 seconds which made the difference imperceptible.

  • NATALYA
    3 years ago
    Oct 02, 2009

    the solution by stevekass is unbelivable nice,
    thanks

  • Adriaan
    3 years ago
    Oct 02, 2009

    With a smaller number of rows, the CTE solution seems more efficient than this time-proven GROUP BY syntax. Putting the two queries in a batch, the CTE takes up around 36%.

    select t1.grp, t1.col1, MIN(t1.col2) min_col2
    from (select grp, MAX(col1) max_col1 from T1 group by grp) grp_max
    inner join t1 on grp_max.grp = t1.grp and grp_max.Max_col1 = t1.col1
    group by t1.grp, t1.col1
    order by 1;

    However by the time you hit 10,000 groups, the CTE is nearer 45%. If you add an ORDER BY for the results, it hits 50%. Then after adding more rows, at some point the execution plan has items for parallellism, and more of them for the old-fashioned syntax. With sorting on the results it stays 50/50 between the two, but without sorting the balance is skipped.

    Who needs them new-thangled thingeemajigs?

  • Regan
    3 years ago
    Sep 30, 2009

    Here's my best thus far. 15 sec on my machine.


    SELECT
    MaxData.grp
    ,MaxData.col1
    ,MinData.col2
    FROM
    (
    SELECT
    grp, MAX(col1) AS col1
    FROM
    T1
    GROUP BY
    grp
    ) MaxData
    INNER JOIN
    (
    SELECT
    grp, col1, MIN(col2) AS col2
    FROM
    T1
    GROUP BY
    grp, col1
    ) MinData
    ON MaxData.grp = MinData.grp
    AND MaxData.col1 = MinData.col1

You must log on before posting a comment.

Are you a new visitor? Register Here