Puzzled by T-SQL
- By
Itzik Ben-Gan
[2/3/2010]
Posted by:
Itzik Ben-Gan
Much has been said already about assignment SELECT statements that perform multi-row variable assignment with an ORDER BY clause. See Paul Nielsen’s blog on the subject as an example. The basic concept is to perform an aggregate calculation in desired order. A very common use of this technique is to perform string concatenation. Consider the following code as an example:
DECLARE @s AS VARCHAR(MAX);
SET @s = '';
SELECT @s = @s + col2 + ';'
FROM dbo.T1
ORDER BY col1;
SELECT @s;
Some expect this code to return a string with the concatenated col2 values in col1 ordering. I’m not aware of any official documentation that describes how such code should behave. The documentation only refers to single row assignments. Some (including me) are reluctant to rely on this technique, and prefer to use alternatives that have guaranteed, documented behavior (e.g., using FOR XML PATH). Reasoning for distrusting the undocumented techniques is similar to what I describe here regarding the similar multi-row UPDATE with variables technique.
A few days ago I got a beautiful example from Aviv Zucker from Intel demonstrating that this technique cannot be trusted. We simplified the repro and narrowed down the optimization reasoning in the cases where it doesn’t generate the “expected” result. In “expected,” of course I’m referring to the expectation some have for this technique to return all concatenated values based on specified ordering. Personally, I don’t have any expectations from this technique.
I got the same behavior on SQL Server 2008+SP1+CU6 and SQL Server 2005+SP3.
First, here’s code to create a helper function called GetNums that returns a sequence of integers, as well as a table called T1 populated with 100 rows of sample data:
SET NOCOUNT ON;
USE tempdb;
GO
-- Helper function GetNums to generate a sequence of numbers
IF OBJECT_ID('dbo.GetNums') IS NOT NULL DROP FUNCTION dbo.GetNums;
GO
CREATE FUNCTION dbo.GetNums(@n AS BIGINT) RETURNS TABLE
AS
RETURN
WITH
L0 AS(SELECT 1 AS c UNION ALL SELECT 1),
L1 AS(SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B),
L2 AS(SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B),
L3 AS(SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B),
L4 AS(SELECT 1 AS c FROM L3 AS A CROSS JOIN L3 AS B),
L5 AS(SELECT 1 AS c FROM L4 AS A CROSS JOIN L4 AS B),
Nums AS(SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS n FROM L5)
SELECT TOP (@n) n FROM Nums ORDER BY n;
GO
-- Create a table called T1 and populate it with sample data
IF OBJECT_ID('dbo.T1', 'U') IS NOT NULL DROP TABLE dbo.T1;
CREATE TABLE dbo.T1
(
col1 INT IDENTITY NOT NULL,
col2 VARCHAR(100) NOT NULL,
filler CHAR(2000) NULL,
CONSTRAINT PK_T1 PRIMARY KEY CLUSTERED(col1)
);
INSERT INTO dbo.T1(col2)
SELECT 'String ' + CAST(n AS VARCHAR(10))
FROM dbo.GetNums(100) AS Nums;
GO
Currently the table is organized as a B-tree with the clustered index key being col1. There are no clustered indexes defined on the table.
As the first test I ran the following code:
-- Test 1, with ORDER BY
DECLARE @s AS VARCHAR(MAX);
SET @s = '';
SELECT @s = @s + col2 + ';'
FROM dbo.T1
ORDER BY col1;
SELECT @s;
GO
I got the following output that I referred to earlier as the “expected” output by some:
String 1;String 2;String 3;String 4;String 5;String 6;String 7;String 8;String 9;String 10;String 11;String 12;String 13;String 14;String 15;String 16;String 17;String 18;String 19;String 20;String 21;String 22;String 23;String 24;String 25;String 26;String 27;String 28;String 29;String 30;String 31;String 32;String 33;String 34;String 35;String 36;String 37;String 38;String 39;String 40;String 41;String 42;String 43;String 44;String 45;String 46;String 47;String 48;String 49;String 50;String 51;String 52;String 53;String 54;String 55;String 56;String 57;String 58;String 59;String 60;String 61;String 62;String 63;String 64;String 65;String 66;String 67;String 68;String 69;String 70;String 71;String 72;String 73;String 74;String 75;String 76;String 77;String 78;String 79;String 80;String 81;String 82;String 83;String 84;String 85;String 86;String 87;String 88;String 89;String 90;String 91;String 92;String 93;String 94;String 95;String 96;String 97;String 98;String 99;String 100;
Here’s the graphical execution plan I got for this code:
And here’s the textual plan:
|--Compute Scalar(DEFINE:([Expr1003]=([@s]+[tempdb].[dbo].[T1].[col2])+';'))
|--Clustered Index Scan(OBJECT:([tempdb].[dbo].[T1].[PK_T1]), ORDERED FORWARD)
Observe that the clustered index on col1 was scanned in an ordered fashion, and then the variable assignment took place. It was a matter of optimization that things were performed in this specific order. But if we assume for a moment that this technique is not supposed to guarantee any specific order of assignment, then if it made more sense in terms of performance in certain cases to first deal with the scalar computation and then with the ordering, the result might not reflect the specified ordering. Not only this, you might end up getting an intermediate state of the concatenation. That’s exactly what the next test demonstrates. First, create a covering index on (col2, col1) and then rerun the multi-row assignment with ORDER BY code:
-- Test 2, with ORDER BY, after adding covering nc index
CREATE NONCLUSTERED INDEX idx_nc_col2 ON dbo.T1(col2, col1);
GO
DECLARE @s AS VARCHAR(MAX);
SET @s = '';
SELECT @s = @s + col2 + ';'
FROM dbo.T1
ORDER BY col1;
SELECT @s;
GO
This time I got the following output:
String 100;
Clearly, the result does not contain the concatenated values. So what happened? The answer lies in the code’s execution plan:
Here’s the textual form of the plan:
|--Sort(ORDER BY:([tempdb].[dbo].[T1].[col1] ASC))
|--Compute Scalar(DEFINE:([Expr1003]=([@s]+[tempdb].[dbo].[T1].[col2])+';'))
|--Index Scan(OBJECT:([tempdb].[dbo].[T1].[idx_nc_col2]))
The optimizer figured that with such a small set of rows it’s cheaper to scan the data from the covering nonclustered index in no particular order, then apply the computation, and then sort the data by col1. The key is in the fact that the sorting happens after the computation and not before it. It seems that the output reflects an intermediate state of the concatenation. To me it’s not that the output is incorrect here since I’ve never seen any official documentation that states what is considered correct result for such code. But yeah, if based on your intuition and observations you expected to get the concatenated col2 values in col1 ordering, here’s evidence why this technique should not be trusted. Better stick to the guaranteed and supported techniques like the one using the FOR XML PATH option.
Lastly, consider the following test without an ORDER BY clause:
-- Test 3, without ORDER BY
DECLARE @s AS VARCHAR(MAX);
SET @s = '';
SELECT @s = @s + col2 + ';'
FROM dbo.T1;
SELECT @s;
GO
I get the following output:
String 1;String 10;String 100;String 11;String 12;String 13;String 14;String 15;String 16;String 17;String 18;String 19;String 2;String 20;String 21;String 22;String 23;String 24;String 25;String 26;String 27;String 28;String 29;String 3;String 30;String 31;String 32;String 33;String 34;String 35;String 36;String 37;String 38;String 39;String 4;String 40;String 41;String 42;String 43;String 44;String 45;String 46;String 47;String 48;String 49;String 5;String 50;String 51;String 52;String 53;String 54;String 55;String 56;String 57;String 58;String 59;String 6;String 60;String 61;String 62;String 63;String 64;String 65;String 66;String 67;String 68;String 69;String 7;String 70;String 71;String 72;String 73;String 74;String 75;String 76;String 77;String 78;String 79;String 8;String 80;String 81;String 82;String 83;String 84;String 85;String 86;String 87;String 88;String 89;String 9;String 90;String 91;String 92;String 93;String 94;String 95;String 96;String 97;String 98;String 99;
Obviously there’s no guarantee for ordering here. Whether you want to rely on this technique or not to return all values concatenated when order doesn’t matter to you is up to you. I haven’t stumbled into a case where I didn’t get all concatenated values, but who’s to say there isn’t one.
Again, I’d like to thank Aviv Zucker for sending me his initial example!
Cheers,
BG
[2/1/2010]
Posted by:
Itzik Ben-Gan
While for some SQL may be only work related, to me it is much more than that. To me it’s a language that allows me to express myself, very much like English or any other language is to others. I sometimes invent challenges for myself just for fun—you might say, for the soul. So, recently I did some research related to spatial data, and while I was at it I felt the irresistible urge to produce a realistic picture with a spatial query. Here’s what I came up with:
SELECT geometry::Point(x, y, 0).STBuffer(0.3) As sig
FROM ( VALUES
(01,16),(01,17),(01,18),(01,19),(01,20),(01,21),(01,22),(01,23),(02,14),(02,15),
(02,24),(02,30),(02,31),(03,13),(03,23),(03,24),(03,25),(03,26),(03,27),(03,28),
(03,29),(03,32),(03,33),(03,34),(03,35),(04,13),(04,36),(05,10),(05,11),(05,12),
(05,37),(06,08),(06,09),(06,23),(06,38),(07,07),(07,23),(07,39),(08,06),(08,24),
(08,25),(08,39),(09,05),(09,23),(09,24),(09,25),(09,26),(09,40),(10,04),(10,23),
(10,24),(10,25),(10,26),(10,39),(11,03),(11,25),(11,39),(12,02),(12,10),(12,25),
(12,39),(13,01),(13,39),(14,01),(14,39),(15,01),(15,10),(15,14),(15,38),(16,02),
(16,07),(16,13),(16,14),(16,15),(16,17),(16,18),(16,19),(16,20),(16,21),(16,22),
(16,39),(17,02),(17,10),(17,14),(17,15),(17,16),(17,17),(17,18),(17,19),(17,20),
(17,21),(17,22),(17,23),(17,24),(17,25),(17,38),(18,02),(18,08),(18,10),(18,14),
(18,15),(18,16),(18,17),(18,18),(18,24),(18,25),(18,38),(19,03),(19,08),(19,10),
(19,15),(19,25),(19,26),(19,38),(20,03),(20,10),(20,11),(20,24),(20,25),(20,26),
(20,38),(21,04),(21,11),(21,23),(21,24),(21,25),(21,26),(21,38),(21,39),(22,05),
(22,06),(22,23),(22,24),(22,25),(22,38),(22,39),(23,07),(23,08),(23,22),(23,23),
(23,24),(23,38),(23,39),(23,40),(24,09),(24,10),(24,11),(24,12),(24,13),(24,22),
(24,23),(24,24),(24,37),(24,38),(24,39),(25,14),(25,22),(25,23),(25,24),(25,25),
(25,37),(25,38),(25,39),(26,15),(26,16),(26,22),(26,23),(26,24),(26,25),(26,26),
(26,27),(26,28),(26,29),(26,30),(26,31),(26,37),(26,38),(26,39),(27,17),(27,18),
(27,19),(27,20),(27,23),(27,24),(27,27),(27,28),(27,29),(27,30),(27,31),(27,32),
(27,33),(27,34),(27,35),(27,36),(27,37),(27,38),(27,39),(28,21),(28,22),(28,26),
(28,27),(28,28),(28,29),(28,30),(28,31),(28,32),(28,33),(28,34),(28,35),(28,36),
(28,37),(28,38),(29,26),(29,27),(29,28),(29,29),(29,30),(29,31),(29,32),(29,33),
(29,34),(29,35),(29,36),(29,37),(29,38),(29,39),(30,26),(30,28),(30,31),(30,32),
(30,33),(30,34),(30,35),(30,36),(30,37),(31,28),(31,29) ) AS Coordinates(x, y);
Run the code in SQL Server 2008 Management Studio and check out the Spatial Results tab:
This is me 10 years ago when I was young and pretty. You probably think that I used some tool that plots out the values from a given picture, but you’re wrong. I took the picture that the SQL Server Magazine folks took of me 10 years ago. I opened the picture with Microsoft Paint, flipped it vertically, saved as a monochrome bitmap, zoomed to the max, and then manually wrote down the point coordinates one by one. I kid you not!
This took me a few good hours, but it was well worth it. Now if that’s not geeky, I don’t know what is. So I challenge you to come up with your own geeky sig.
Cheers,
BG
[1/20/2010]
Posted by:
Itzik Ben-Gan
Consider cases of columns referenced in a query based on which the optimizer can theoretically use an access method that relies on index ordering. Classic examples are columns that you filter by (index seek), group by (ordered scan and stream aggregate), order by (ordered scan), etc. In most cases when you apply manipulation to the column of interest as opposed to referring only to the base column, it will prevent the optimizer from relying on index ordering. There are only a very small number of exceptions where the optimizer was coded to realize that index ordering is preserved albeit the manipulation. As a couple of examples, assuming there is an index on orderdate, the following queries have the potential to rely on index ordering:
USE InsideTSQL2008;
SELECT orderid, orderdate, custid, empid
FROM Sales.Orders
WHERE orderdate = '20070212';
|--Nested Loops(Inner Join, OUTER REFERENCES:([InsideTSQL2008].[Sales].[Orders].[orderid]))
|--Index Seek(OBJECT:([InsideTSQL2008].[Sales].[Orders].[idx_nc_orderdate]), SEEK:([InsideTSQL2008].[Sales].[Orders].[orderdate]='2007-02-12 00:00:00.000') ORDERED FORWARD)
|--Clustered Index Seek(OBJECT:([InsideTSQL2008].[Sales].[Orders].[PK_Orders]), SEEK:([InsideTSQL2008].[Sales].[Orders].[orderid]=[InsideTSQL2008].[Sales].[Orders].[orderid]) LOOKUP ORDERED FORWARD)
SELECT orderdate, COUNT(*) AS num_orders
FROM Sales.Orders
GROUP BY orderdate;
|--Compute Scalar(DEFINE:([Expr1003]=CONVERT_IMPLICIT(int,[Expr1006],0)))
|--Stream Aggregate(GROUP BY:([InsideTSQL2008].[Sales].[Orders].[orderdate]) DEFINE:([Expr1006]=Count(*)))
|--Index Scan(OBJECT:([InsideTSQL2008].[Sales].[Orders].[idx_nc_orderdate]), ORDERED FORWARD)
The plan for the first query uses an index seek. The plan for the second query uses an ordered scan of the index and a stream aggregate.
However, in the following cases that apply manipulation to the column, the optimizer wasn’t coded to realize that index ordering is preserved:
SELECT orderid, orderdate, custid, empid
FROM Sales.Orders
WHERE CONVERT(CHAR(8), orderdate, 112) = '20070212';
|--Nested Loops(Inner Join, OUTER REFERENCES:([InsideTSQL2008].[Sales].[Orders].[orderid]))
|--Index Scan(OBJECT:([InsideTSQL2008].[Sales].[Orders].[idx_nc_orderdate]), WHERE:(CONVERT(char(8),[InsideTSQL2008].[Sales].[Orders].[orderdate],112)='20070212'))
|--Clustered Index Seek(OBJECT:([InsideTSQL2008].[Sales].[Orders].[PK_Orders]), SEEK:([InsideTSQL2008].[Sales].[Orders].[orderid]=[InsideTSQL2008].[Sales].[Orders].[orderid]) LOOKUP ORDERED FORWARD)
SELECT YEAR(orderdate) AS orderyear, COUNT(*) AS num_orders
FROM Sales.Orders
GROUP BY YEAR(orderdate);
|--Compute Scalar(DEFINE:([Expr1004]=CONVERT_IMPLICIT(int,[Expr1007],0)))
|--Hash Match(Aggregate, HASH:([Expr1003]), RESIDUAL:([Expr1003] = [Expr1003]) DEFINE:([Expr1007]=COUNT(*)))
|--Compute Scalar(DEFINE:([Expr1003]=datepart(year,[InsideTSQL2008].[Sales].[Orders].[orderdate])))
|--Index Scan(OBJECT:([InsideTSQL2008].[Sales].[Orders].[idx_nc_orderdate]))
In the first case the plan shows a full scan of the index (as opposed to the desired index seek), and in the second case the plan shows an unordered index scan followed by a hash aggregate (as opposed to an ordered index scan followed by a stream aggregate).
In some cases there are alternatives that provide the same functionality without applying manipulation to the column of interest. For example, when filtering a date and time period like a whole month, you should use a range filter as opposed to manipulating the filtered column. BTW, in SQL Server 2008 the optimizer was coded to realize that the expression CAST(orderdate AS DATE) preserves index ordering, therefore the predicate CAST(orderdate AS DATE) = '20070212' can potentially be handled with an index seek. But the main point I wanted to make in this entry is different…
You can create a computed column based on the expression of interest; index the computed column; and the optimizer will consider using such an index even when the query doesn’t refer to the computed column name, rather keeps the original expression. As an example, the following code creates two computed columns and indexes on them:
ALTER TABLE Sales.Orders ADD
orderdatestr AS CONVERT(CHAR(8), orderdate, 112),
orderyear AS YEAR(orderdate);
CREATE INDEX idx_orderdatestr ON Sales.Orders(orderdatestr);
CREATE INDEX idx_orderyear ON Sales.Orders(orderyear);
This time the plans for the previous queries with the manipulated columns do rely on index ordering:
SELECT orderid, orderdate, custid, empid
FROM Sales.Orders
WHERE CONVERT(CHAR(8), orderdate, 112) = '20070212';
|--Nested Loops(Inner Join, OUTER REFERENCES:([InsideTSQL2008].[Sales].[Orders].[orderid]))
|--Index Seek(OBJECT:([InsideTSQL2008].[Sales].[Orders].[idx_orderdatestr]), SEEK:([InsideTSQL2008].[Sales].[Orders].[orderdatestr]='20070212') ORDERED FORWARD)
|--Clustered Index Seek(OBJECT:([InsideTSQL2008].[Sales].[Orders].[PK_Orders]), SEEK:([InsideTSQL2008].[Sales].[Orders].[orderid]=[InsideTSQL2008].[Sales].[Orders].[orderid]) LOOKUP ORDERED FORWARD)
SELECT YEAR(orderdate) AS orderyear, COUNT(*) AS num_orders
FROM Sales.Orders
GROUP BY YEAR(orderdate);
|--Compute Scalar(DEFINE:([Expr1004]=CONVERT_IMPLICIT(int,[Expr1007],0)))
|--Stream Aggregate(GROUP BY:([Expr1003]) DEFINE:([Expr1007]=Count(*)))
|--Compute Scalar(DEFINE:([Expr1003]=[InsideTSQL2008].[Sales].[Orders].[orderyear]))
|--Index Scan(OBJECT:([InsideTSQL2008].[Sales].[Orders].[idx_orderyear]), ORDERED FORWARD)
Observe that in the first case the plan shows an index seek in the new index, and in the second case it shows an ordered index scan and a stream aggregate.
Here’s some cleanup code:
DROP INDEX Sales.Orders.idx_orderdatestr, Sales.Orders.idx_orderyear;
ALTER TABLE Sales.Orders DROP COLUMN orderdatestr, orderyear;
Cheers,
BG
[12/27/2009]
Posted by:
Itzik Ben-Gan
Many date-related calculations in SQL Server depend on your language settings. For example, consider the following code asking for the week day number of a date that is a Monday under both us_english and British languages:
SET LANGUAGE us_english;
SELECT DATEPART(weekday, '20091228');
SET LANGUAGE British;
SELECT DATEPART(weekday, '20091228');
This code generates the following output:
Changed language setting to us_english.
-----------
2
Changed language setting to British.
-----------
1
Under us_english your session’s DATEFIRST setting is set implicitly to 7 (meaning that Sunday is the first day of the week), and hence you get the value 2 for a week day of a date that is a Monday. Under British DATEFIRST is set to 1 (meaning Monday is the first day of the week), and hence you get the value 1 as the week day of a date that is a Monday. The user’s effective language setting is reflected in such calculations.
Apparently, things are different with the DATEDIFF function. If you ask for the difference in terms of weeks between two dates, this function will ignore your language setting (and more specifically your DATEFIRST setting), and always perform the calculation assuming Sunday as the first day of the week. As an example, run the following code:
SET LANGUAGE us_english;
SELECT DATEDIFF(week, '20091227', '20091228');
SET LANGUAGE British;
SELECT DATEDIFF(week, '20091227', '20091228');
You will get 0 as the output in both cases; even though under the British language you might expect the result to be 1 since assuming Monday is the first day of the week one week boundary is crossed.
I learned about this just recently and thought this was a bug, since my intuition told me that such a calculation should be language-dependent. Turns out that SQL Server MVP Pawel Potasinski already opened a bug on this item, but it was closed with the following response from Microsoft:
“It looks like this was closed without providing any feedback. The rationale for closing this was that changing the behavior to honor DATEFIRST would make it non-deterministic, and this would prevent building indexes on computed columns or views involving it. This behavior has been around for several versions, so this is a breaking change that doesn't seem worth it.”
Since it looks like Microsoft does not consider it a bug and is not planning to change this behavior I wanted to offer a workaround in case you do need the calculation to be language-dependent. Simply subtract @@DATEFIRST days from both dates in the calculation. The calculation will still assume Sunday as the first day of the week, but the dates the calculation will operate on won’t be the original dates rather manipulated ones. Shifted in exactly the right number of days (@@DATEFIRST) backwards in time such that the calculation will behave as if it is language-dependent.
Here’s how to apply this technique to the dates used in the previous example:
SET LANGUAGE us_english;
SELECT DATEDIFF( week,
DATEADD(day, -@@DATEFIRST, '20091227'),
DATEADD(day, -@@DATEFIRST, '20091228') );
SET LANGUAGE British;
SELECT DATEDIFF( week,
DATEADD(day, -@@DATEFIRST, '20091227'),
DATEADD(day, -@@DATEFIRST, '20091228') );
This code generates the following output, as expected:
Changed language setting to us_english.
-----------
0
Changed language setting to British.
-----------
1
Cheers,
BG
[12/6/2009]
Posted by:
Itzik Ben-Gan
Last week I presented a logic puzzle involving crossing a desert that separates two towns. You can find the puzzle details here. Thanks to Peter Larsson (Pesomanen), fastvince, and cwestervelt for posting your solutions, and to Will Alber for correcting an embarrassing typo I had originally.
In last week’s puzzle I mentioned that it takes six days to cross the desert, but a single person can carry only four days worth of food supplies. You are allowed to hire porters. The goal is to cross the desert while paying for the minimum number of porter days.
First, after looking at the suggested solutions and the comments, a few clarifications are in place…
If you hire porters for a portion of the trip, you are supposed to pay them also to get back home. I thought this part was obvious, but in retrospect I should have been explicit about it.
Having mentioned that the goal is to pay for the minimum number of porter days in order to minimize costs, I hope it was clear that the cost of food per day is much lower than the actual porter fee per day, such that it can be ignored.
Lastly, this puzzle is pretty old—older than phones and simple ways of communication. So the assumption is that there is no simple way to communicate with porters from the destination town, coordinating a meeting along the way. But not having mentioned this fact, I find cwestervelt’s solution, which relies on such logic, pretty clever, demonstrating thinking outside the box.
Let’s start with the expected solution (call it Solution 1) according to my father:
You start your journey with two porters; each of you carries food for four days. After one day porter 1 gives you and porter 2 food for one day each, and goes back home with the remaining food. After another day porter 2 and you have food for three days each. Porter 2 gives you food four one day and goes back home. You now have food for four days and need four days to reach your destination. In total, you paid for six porter days (not three). You arrived to the destination in six days, though recall that minimal timing was not a goal.
As noted earlier, last week I neglected to mention that this puzzle is ancient, and that there are no simple ways to communicate with porters from the destination town to coordinate meeting along the way. Without the communications difficulty, cwestervelt’s solution (call it Solution 2) would have been valid:
You start with food for four days and take one porter (porter 1) with you with food for three days. You coordinate with another porter (porter 2) from the destination town to leave after four days with food for three days such that you will meet after five days. After the first day porter 1 gives you food for one day and goes back home. You now have food for four days. You meet porter 2 after five days, get food for one day from him, and you both go to the destination town. In total, you paid for four porter days. You arrived to the destination in six days.
My solution, though not the expected one according to my father, involves paying for zero porter days, and arriving to the destination in 12 days. You start with food for four days. After one day you leave food for two days at that place (call the place point 1) and go back to the town of origin. You take food for four days and start again. After one day you take food for one day from point 1, leaving you with food for four days, and food for one day remains in point 1. You proceed another day and leave food for two days at that place (call the place point 2), leaving you with food for one day. You go back to point 1, grab the food that you left there, and go back to the town of origin. You take food for four days, and start again. After two days you pick up the food from point 2, and now you have food for four days with four days left to reach the destination. In total, you paid for zero porter days, and reached the destination in 12 days. Since the goal was to pay for the minimum number of porter days and not necessarily to reach the destination as fast as possible, I think that this solution should be considered valid.
Cheers,
BG
[11/30/2009]
Posted by:
Itzik Ben-Gan
I got this one from my father. You are currently in one of two towns separated by a desert. You need to cross the desert to get to the other town. It takes 6 days to cross the desert, but a single person can carry only 4 days worth of food supply. You can hire porters to help you carry your food supply, but naturally you want to minimize the costs as much as possible. What would be the plan that will enable you to cross the desert while paying for the minimum possible number of porter days?
I’ll post the solution next week.
Cheers, BG
[11/24/2009]
Posted by:
Itzik Ben-Gan
Recently I wrote a two-part series in my T-SQL Black Belt column about Calculating Concurrent Sessions (Part I – November 2009, Part II – December 2009). I used the following code to create a table called Sessions and populate it with sample data:
SET NOCOUNT ON;
USE tempdb;
IF OBJECT_ID('dbo.Sessions', 'U') IS NOT NULL DROP TABLE dbo.Sessions;
CREATE TABLE dbo.Sessions
(
keycol INT NOT NULL,
app VARCHAR(10) NOT NULL,
usr VARCHAR(10) NOT NULL,
host VARCHAR(10) NOT NULL,
starttime DATETIME NOT NULL,
endtime DATETIME NOT NULL,
CONSTRAINT PK_Sessions PRIMARY KEY(keycol),
CHECK(endtime > starttime)
);
GO
CREATE INDEX idx_nc_app_st_et ON dbo.Sessions(app, starttime, endtime);
CREATE INDEX idx_nc_app_et_st ON dbo.Sessions(app, endtime, starttime);
INSERT INTO dbo.Sessions(keycol, app, usr, host, starttime, endtime)
VALUES(2, 'app1', 'user1', 'host1', '20090212 08:30', '20090212 10:30');
INSERT INTO dbo.Sessions(keycol, app, usr, host, starttime, endtime)
VALUES(3, 'app1', 'user2', 'host1', '20090212 08:30', '20090212 08:45');
INSERT INTO dbo.Sessions(keycol, app, usr, host, starttime, endtime)
VALUES(5, 'app1', 'user3', 'host2', '20090212 09:00', '20090212 09:30');
INSERT INTO dbo.Sessions(keycol, app, usr, host, starttime, endtime)
VALUES(7, 'app1', 'user4', 'host2', '20090212 09:15', '20090212 10:30');
INSERT INTO dbo.Sessions(keycol, app, usr, host, starttime, endtime)
VALUES(11, 'app1', 'user5', 'host3', '20090212 09:15', '20090212 09:30');
INSERT INTO dbo.Sessions(keycol, app, usr, host, starttime, endtime)
VALUES(13, 'app1', 'user6', 'host3', '20090212 10:30', '20090212 14:30');
INSERT INTO dbo.Sessions(keycol, app, usr, host, starttime, endtime)
VALUES(17, 'app1', 'user7', 'host4', '20090212 10:45', '20090212 11:30');
INSERT INTO dbo.Sessions(keycol, app, usr, host, starttime, endtime)
VALUES(19, 'app1', 'user8', 'host4', '20090212 11:00', '20090212 12:30');
INSERT INTO dbo.Sessions(keycol, app, usr, host, starttime, endtime)
VALUES(23, 'app2', 'user8', 'host1', '20090212 08:30', '20090212 08:45');
INSERT INTO dbo.Sessions(keycol, app, usr, host, starttime, endtime)
VALUES(29, 'app2', 'user7', 'host1', '20090212 09:00', '20090212 09:30');
INSERT INTO dbo.Sessions(keycol, app, usr, host, starttime, endtime)
VALUES(31, 'app2', 'user6', 'host2', '20090212 11:45', '20090212 12:00');
INSERT INTO dbo.Sessions(keycol, app, usr, host, starttime, endtime)
VALUES(37, 'app2', 'user5', 'host2', '20090212 12:30', '20090212 14:00');
INSERT INTO dbo.Sessions(keycol, app, usr, host, starttime, endtime)
VALUES(41, 'app2', 'user4', 'host3', '20090212 12:45', '20090212 13:30');
INSERT INTO dbo.Sessions(keycol, app, usr, host, starttime, endtime)
VALUES(43, 'app2', 'user3', 'host3', '20090212 13:00', '20090212 14:00');
INSERT INTO dbo.Sessions(keycol, app, usr, host, starttime, endtime)
VALUES(47, 'app2', 'user2', 'host4', '20090212 14:00', '20090212 16:30');
INSERT INTO dbo.Sessions(keycol, app, usr, host, starttime, endtime)
VALUES(53, 'app2', 'user1', 'host4', '20090212 15:30', '20090212 17:00');
I presented a task to calculate the maximum number of concurrent sessions per each application and provided several solutions. I got great feedback from readers, including a beautiful new set-based solution that I will present in Part III in the series, which will probably be published in the March 2010 edition of SQL Server Magazine.
In the meanwhile I wanted to cover here a variation of the task that came up in a conversation I had with a fellow SQL Server MVP and SQL wizard, Peter Larsson (aka peso).
In Part II of the series I presented the following solution returning the maximum number of concurrent sessions per application:
-- Part 1
CREATE TABLE #Ends
(
app VARCHAR(10) NOT NULL,
endtime DATETIME,
n BIGINT
);
CREATE CLUSTERED INDEX idx_app_et ON #Ends(app, endtime);
INSERT INTO #Ends(app, endtime, n)
SELECT app, endtime,
RANK() OVER(PARTITION BY app ORDER BY endtime) AS n
FROM dbo.Sessions;
-- Part 2
WITH Counts AS
(
SELECT S.app, S.starttime,
ROW_NUMBER() OVER(PARTITION BY S.app ORDER BY S.starttime)
- A.n + 1 AS cnt
FROM dbo.Sessions AS S
CROSS APPLY (SELECT TOP (1) E.n
FROM #Ends AS E
WHERE E.app = S.app
AND E.endtime > S.starttime
ORDER BY E.endtime) AS A
)
SELECT app, MAX(cnt) AS mx
FROM Counts
GROUP BY app;
DROP TABLE #Ends;
As I described in the article, Part 1 of the solution populates a temporary table with ranked session end events. Part 2 assigns row numbers to session start events, and calculates the number of concurrent sessions at each start event by subtracting from the start event row number the rank of the next end event and adding 1. Finally, the outermost query groups the data by application, and returns the maximum count of concurrent sessions per each application.
In a conversation I had with Peter we discussed a variation of the problem where you also want to get back the start and end times of the actual intervals that appeared most often, and not just the max count. You can address this variation of the problem by applying a couple of revisions to the above solution to the original task. First, in the Counts CTE query return both the starttime and the endtime attributes so that you will be able to return information about the actual intervals in the outer query. Second, in the outer query use the option TOP (1) WITH TIES based on the ordering specification: ORDER BY RANK() OVER(PARTITION BY app ORDER BY cnt DESC). All intervals that occur most often per application will be ranked 1, hence this TOP specification will return all of them. Here’s the complete solution:
-- Part 1
CREATE TABLE #Ends
(
app VARCHAR(10) NOT NULL,
endtime DATETIME,
n BIGINT
);
CREATE CLUSTERED INDEX idx_app_et ON #Ends(app, endtime);
INSERT INTO #Ends(app, endtime, n)
SELECT app, endtime,
RANK() OVER(PARTITION BY app ORDER BY endtime) AS n
FROM dbo.Sessions;
-- Part 2
WITH Counts AS
(
SELECT S.app, S.starttime, A.endtime,
ROW_NUMBER() OVER(PARTITION BY S.app ORDER BY S.starttime)
- A.n + 1 AS cnt
FROM dbo.Sessions AS S
OUTER APPLY (SELECT TOP (1) E.endtime, E.n
FROM #Ends AS E
WHERE E.app = S.app
AND E.endtime > S.starttime
ORDER BY E.endtime) AS A
)
SELECT TOP (1) WITH TIES *
FROM Counts
ORDER BY RANK() OVER(PARTITION BY app ORDER BY cnt DESC);
DROP TABLE #Ends;
Note that in case there’s more than one interval that appears the maximum number of times, this solution will return all ties. If you want to return only one interval per application, use the ROW_NUMBER function instead of RANK, and add a tie breaker to the ROW_NUMBER’s ORDER BY list if you want the solution to be deterministic. As an example, if you want the earliest interval to win in case of ties, add starttime, endtime to the ORDER BY list, like so:
ORDER BY ROW_NUMBER() OVER(PARTITION BY app ORDER BY cnt DESC, starttime, endtime);
Cheers,
BG
[10/29/2009]
Posted by:
Itzik Ben-Gan
An auxiliary table of numbers is a helper table that contains a sequence of integers from 1 and on. It is a very handy helper table that I use to solve many different types of problems. In case you cannot, or do not want to generate a permanent table, you can produce a virtual one on the fly very efficiently. The trick is to use cross joins. You start with a virtual table with two rows:
WITH
L0 AS(SELECT 1 AS c UNION ALL SELECT 1)
Then, you perform a cross join between two instances of this table:
WITH
L0 AS(SELECT 1 AS c UNION ALL SELECT 1),
L1 AS(SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B)
Then, you perform a cross join between two instances of the last table:
WITH
L0 AS(SELECT 1 AS c UNION ALL SELECT 1),
L1 AS(SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B),
L2 AS(SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B)
And after applying such cross joins five time, you have a table with 2^2^5 (4,294,967,296) rows:
WITH
L0 AS(SELECT 1 AS c UNION ALL SELECT 1),
L1 AS(SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B),
L2 AS(SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B),
L3 AS(SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B),
L4 AS(SELECT 1 AS c FROM L3 AS A CROSS JOIN L3 AS B),
L5 AS(SELECT 1 AS c FROM L4 AS A CROSS JOIN L4 AS B)
To produce the actual sequence of numbers, you can use the ROW_NUMBER function with ORDER BY (SELECT NULL), letting the optimizer know that it doesn’t really need to sort the data.
You can create a table function based on this code, and request a certain number of numbers by passing an input parameter. The tricky part is to come up with a solution that stops processing the Cartesian products as soon as the requested number of numbers was produced, and not always try to generate all four billion of those. Until recently, I used the following solution:
IF OBJECT_ID('dbo.GetNums') IS NOT NULL DROP FUNCTION dbo.GetNums;
GO
CREATE FUNCTION dbo.GetNums(@n AS BIGINT) RETURNS TABLE
AS
RETURN
WITH
L0 AS(SELECT 1 AS c UNION ALL SELECT 1),
L1 AS(SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B),
L2 AS(SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B),
L3 AS(SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B),
L4 AS(SELECT 1 AS c FROM L3 AS A CROSS JOIN L3 AS B),
L5 AS(SELECT 1 AS c FROM L4 AS A CROSS JOIN L4 AS B),
Nums AS(SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS n FROM L5)
SELECT n FROM Nums WHERE n <= @n;
GO
To return 10 numbers you would simply pass the value 10 as input:
SELECT n FROM dbo.GetNums(10) AS Nums;
This generates the following output:
n
---
1
2
3
4
5
6
7
8
9
10
This code finishes fast and does stop processing after 10 rows are produced. If you look at the execution plan produced for this query, you will notice a Top operator that is in charge of stopping the processing at the right point. However, I noticed that in certain cases the optimizer removes the Top operator, and actually tries to generate all four billion rows, before the filtering part. Here’s one such example:
WITH Primes(p) AS
(
SELECT 2
UNION ALL SELECT 3
UNION ALL SELECT 5
UNION ALL SELECT 7
)
SELECT *
FROM Primes
CROSS APPLY dbo.GetNums(Primes.p) AS Nums;
Recently I revised the solution to use the TOP option instead of the WHERE filter like so:
IF OBJECT_ID('dbo.GetNums') IS NOT NULL DROP FUNCTION dbo.GetNums;
GO
CREATE FUNCTION dbo.GetNums(@n AS BIGINT) RETURNS TABLE
AS
RETURN
WITH
L0 AS(SELECT 1 AS c UNION ALL SELECT 1),
L1 AS(SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B),
L2 AS(SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B),
L3 AS(SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B),
L4 AS(SELECT 1 AS c FROM L3 AS A CROSS JOIN L3 AS B),
L5 AS(SELECT 1 AS c FROM L4 AS A CROSS JOIN L4 AS B),
Nums AS(SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS n FROM L5)
SELECT TOP (@n) n FROM Nums ORDER BY n;
GO
And for this solution the optimizer always seems to use the Top operator and stop processing in the right place. Try the following after recreating the function using the new version:
WITH Primes(p) AS
(
SELECT 2
UNION ALL SELECT 3
UNION ALL SELECT 5
UNION ALL SELECT 7
)
SELECT *
FROM Primes
CROSS APPLY dbo.GetNums(Primes.p) AS Nums;
And this time it should finish quickly.
Cheers,
BG
[10/5/2009]
Posted by:
Itzik Ben-Gan
Last week I provided a T-SQL challenge entitled “Efficient Partitioned TOP” involving finding an efficient solution to a given task when you cannot create new indexes on the queried table. You can find the puzzle details here. I gave code to populate the queried table T1 with 10,000,000 rows of sample data. I provided the following solution:
WITH C AS
(
SELECT grp, col1, col2,
ROW_NUMBER() OVER(PARTITION BY grp ORDER BY col1 DESC, col2) AS rownum
FROM dbo.T1
)
SELECT grp, col1, col2
FROM C
WHERE rownum = 1;
When there’s no index on (grp, col1 DESC, col2) the execution plan for this solution involves sorting the 10,000,000 rows, which is very expensive. Here are the performance measures I got for this solution on my test machine:
Elapsed time: 83221, CPU: 120167, Logical Reads: 42472
The data is scanned once, but the sort operation contributes to the majority of the cost of the query. The challenge was to find a more efficient solution assuming creating a new index is not an option.
Thanks to all those who submitted solutions: Peter Larsson, Plamen Ratchev, Steve Kass, MuadDib, Rudolf, Paula North, Rob Farley, Regan Wick, Adriaan Simons, and cwestervelt.
Most people (Peter Larsson, Plamen Ratchev, MuadDib, Rudolf, Paula North, Regan Wick, Adriaan Simons, cwestervelt) submitted a pretty straightforward solution based on grouping, joining, and then grouping again:
SELECT A.grp, A.col1, MIN(B.col2) AS col2
FROM ( SELECT grp,
MAX(col1) AS col1
FROM dbo.T1
GROUP BY grp ) AS A
JOIN dbo.T1 AS B
ON B.grp = A.grp
AND B.col1 = A.col1
GROUP BY A.grp, A.col1;
The performance measures I got for this solution are:
Elapsed time: 30798, CPU: 52961, Logical Reads: 84584
The plan for this solution shows that the data is scanned twice, but still it is substantially faster than the solution based on the ROW_NUMBER function because it doesn’t involve sorting. Both the aggregations and the join are handled with hashing. With such a large number of rows, handling the aggregations with hashing is more efficient than sorting, albeit the extra scan of the data.
Peter Larsson also provided a couple of other solutions based on similar logic. One using the APPLY operator:
SELECT A.grp, A.col1, B.col2
FROM ( SELECT grp,
MAX(col1) AS col1
FROM dbo.T1
GROUP BY grp ) AS A
CROSS APPLY ( SELECT MIN(C.col2) AS col2
FROM dbo.T1 AS C
WHERE C.grp = A.grp
AND C.col1 = A.col1) AS B;
Elapsed time: 29311, CPU: 52619, Logical Reads: 84540
And another using correlated subqueries with aggregate functions:
SELECT DISTINCT A.grp, A.col1, A.col2
FROM dbo.T1 AS A
WHERE A.col1 = (SELECT MAX(col1)
FROM dbo.T1 AS B
WHERE B.grp = A.grp)
AND A.col2 = (SELECT MIN(col2)
FROM DBO.T1 AS C
WHERE C.grp = A.grp AND c.col1 = a.col1);
Elapsed time: 32623, CPU: 60185, Logical Reads: 126864
Steve Kass, Rob Farley and I used a completely different approach to solving the problem. You use some method to concatenate the values of col1 and col2 after some manipulation that inverses the comparison/ordering behavior of col2. You then group the rows by grp and calculate the maximum of the concatenated values (call the result mx). Finally, you extract the two elements from mx, and apply manipulation than inverses the part representing col2 back.
Steve and Rob used numeric manipulation to handle the math, and character manipulation to handle the concatenation. Here’s Steve’s solution:
WITH C AS
(
SELECT grp,
MAX( CAST(5000000000.+col1 AS CHAR(10))
+ CAST(5000000000.-col2 AS CHAR(10)) ) AS mx
FROM dbo.T1
GROUP BY grp
)
SELECT grp,
CAST(CAST(LEFT(mx,10) AS DECIMAL(10,0))-5000000000. AS INT) AS col1,
CAST(5000000000.-CAST(RIGHT(mx,10) AS DECIMAL(10,0)) AS INT) AS col2
FROM C;
The plan for this solution involves only one scan of the data, and a hash aggregate. But due to the length and type of the aggregated element, the hash aggregate in this solution is more expensive than the hash aggregates used in the previous solutions. Here are the performance measures I got for this solution:
Elapsed time: 30906, CPU: 56597, Logical Reads: 42336
I used binary manipulation to produce a single binary string representing the concatenated col1 value and the inversed col2 value. I also added prefixes to the values to make sure that their comparison/ordering behavior will be in accord with the integer values. Here’s the solution:
WITH C AS
(
SELECT grp,
MAX( CASE WHEN col1 >= 0 THEN 0x01 ELSE 0x00 END + CAST(col1 AS BINARY(4))
+ CASE WHEN col2 >= 0 THEN 0x00 ELSE 0x01 END + CAST(~col2 AS BINARY(4)) ) AS mx
FROM dbo.T1
GROUP BY grp
)
SELECT grp,
CAST(SUBSTRING(mx, 2, 4) AS INT) AS col1,
~CAST(SUBSTRING(mx, 7, 4) AS INT) AS col2
FROM C;
The concatenated string is shorter than in the previous solution, plus the aggregation of binary values is less expensive than aggregation of character values. Therefore, this solution is faster than the previous one. Here are the performance measures I got for it:
Elapsed time: 22183, CPU: 41089, Logical Reads: 42312
But this solution relies on the physical representation of the data—specifically the fact that SQL Server uses twos complement as the storage format for integers. The previous solution is not dependant on physical representation of the data.
Note that all performance measures I presented here are ones I got on my test machine, which has two CPU cores, 4 GB RAM, and a single hard drive. On my test machine, the last solution based on the binary concatenation technique was the fastest. Others who ran the solutions on their hardware reported different measures with different ratios. All got very good performance for the concatenation techniques, but for some, other solutions got pretty close, and in cases even slightly better performance.
Cheers,
Itzik
[9/29/2009]
Posted by:
Itzik Ben-Gan
This challenge involves writing a query against a table called T1 which you create and populate by running the following code:
SET NOCOUNT ON;
USE tempdb;
IF OBJECT_ID('dbo.T1', 'U') IS NOT NULL DROP TABLE dbo.T1;
GO
CREATE TABLE dbo.T1
(
id INT NOT NULL IDENTITY PRIMARY KEY,
grp VARCHAR(10) NOT NULL,
col1 INT NOT NULL,
col2 INT NOT NULL
);
GO
INSERT INTO dbo.T1(grp, col1, col2) VALUES('A', 10, -2147483648);
INSERT INTO dbo.T1(grp, col1, col2) VALUES('A', 10, -1);
INSERT INTO dbo.T1(grp, col1, col2) VALUES('A', 10, 0);
INSERT INTO dbo.T1(grp, col1, col2) VALUES('A', 10, 1);
INSERT INTO dbo.T1(grp, col1, col2) VALUES('A', 10, 2147483647);
INSERT INTO dbo.T1(grp, col1, col2) VALUES('A', 0, 0);
INSERT INTO dbo.T1(grp, col1, col2) VALUES('A', 0, 1);
INSERT INTO dbo.T1(grp, col1, col2) VALUES('A', 0, 2147483647);
INSERT INTO dbo.T1(grp, col1, col2) VALUES('A', -10, -2147483648);
INSERT INTO dbo.T1(grp, col1, col2) VALUES('A', -10, -1);
INSERT INTO dbo.T1(grp, col1, col2) VALUES('A', -10, 0);
INSERT INTO dbo.T1(grp, col1, col2) VALUES('B', 42, 100);
INSERT INTO dbo.T1(grp, col1, col2) VALUES('B', 42, 0);
INSERT INTO dbo.T1(grp, col1, col2) VALUES('B', 42, 200);
INSERT INTO dbo.T1(grp, col1, col2) VALUES('B', 42, 30);
INSERT INTO dbo.T1(grp, col1, col2) VALUES('B', 42, 50);
INSERT INTO dbo.T1(grp, col1, col2) VALUES('B', -42, -100);
INSERT INTO dbo.T1(grp, col1, col2) VALUES('B', -42, 0);
INSERT INTO dbo.T1(grp, col1, col2) VALUES('B', -42, -200);
INSERT INTO dbo.T1(grp, col1, col2) VALUES('B', -42, -30);
INSERT INTO dbo.T1(grp, col1, col2) VALUES('B', -42, -50);
GO
Your task is to write a query that returns for each unique grp value one row, with the highest col1 value, and lowest col2 value among the rows with the highest col1 value. The desired result for the given sample data is:
grp col1 col2
---------- ----------- -----------
A 10 -2147483648
B 42 0
The TOP option doesn’t support an OVER clause, but if it did, the solution might have looked like this:
SELECT TOP (1) OVER(PARTITION BY grp ORDER BY col1 DESC, col2)
grp, col1, col2
FROM dbo.T1;
The above code of course doesn’t work in SQL Server (unfortunately), and is given here as conceptual pseudo code. For now, you can apply similar logic by using the ROW_NUMBER function like so:
WITH C AS
(
SELECT grp, col1, col2,
ROW_NUMBER() OVER(PARTITION BY grp ORDER BY col1 DESC, col2) AS rownum
FROM dbo.T1
)
SELECT grp, col1, col2
FROM C
WHERE rownum = 1;
This solution produces the desired result; however, it doesn’t perform well when T1 has a large number of rows and there’s no index on (grp, col1 DESC, col2). In such a case, the execution plan for this solution involves sorting which is very expensive. You can see for yourself by running this solution against a large set of rows. You can use the following code to populate T1 with 10,000,000 rows:
-- Large set of sample data
-- DDL for GetNums Function
IF OBJECT_ID('dbo.GetNums', 'IF') IS NOT NULL
DROP FUNCTION dbo.GetNums;
GO
CREATE FUNCTION dbo.GetNums(@n AS BIGINT) RETURNS TABLE
AS
RETURN
WITH
L0 AS(SELECT 1 AS c UNION ALL SELECT 1),
L1 AS(SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B),
L2 AS(SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B),
L3 AS(SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B),
L4 AS(SELECT 1 AS c FROM L3 AS A CROSS JOIN L3 AS B),
L5 AS(SELECT 1 AS c FROM L4 AS A CROSS JOIN L4 AS B),
Nums AS(SELECT ROW_NUMBER() OVER(ORDER BY (SELECT 0)) AS n FROM L5)
SELECT n FROM Nums WHERE n <= @n;
GO
-- Populate T1 with 10,000,000 rows in 10,000 groups
TRUNCATE TABLE dbo.T1;
INSERT INTO dbo.T1 WITH(TABLOCK) (grp, col1, col2)
SELECT
'Group' + CAST(ABS(CHECKSUM(NEWID()))%10000 + 1 AS VARCHAR(10)) AS grp,
CHECKSUM(NEWID()) % 100 AS col1,
CHECKSUM(NEWID()) AS col2
FROM dbo.GetNums(10000000) AS Nums;
Remember that in reality you won’t always be able to create the ideal indexes to support every possible request in your system, and sometimes you will choose not to create new indexes taking into consideration the negative impact those will have on modifications.
So, the challenge is: can you think of a better performing solution than the one I presented earlier based on the ROW_NUMBER function, with the given large set of sample data, and assuming you cannot create any new indexes?
Note that you are not guaranteed that the values in col1 and col2 are positive; these columns can contain negative, zero, and positive values.
Please post your solution as a comment to this entry plus e-mail it to me (itzik@solidq.com).
I’ll post an entry with the solutions next week.
Cheers and good luck!
BG
|
|