SQL Server 2005, formerly code-named Yukon, will introduce an overwhelming set of new features and enhancements to different aspects of the product. One of SQL Server 2005’s major new features is its integration of the .NET framework, which will let you write stored procedures, triggers, and user-defined functions (UDFs) in any supported .NET language, such as C# or Visual Basic .NET (VB.NET). But don’t assume that T-SQL will be obsolete. The ability to develop .NET routines will complement T-SQL development rather than replacing it. In some scenarios, .NET code will be appropriate; in other scenarios, T-SQL code will be appropriate. For example, T-SQL is still the most powerful and efficient language for data access, and .NET is stronger in areas that T-SQL wasn’t designed to handle (e.g., complex algorithms, iterative logic, mathematical computations). Given the wealth of new T-SQL features and enhancements in SQL Server 2005, Microsoft obviously believes in T-SQL’s future. This column is going to focus on T-SQL development in SQL Server 2005, and compare SQL Server 2000’s T-SQL techniques with SQL Server 2005’s techniques. This month I’ll discuss the new ROW_NUMBER() function, which is one of four new analytical ranking functions (as defined by the ANSI SQL-99 OLAP extensions) that SQL Server 2005 implements.
Row Numbers
Most T-SQL programmers want to know how to calculate row numbers, but the need for row numbers is a source of passionate debate among relational database theorists and T-SQL practitioners. Row numbers are consecutive integers that your code assigns to a query’s result rows, signifying the position of a row among the other result rows according to specified order. Relational database theorists say that a table is a logical entity that represents a set, and a set has no order to its rows. So there’s no sense in assigning row numbers that signify a row position among other rows. Row numbers imply sorting and a physical view of the data. I won’t argue my side in the debate here. I’m covering the subject because many programmers need to calculate row numbers for an endless number of practical applications such as paging, TOP scenarios, ranking, and others.
An important principle to remember when writing T-SQL queries is that because a table is a logical entity representing a set, table rows have no inherent order. Calculating row numbers requires a defined order. The order-by list—list of expressions that determine the sort order—can be a single column or a combination of columns. If you want to get a deterministic result, meaning that there’s only one correct result for your query, your order-by list must uniquely identify a row. Here, I’ll cover deterministic order; next time, I’ll delve into non-deterministic order. Let’s start with our first scenario—calculating row numbers based on one unique column. Note that in the following examples, I use an Orders table that the code in Listing 1 creates and populates in both SQL Server 2000 and SQL Server 2005.
Row Numbers Based on One Unique Column
Let’s say you want to calculate row numbers for the orders in the Orders table based on the unique column orderid. Figure 1 shows the desired result. Using the following ANSI-compliant technique to calculate row numbers in SQL Server 2000 produces the desired result, but inefficiently:
SELECT
(SELECT COUNT(*)
FROM dbo.Orders AS O2
WHERE O2.orderid <= O1.orderid) AS rownum,
orderid, CONVERT(varchar(10), orderdate, 120) AS orderdate,
empid, custid, qty
FROM dbo.Orders AS O1
ORDER BY orderid
The query retrieves all rows from one occurrence of Orders, called O1, and uses a subquery in the SELECT list to calculate the row numbers. The subquery counts the number of rows in a second occurrence of Orders, O2, that have an orderid less than or equal to the current orderid in O1. The row with the minimum orderid (10001) will have one match; the row with the next orderid (10005) will have two matches; the next will have three matches, and so on. Essentially, the number of matches is the row number. Had the column determining the order not been unique, you would have gotten duplicate values from the subquery for rows with the same value in the orderid column.
The ANSI-compliant technique is inefficient even if you have an index on orderid, because for each row, the number of index rows scanned is at least the result row number. So, for n rows in the table, SQL Server scans at least (1+n)/2*n index rows. For example, my laptop took more than half an hour to run the ANSI-compliant query on a table with 100,000 rows. Fortunately, SQL Server 2005 introduces an elegant and efficient way to calculate row numbers: using the ROW_NUMBER() function. The ORDER BY list in the ROW_NUMBER() function’s OVER clause specifies the sorting criteria for the function:
SELECT ROW_NUMBER() OVER(ORDER BY orderid) AS rownum,
orderid, CONVERT(varchar(10), orderdate, 120) AS orderdate,
empid, custid, qty
FROM dbo.Orders AS O1
ORDER BY orderid
Without an index, the ROW_NUMBER() function in SQL Server 2005 scans the table only once, then sorts the rows to calculate all the row numbers. The solution in SQL Server 2000 scans the table once per each base row. I already mentioned that it took the SQL Server 2000 query more than half an hour to run on my laptop with 100,000 rows in the table. In comparison, the performance of the SQL Server 2005 query with the same table is simply astonishing—only 1 second.