• subscribe
April 06, 2004 12:00 AM

Calculating Row Numbers in SQL Server 2005

SQL Server Pro
InstantDoc ID #42302
Downloads
42302.zip

SQL Server 2005, formerly code-named Yukon, will introduce an overwhelming set of new features and enhancements to different aspects of the product. One of SQL Server 2005’s major new features is its integration of the .NET framework, which will let you write stored procedures, triggers, and user-defined functions (UDFs) in any supported .NET language, such as C# or Visual Basic .NET (VB.NET). But don’t assume that T-SQL will be obsolete. The ability to develop .NET routines will complement T-SQL development rather than replacing it. In some scenarios, .NET code will be appropriate; in other scenarios, T-SQL code will be appropriate. For example, T-SQL is still the most powerful and efficient language for data access, and .NET is stronger in areas that T-SQL wasn’t designed to handle (e.g., complex algorithms, iterative logic, mathematical computations). Given the wealth of new T-SQL features and enhancements in SQL Server 2005, Microsoft obviously believes in T-SQL’s future. This column is going to focus on T-SQL development in SQL Server 2005, and compare SQL Server 2000’s T-SQL techniques with SQL Server 2005’s techniques. This month I’ll discuss the new ROW_NUMBER() function, which is one of four new analytical ranking functions (as defined by the ANSI SQL-99 OLAP extensions) that SQL Server 2005 implements.

Row Numbers
Most T-SQL programmers want to know how to calculate row numbers, but the need for row numbers is a source of passionate debate among relational database theorists and T-SQL practitioners. Row numbers are consecutive integers that your code assigns to a query’s result rows, signifying the position of a row among the other result rows according to specified order. Relational database theorists say that a table is a logical entity that represents a set, and a set has no order to its rows. So there’s no sense in assigning row numbers that signify a row position among other rows. Row numbers imply sorting and a physical view of the data. I won’t argue my side in the debate here. I’m covering the subject because many programmers need to calculate row numbers for an endless number of practical applications such as paging, TOP scenarios, ranking, and others.

An important principle to remember when writing T-SQL queries is that because a table is a logical entity representing a set, table rows have no inherent order. Calculating row numbers requires a defined order. The order-by list—list of expressions that determine the sort order—can be a single column or a combination of columns. If you want to get a deterministic result, meaning that there’s only one correct result for your query, your order-by list must uniquely identify a row. Here, I’ll cover deterministic order; next time, I’ll delve into non-deterministic order. Let’s start with our first scenario—calculating row numbers based on one unique column. Note that in the following examples, I use an Orders table that the code in Listing 1 creates and populates in both SQL Server 2000 and SQL Server 2005.

Row Numbers Based on One Unique Column
Let’s say you want to calculate row numbers for the orders in the Orders table based on the unique column orderid. Figure 1 shows the desired result. Using the following ANSI-compliant technique to calculate row numbers in SQL Server 2000 produces the desired result, but inefficiently:

SELECT
  (SELECT COUNT(*)
  FROM dbo.Orders AS O2
  WHERE O2.orderid <= O1.orderid) AS rownum,
  orderid, CONVERT(varchar(10), orderdate, 120) AS orderdate,
  empid, custid, qty
FROM dbo.Orders AS O1
ORDER BY orderid

The query retrieves all rows from one occurrence of Orders, called O1, and uses a subquery in the SELECT list to calculate the row numbers. The subquery counts the number of rows in a second occurrence of Orders, O2, that have an orderid less than or equal to the current orderid in O1. The row with the minimum orderid (10001) will have one match; the row with the next orderid (10005) will have two matches; the next will have three matches, and so on. Essentially, the number of matches is the row number. Had the column determining the order not been unique, you would have gotten duplicate values from the subquery for rows with the same value in the orderid column.

The ANSI-compliant technique is inefficient even if you have an index on orderid, because for each row, the number of index rows scanned is at least the result row number. So, for n rows in the table, SQL Server scans at least (1+n)/2*n index rows. For example, my laptop took more than half an hour to run the ANSI-compliant query on a table with 100,000 rows. Fortunately, SQL Server 2005 introduces an elegant and efficient way to calculate row numbers: using the ROW_NUMBER() function. The ORDER BY list in the ROW_NUMBER() function’s OVER clause specifies the sorting criteria for the function:

SELECT ROW_NUMBER() OVER(ORDER BY orderid) AS rownum,
  orderid, CONVERT(varchar(10), orderdate, 120) AS orderdate,
  empid, custid, qty
FROM dbo.Orders AS O1
ORDER BY orderid

Without an index, the ROW_NUMBER() function in SQL Server 2005 scans the table only once, then sorts the rows to calculate all the row numbers. The solution in SQL Server 2000 scans the table once per each base row. I already mentioned that it took the SQL Server 2000 query more than half an hour to run on my laptop with 100,000 rows in the table. In comparison, the performance of the SQL Server 2005 query with the same table is simply astonishing—only 1 second.



ARTICLE TOOLS

Comments
  • Anonymous User
    7 years ago
    Feb 10, 2005

    Great article

  • BG_SQL
    8 years ago
    Dec 18, 2004

    To Peter and Anup,

    I’m afraid this is a common misconception that you can use SELECT INTO/INSERT SELECT with an ORDER BY clause, and trust that IDENTITY values would reflect the sort. We had a whole discussion on the subject when readers commented about an article I wrote about paging (InstantDoc #44138). You can find all the details there, but I copied the important part here:

    “I discussed the issue with the developers from the engine/optimizer team. I was explicitly told that they do not guarantee that the ORDER BY would be applied before assigning the IDENTITY values in SQL Server 2000. The issue came up as a result of a case where IDENTITY values did not reflect the ORDER BY clause, and you could see in the plan the compute scalar operator generating the values before the sort. I’m afraid I cannot provide details as I’m under NDA, and the whole discussion was in a private forum. This is not considered a bug, since Microsoft never guaranteed anything about the relationship between ORDER BY and IDENTITY. And this also relates to the risk in relying on a certain “expected” behavior that is not documented in BOL or elsewhere. However, since they are very well aware that so many developers assume IDENTITY values would reflect the sort, and have lots of code relying on that, I was told that they’re going to guarantee this in SQL Server 2005. But again, in SQL Server 2005 you won’t need such a guarantee, since you’ll be able to use the ROW_NUMBER() function.”

    Regards,
    Itzik Ben-Gan

  • anup
    8 years ago
    Jul 01, 2004

    We could do the above using a temp table which is much simpler and less costlier then the way you mentioned also instead of the partition.

  • Peter Ellis
    8 years ago
    May 06, 2004

    To be fair you should have mentioned that rownumbering is possible pre SQL2005 using the Identity function and temp tables. Partitioning is also possible by subtracting offsets for each partitions minimum rownumber.

    This solutions is also <1 second for 100,000 rows, but certianly involve more code as you alluded to.

    Love your work.

  • Dottie Thompson
    8 years ago
    May 06, 2004

    Exactly what I'm looking for -- new T-SDL features in SQL 2005 and how to use them. Looking forward to the rest of the articles

You must log on before posting a comment.

Are you a new visitor? Register Here