SideBar    Catch That Bug! September 2006, Logical Puzzle, September 2006
DOWNLOAD THE CODE:
Download the Code 92743.zip

Many T-SQL query problems involve matching rows based on their location in a sequence—typically, a temporal sequence. A classic example of such a problem is matching each customer’s current and preceding orders to compare measures and analyze trends. There are several possible approaches to a solution. Here, I compare the performance and I/O costs of a cursor solution and two set-based solutions for matching current and preceding rows and share some query optimizer tips along the way.

Exploring Customer Order Trends
To demonstrate the three solutions, I use the SalesOrderHeader table in the AdventureWorks database. The query in Listing 1 returns the columns of interest for our problem, as Web Table 1 shows in abbreviated form.

To determine trends in customer orders, suppose the sales manager asks you to return the following information for each order:

  • current customer ID
  • current order date
  • current sales order ID
  • current total due
  • the elapsed time in days between the order dates for the customer’s preceding and current orders (call it Elapsed)
  • the difference in total due for the current and preceding orders (call it Diff)
  • the percentage difference in total due for the current and preceding orders (call it Pct)

To determine the preceding order, your code should consider OrderDate as the first sort column and SalesOrderID as the second sort column. A customer can have more than one order with the same OrderDate, and SalesOrderID sorting might not reflect OrderDate sorting. Table 1 shows the desired result (in abbreviated form) for this task.

Before you start developing your solution, create the following covering index to speed up your code:

CREATE UNIQUE INDEX 
  idx_cid_od_oid_i_td 
  ON Sales.SalesOrderHeader 
  (CustomerID, OrderDate, 
  SalesOrderID) 
  INCLUDE(TotalDue); 

Note that I defined the TotalDue column as an included non-key column in the index because it doesn’t serve any sorting or filtering purpose—it’s included just for covering purposes.

You can use the SalesOrderHeader table in the AdventureWorks database to check the logic and accuracy of your solution by comparing the result you get with the desired result shown in Table 1. However, this table is too small for performance tests because it contains only 31,465 rows. For performance testing, run the code in Web Listing 1 to create in tempdb a SalesOrderHeader table that contains 1 million rows. Make sure you set the database context to AdventureWorks when checking the validity of your solution and to tempdb when testing performance.

All the performance measures I present are based on the larger table in tempdb. My performance measures also focus on server processing time by excluding the time it takes to generate the output in SQL Server Management Studio (SSMS). To exclude SSMS output time, navigate to Tools, Options, Query Results, SQL Server, Results to Grid (or Results to Text), and enable the Discard results after execution option.

Cursor-Based Solution
Listing 2 shows a straightforward cursorbased solution for the task. The cursor is based on a query that sorts orders by CustomerID, OrderDate, and SalesOrderID. The code scans the cursor rows in this order and calculates Elapsed, Diff, and Pct values by comparing measures from the current row with those from the preceding row (except when the order is the customer’s first). For each order, the code inserts a row with the current order’s attributes and calculated values into a table variable. The code inserts NULL for the calculated column values when the order is the first for the customer (i.e., when the current customer ID is different than the previous customer ID or when the current customer ID is NULL).

This solution scans the leaf level of the covering index idx_cid_od_oid_i_td once in an ordered, forward fashion. The index’s leaf level consists of 3733 pages. In I/O cost, the solution is very efficient because the code must scan all orders at least once anyway. However, cursors have a high performance overhead because of their record-by-record manipulation. For example, this solution ran in about 58 seconds on my system (against the table with 1 million rows and without considering the time it takes SSMS to generate the output). Two main elements in this cursor solution contribute to the lengthy runtime: the overhead of the record-by-record manipulation and the need to spool the calculated values in a table variable. You probably want to start exploring set-based approaches to find a faster solution.

Set-Based Solutions
Set-based solutions don’t incur the overhead of the cursor solution’s record-by-record manipulation, and they don’t require you to store the intermediate calculations in a temporary table. In addition, set-based solutions let the query optimizer generate multiple execution plans, from which it can select the most efficient plan; with cursor code, you force a very rigid plan. Still, cursor-based solutions have one important advantage over set-based solutions: They can rely on sorted data. Consequently, cursor-based solutions typically incur less I/O than set-based solutions for the same task.

In my experience, set-based solutions tend to be faster than cursor-based solutions. The first set-based solution I come up with for a given problem often isn’t the fastest solution-sometimes it’s even slower than the cursor-based solution. But with some tweaking, I usually can develop a set-based solution that’s dramatically faster than the fastest cursor solution I can devise. Such is the case here.

   Prev. page   [1] 2 3     next page



You must log on before posting a comment.

If you don't have a username & password, please register now.