Many T-SQL query problems involve matching rows based on their location in
a sequence—typically, a temporal sequence. A classic example of such a
problem is matching each customer’s current and preceding orders to compare
measures and analyze trends. There are several possible approaches to a solution.
Here, I compare the performance and I/O costs of a cursor solution and two set-based
solutions for matching current and preceding rows and share some query optimizer
tips along the way.
Exploring Customer Order Trends
To demonstrate the three solutions, I use the SalesOrderHeader table in the
AdventureWorks database. The query in Listing
1 returns the columns of interest for our problem, as Web
Table 1 shows in abbreviated form.
To determine trends in customer orders, suppose the sales manager asks you
to return the following information for each order:
- current customer ID
- current order date
- current sales order ID
- current total due
- the elapsed time in days between the order dates for the customer’s
preceding and current orders (call it Elapsed)
- the difference in total due for the current and preceding orders (call
it Diff)
- the percentage difference in total due for the current and preceding orders
(call it Pct)
To determine the preceding order, your code should consider OrderDate as the
first sort column and SalesOrderID as the second sort column. A customer can
have more than one order with the same OrderDate, and SalesOrderID sorting might
not reflect OrderDate sorting. Table 1 shows
the desired result (in abbreviated form) for this task.
Before you start developing your solution, create the following covering index
to speed up your code:
CREATE UNIQUE INDEX
idx_cid_od_oid_i_td
ON Sales.SalesOrderHeader
(CustomerID, OrderDate,
SalesOrderID)
INCLUDE(TotalDue);
Note that I defined the TotalDue column as an included non-key column in the
index because it doesn’t serve any sorting or filtering purpose—it’s
included just for covering purposes.
You can use the SalesOrderHeader table in the AdventureWorks database to check
the logic and accuracy of your solution by comparing the result you get with
the desired result shown in Table 1. However,
this table is too small for performance tests because it contains only 31,465
rows. For performance testing, run the code in Web Listing
1 to create in tempdb a SalesOrderHeader table that contains 1 million rows.
Make sure you set the database context to AdventureWorks when checking the validity
of your solution and to tempdb when testing performance.
All the performance measures I present are based on the larger table in tempdb.
My performance measures also focus on server processing time by excluding the
time it takes to generate the output in SQL Server Management Studio (SSMS).
To exclude SSMS output time, navigate to Tools, Options, Query Results, SQL Server, Results
to Grid (or Results to Text), and enable the Discard results after execution
option.
Cursor-Based Solution
Listing 2 shows a straightforward cursorbased solution for the task. The cursor
is based on a query that sorts orders by CustomerID, OrderDate, and SalesOrderID.
The code scans the cursor rows in this order and calculates Elapsed, Diff, and
Pct values by comparing measures from the current row with those from the preceding
row (except when the order is the customer’s first). For each order, the
code inserts a row with the current order’s attributes and calculated
values into a table variable. The code inserts NULL for the calculated column
values when the order is the first for the customer (i.e., when the current
customer ID is different than the previous customer ID or when the current customer
ID is NULL).
This solution scans the leaf level of the covering index idx_cid_od_oid_i_td
once in an ordered, forward fashion. The index’s leaf level consists of
3733 pages. In I/O cost, the solution is very efficient because the code must
scan all orders at least once anyway. However, cursors have a high performance
overhead because of their record-by-record manipulation. For example, this solution
ran in about 58 seconds on my system (against the table with 1 million rows
and without considering the time it takes SSMS to generate the output). Two
main elements in this cursor solution contribute to the lengthy runtime: the
overhead of the record-by-record manipulation and the need to spool the calculated
values in a table variable. You probably want to start exploring set-based approaches
to find a faster solution.
Set-Based Solutions
Set-based solutions don’t incur the overhead of the cursor solution’s
record-by-record manipulation, and they don’t require you to store the
intermediate calculations in a temporary table. In addition, set-based solutions
let the query optimizer generate multiple execution plans, from which it can
select the most efficient plan; with cursor code, you force a very rigid plan.
Still, cursor-based solutions have one important advantage over set-based solutions: They
can rely on sorted data. Consequently, cursor-based solutions typically incur
less I/O than set-based solutions for the same task.
In my experience, set-based solutions tend to be faster than cursor-based solutions. The
first set-based solution I come up with for a given problem often isn’t
the fastest solution-sometimes it’s even slower than the cursor-based
solution. But with some tweaking, I usually can develop a set-based solution
that’s dramatically faster than the fastest cursor solution I can devise.
Such is the case here.
Prev. page  
[1]
2
3
next page