Executive Summary:
SQL Server’s optimizer handles set-based solutions to running aggregates with quadratic complexity (n2)with respect to partition size, whereas it handles cursor-based solutions with linear complexity. So in terms of performance, when dealing with small partition sizesas many as a few hundred rows per partitionyou’re better off with a set-based solution. However, when dealing with large partitionsmore than 500 rowsyou’re better off with a cursor-based solution.
|
Running aggregations are calculations that continue to accumulate measures over a sequencetypically temporal, and possibly within partitions. An example of obtaining a running aggregate is finding a running sum of quantity for each employee and month, from a table that holds a row for each employee and month with measures such as quantity and value. That is, for each employee and month, you return the sum of all quantities from the beginning of the employee’s activity until the current month.
Last month I started a series of articles that explore the performance aspects and algorithmic complexity of different solutions to running aggregations. I covered two set-based solutionsone using subqueries and another using joins. (See "Subqueries and Joins for Running Aggregates.") This month I cover a solution based on cursors.
Some developers and database administrators consider cursors to be evil, even going so far as to recommend avoiding them at all costs. Indeed, set-based solutions have many advantages over cursor-based solutions in terms of being more aligned with the relational model, requiring less code, being more readable, involving less maintenance, and typically performing better. However, cursors perform better than set-based solutions in some cases. In this article I show that under certain circumstances, cursor-based solutions to running aggregations provide better performance than set-based solutions.
Sample Data
In this article I use the same Sales table as last month. Run the code in Listing 1 to create the table. The Sales table contains a row for each employee and month. The definition of the Sales table contains an attribute for the employee ID (empid), month (dt), quantity (qty), and value (val). The primary key (as well as the clustered index) is defined on the key list (empid, dt). The clustered index follows the indexing guidelines to support solutions to running aggregatesthe key list should be the partitioning column(s) followed by the ordering column(s), and the aggregated measures should be covered by the index as well.
Run the code in Listing 2 to create the helper function GetNums, which returns a table of numbers with a requested number of rows. Next, run the code in Listing 3 to populate the Sales table with the desired number of partitions (employees) and the desired partition size (rows per employee).
As a basis for comparison, I’ll use last month’s set-based solution using subqueries, which Listing 4 shows. The query in Listing 4 returns a running sum of quantity for each employee and month.
Cursor-Based Solution
Listing 5 shows the cursor-based solution for our running sum request. The solution is fairly straightforward. The code defines a table variable called @Result where the result will be stored, as well as a few local variables where the values from each row will be stored. The code declares a cursor based on a query that sorts the data by empid and dt. Based on this ordering, the code fetches the cursor records one at a time. In each iteration of the loop, the code handles the current row. If the current row’s empid value is different than the previous row’s value, this indicates that the current row is the first in the partition (for the employee). In such a case, the code initializes the variable holding the current running aggregate (@sumqty) with 0. The code then adds the current quantity to @sumqty and inserts a row with the current details and the running sum into the table variable @Result. Finally the code queries the table variable to return the result.
The clustered index created on the attributes (empid, dt) as the key list is ideal for this solution. Figure 1 shows the execution plan for the query that the cursor is based on. As you can see in the plan, SQL Server performs an ordered scan of the index.