• subscribe
May 19, 2009 12:00 AM

Set-Based vs. Cursor-Based Solutions for Running Aggregates

Use cursors for large partitions
SQL Server Pro
InstantDoc ID #101736
Executive Summary:
SQL Server’s optimizer handles set-based solutions to running aggregates with quadratic complexity (n2)with respect to partition size, whereas it handles cursor-based solutions with linear complexity. So in terms of performance, when dealing with small partition sizes—as many as a few hundred rows per partition—you’re better off with a set-based solution. However, when dealing with large partitions—more than 500 rows—you’re better off with a cursor-based solution.

Running aggregations are calculations that continue to accumulate measures over a sequence—typically temporal, and possibly within partitions. An example of obtaining a running aggregate is finding a running sum of quantity for each employee and month, from a table that holds a row for each employee and month with measures such as quantity and value. That is, for each employee and month, you return the sum of all quantities from the beginning of the employee’s activity until the current month.

Last month I started a series of articles that explore the performance aspects and algorithmic complexity of different solutions to running aggregations. I covered two set-based solutions—one using subqueries and another using joins. (See "Subqueries and Joins for Running Aggregates.") This month I cover a solution based on cursors.

Some developers and database administrators consider cursors to be evil, even going so far as to recommend avoiding them at all costs. Indeed, set-based solutions have many advantages over cursor-based solutions in terms of being more aligned with the relational model, requiring less code, being more readable, involving less maintenance, and typically performing better. However, cursors perform better than set-based solutions in some cases. In this article I show that under certain circumstances, cursor-based solutions to running aggregations provide better performance than set-based solutions.

Sample Data
In this article I use the same Sales table as last month. Run the code in Listing 1 to create the table. The Sales table contains a row for each employee and month. The definition of the Sales table contains an attribute for the employee ID (empid), month (dt), quantity (qty), and value (val). The primary key (as well as the clustered index) is defined on the key list (empid, dt). The clustered index follows the indexing guidelines to support solutions to running aggregates—the key list should be the partitioning column(s) followed by the ordering column(s), and the aggregated measures should be covered by the index as well.

Run the code in Listing 2 to create the helper function GetNums, which returns a table of numbers with a requested number of rows. Next, run the code in Listing 3 to populate the Sales table with the desired number of partitions (employees) and the desired partition size (rows per employee).

As a basis for comparison, I’ll use last month’s set-based solution using subqueries, which Listing 4 shows. The query in Listing 4 returns a running sum of quantity for each employee and month.

Cursor-Based Solution
Listing 5 shows the cursor-based solution for our running sum request. The solution is fairly straightforward. The code defines a table variable called @Result where the result will be stored, as well as a few local variables where the values from each row will be stored. The code declares a cursor based on a query that sorts the data by empid and dt. Based on this ordering, the code fetches the cursor records one at a time. In each iteration of the loop, the code handles the current row. If the current row’s empid value is different than the previous row’s value, this indicates that the current row is the first in the partition (for the employee). In such a case, the code initializes the variable holding the current running aggregate (@sumqty) with 0. The code then adds the current quantity to @sumqty and inserts a row with the current details and the running sum into the table variable @Result. Finally the code queries the table variable to return the result.

The clustered index created on the attributes (empid, dt) as the key list is ideal for this solution. Figure 1 shows the execution plan for the query that the cursor is based on. As you can see in the plan, SQL Server performs an ordered scan of the index.



ARTICLE TOOLS

Comments
  • dianagele
    1 year ago
    May 21, 2011

    loved it.
    note: Listing 2 has a syntax in the WHERE statement?
    this line --> "SELECT n FROM Nums WHERE n"

  • WAYNE
    3 years ago
    Jun 04, 2009

    This set based method is using an inefficient triangular-join. There are other set-based methods that will blow this (and, obviously, the cursor) away.
    See http://www.sqlservercentral.com/Forums/FindPost728664.aspx for a set-based example of a one million row running update that runs in 35 seconds, and a corresponding cursor-based solution that runs in 144 seconds.

    [Reply:
    Hi wgshef,

    Thanks for your comment. The technique that you refer to as a more efficient set-based solution is one that I wouldn’t recommend using. I don’t have enough room to elaborate here; please see my response in the following blog entry: http://www.sqlmag.com/Article/ArticleID/102251/sql_server_102251.html.

    Cheers,
    BG]

You must log on before posting a comment.

Are you a new visitor? Register Here