• subscribe
March 22, 2005 12:00 AM

T-SQL Tiebreakers

Get the right query results
SQL Server Pro
InstantDoc ID #45235
Downloads
45235.zip

A tiebreaker is a rule used to determine precedence between candidates when there's a tie. For example, when soccer teams have the same number of points in the league, the one with the higher goal difference takes precedence. A tiebreaker in T-SQL is a rule that determines which row your query will return if multiple rows have the same values for a given set of attributes. For example, suppose you were asked to return the most recent order for each employee from the Orders table. Returning the most recent order date for each employee requires a simple GROUP BY query with MAX(orderdate), but you also need to return other attributes besides the employee ID and the order date.

For this example, suppose you need to return the order ID, customer ID, and shipper ID of the most recent order for each employee. This request by itself is problematic because an employee row might have multiple orders with the same order date, yet you need to return one order for each employee. The request is non-deterministic because SQL Server can consider multiple rows to be the most recent for an employee according to the order date alone. Here's where you need a tiebreaker: a rule identifying a unique row for each employee. Let's look at four solutions to this problem; as we go along, I compare them in terms of their complexity and performance and recommend when to use each.

To test your solutions for logical correctness, you can use the Northwind database's Orders table. But you'll need a larger table to test performance. Run the code in Listing 1 to create and populate a large Orders table in tempdb. Listing 1 creates a table with 830,000 orders, handled by 100 different employees over a period of 365 days.

The solutions' performance might vary depending on the table size and the density of the employee IDs and dates. You can adapt the numbers in Listing 1 to generate density that more closely resembles your needs and test the performance of the different solutions.

Solutions That Use ANSI Subqueries
Before we examine different tiebreaking solutions, let's look at what happens when you don't specify a tiebreaker. Listing 2's ANSI-compliant solution uses a correlated subquery to filter orders that have an order date equal to the highest order date among the orders with the same employee ID as the one in the outer row. Figure 1 shows the result of this query against Orders in Northwind.

Notice that, in the output for employee 2, two orders have the same most-recent order date (1998-05-05). To return only one order for each employee, you need to choose a tiebreaker that, together with the order date, uniquely identifies a row for a given employee. For example, among the orders with the latest order date for employee 2, you could return the row with the highest order ID (11073). Listing 3 shows the revised query, which uses MAX(OrderID) as the tiebreaker; Figure 2 shows the query results.

The query uses two nesting levels of correlated subqueries to filter the desired orders. The outer query against Orders (O1) returns the rows with an order ID equal to the highest order ID from Orders (O2) among the rows with the same employee ID as in O1, and an order date equal to the highest order date from Orders (O3) among the rows with the same employee ID as in O1.

The recommended index for this solution, as well as most other solutions for this task, is a covering index on correlation column, sort column, tiebreaker column, and covered columns. For this example, the index would be on EmployeeID, OrderDate, OrderID, CustomerID, and ShipVia. EmployeeID is the correlation (group) column, OrderDate is the sort column, OrderID is the tiebreaker column, and I included CustomerID and ShipVia in the index for covering purposes. Remember that a covering index is an index that contains all columns referenced in the query, including attributes beyond the search keys.

This solution performs well (if you created the recommended index), it's ANSI-compliant, and its complexity level is low—so far. However, adding tiebreakers degrades performance and makes the query too complex. For example, suppose you had to use MAX(CustomerID) as the first tiebreaker, MAX(ShipVia) as the second, and MAX(OrderID) as the third and last. You can't use MAX(CustomerID) alone as a tiebreaker because EmployeeID, OrderDate, and CustomerID don't uniquely identify a row. The same applies when you add MAX (ShipVia) as the second tiebreaker. Only when you add the third—MAX (OrderID)—can you isolate one row for each employee. To revise the query, simply replace the filter in the query against O2 at callout A in Listing 3 with the filter in Listing 4, which specifies three tiebreakers. For each tiebreaker, you specify another correlated subquery, and in each such subquery, you have to correlate by the employee ID and all preceding tiebreakers.

The recommended index is on EmployeeID, OrderDate, CustomerID, ShipVia, and OrderID. EmployeeID is the correlation (group) column, OrderDate is the sort column, and CustomerID, ShipVia, and OrderID are the tiebreaker columns. The revised query is the slowest solution among the ones I present here. It ran on my laptop for more than 2 minutes against the sample data. All the other solutions ran for less than 10 seconds. Also, its complexity level is high. Unless there's a compelling advantage, I try to avoid using highly complex queries in production code.



ARTICLE TOOLS

Comments
    There are no comments to display. Be the first one!
You must log on before posting a comment.

Are you a new visitor? Register Here