Consider Query 1 again: If both Employees and Orders had clustered indexes on the EmployeeID column, SQL Server could perform a merge join. The pseudocode for SQL Server's execution of the merge join would look something like this:
GET one Orders row and one Employees row
DO (until one input is empty);
IF EmployeeID values are equal
Return values from both rows
GET next Orders row
ELSE IF Orders.EmployeeID <>
Employees.EmployeeID
GET next Employees row
ELSE GET next Orders row
The query optimizer usually chooses the merge join strategy when both inputs are already sorted on the join column. If the inputs are both already sorted, little I/O is required to process a merge join if the join is one-to-many (1:M). A many-to-many (M:N) merge join uses a temporary table to store rows instead of discarding them as it usually does. If the data includes duplicate values from each input, the second input must return to the start of the duplicates in the temporary table as SQL Server processes each duplicate from the first input. However, in many cases, SQL Server won't use a merge join unless at least one of the join columns is unique.
Here's an example of joining two tables with and without a unique index. Listing 1 creates copies of the Orders and Order Details tables in the Northwind database and builds a clustered index on OrderID in both tables. When you initially join the tables and display the query plan, you'll see that SQL Server chooses a nested-loop join. Although the OrderID column is unique in the Orders table, if you don't specify unique in the index definition, the optimizer isn't aware that the key values are unique. So when you rebuild the clustered index in the Orders table, specifying that it must be unique, the revised query plan shows that SQL Server uses a merge join.
In some cases, SQL Server's optimizer might decide that it's cheap enough to sort one of the inputs before the join, then perform a merge join. If you change Listing 1 slightly so that the initial index built on the Order Details table is nonclustered, as Listing 2 shows, the query plan shows a sort operation before a merge join.
Know Your Tables
Usually, the query optimizer determines the type of join SQL Server will use. You can just write your join query and leave the choice of strategy to SQL Server. However, knowing how SQL Server performs the joins can help you decide which indexes will be most useful. The optimizer usually chooses nested-loop join if one of the join inputs is very small compared with the other input and the larger input has a clustered index on the join column. Merge join is the most likely strategy if both inputs are sorted on the join column and, in particular, if one of the inputs has a unique clustered index.
Both nested-loop and merge joins require appropriate indexes on the tables. However, if your application lets users build ad hoc queries, you might not know in advance what the best columns for indexes are. The third type of join, hash join, lets SQL Server achieve pretty good join performance even when your tables have no useful indexes. In my next column, I'll discuss hash joins.
End of Article
Prev. page
1
2
[3]
next page -->