Save time and improve performance
The SQL Server query optimizer uses a variety of algorithms and new join techniques to find the best available plan for your queries. In most cases, you can rely on its decisions. This article delves into SQL Server 7.0's new internal mechanisms for processing joins. Understanding the new algorithms available to the query processor will help you understand how the query processor works, and help you to see why you don't have to create indexes for one-time ad hoc queries. We also cover join hints, which let you take control from the query optimizer and force a certain join algorithm or table order. Join hints can be useful in the rare scenarios in which the query optimizer doesn't make the best decision. You can also use join hints to test any join algorithm's efficiency. For tips on maximizing the optimizer's performance, see "Performance Considerations," page 40.
Join Strategies
By choosing effective indexes and join strategies, the query optimizer builds an efficient plan to retrieve data. The number of logical reads and the amount of CPU time required generally determine the cost of a query plan. The fewer logical reads and the less CPU use, the lower the cost of the query plan. Three join strategies are available to the query processor: nested loops, merge joins, and hash joins. Nested-loop join was the only strategy available in previous releases of SQL Server. Microsoft introduced merge and hash joins primarily to accommodate very large databases (VLDBs) and the special needs of data warehouses. These join types can exploit large amounts of memory and the increased processing power of modern servers.
Every join operation involves two inputs, each of which is a table or a subset of data in a table. An input might not be an entire table, because you can use a WHERE clause to filter the rows in the table, or use GROUP BY to summarize them. In that case, the result of the filtering or summarizing is the input to the join operation.
Most examples in this article use a database that contains information about products, orders, and customers. The schema in Screen 1 shows the relationships of four tables we refer to.
Nested-Loop Joins
If the optimizer determines that a nested-loop join is the optimal join, it selects one table as the outer (first) table and the other as the inner table. When executing a nested-loop join, SQL Server scans the outer table row by row. For each row, it also scans the inner table, looking for matching rows. This join type is very efficient if one of the tables is small (or has been filtered so that it has only a few qualifying rows) and if the other table has an index on the column that joins the tables. The optimizer typically chooses the small table as its first input because this is the fastest method of executing the nest-loop algorithm.
For the nested-loop strategy to be effective, the inner table needs an index on the join column. This index means SQL Server doesn't need to scan the entire inner table; after accessing a row in the outer table, SQL Server can use the index on the join column to quickly find all matching rows in the inner table.
If the inner table doesn't have a useful index on the join column, the optimizer usually chooses a hash join strategy. However, if the join between the tables isn't based on equality, the optimizer chooses a nested-loop join, regardless of the tables' sizes and indexing schema.
Listing 1 contains the code to create and populate four tables. Let's look at the following query on the tables:
SELECT C.CustomerName, O.OrderDate
FROM Customers C INNER JOIN
Orders O
ON C.CustomerID = O.CustomerID
Screen 2 shows the nested-loop query execution plan. The outer loop iterates through the rows in the Orders table. Note that because a clustered index contains the data, when the execution plan calls for a clustered index scan, the query processor scans the entire table. In the absence of a clustered index, this scan is called a table scan. For each row in the Orders table, the query processor uses clustered index seek, which scans a particular range of rows from a clustered index to seek a matching row in the Customers table because the join column CustomerID has a clustered index.
The query optimizer distinguishes among three variants of nested-loop joins, according to Books Online (BOL). A search that scans an entire table or index is a naïve nested-loop join. A search that performs lookups in an index to fetch rows is an index nested-loop join (which Screen 2 shows). When SQL Server builds an index as part of the query plan and destroys it when the query completes, you have a temporary index nested-loop join. If you notice the Query Analyzer using a temporary index nested-loop join for a common join, you might want to consider running the Index Tuning Wizard to determine whether you could benefit from adding an index to the base table.
Merge Joins
If the optimizer decides that a merge join is optimal, the query execution scans two sorted inputs and merges them. For a merge join, the query processor sorts both inputs on the join column; if the two inputs to the join are already sorted on the join column (for example, if each has a clustered index on the join column), merge join is a logical choice. In other cases, if adequate memory is available, the optimizer might determine that sorting one input before joining inputs results in the most efficient join strategy.
If a one-to-many relationship exists between the inputs, SQL Server follows a series of steps for an inner join executed with a merge join strategy. First, it gets a row from each input, then compares the join columns of the rows. If the join columns match, SQL Server returns the requested columns from each row. If the columns don't match, SQL Server discards the row with the lower value and gets the next row from that input. SQL Server repeats these steps until it has processed all rows in both inputs, scanning both inputs only once.
To demonstrate a situation in which the optimizer will choose merge join, we inserted 10,000 rows into the Orders table. For each Order, we inserted five products into the OrderDetails table, resulting in 50,000 rows. Both Orders and OrderDetails have clustered indexes on the OrderID column. Then we executed the following query against the tables:
SELECT O.OrderID,
O.OrderDate,
OD.ProductID,
OD.Quantity
FROM Orders O INNER JOIN
OrderDetails OD
ON O.OrderID =
OD.OrderID
Screen 3 shows the execution plan: SQL Server scans both the Orders table and the OrderDetails table. For each row in Orders, the matching rows in OrderDetails are merged.
Prev. page  
[1]
2
3
next page