Good indexes can help SQL Server process your JOIN queries
Indexes on a table can help you access one or many rows of data. Having a good index for your query is one of the best ways to improve performance. When your query is attempting to find just a few rows in a very large table, the performance difference between having no usable index on the table and having a good index can be enormous. When you're tuning queries that involve multiple tables (i.e., joins), indexes can also help SQL Server efficiently find the matching rows between tables.
The SQL Server query optimizer can choose from three strategies for performing joins: nested-loop join, merge join, and hash join. In this article, I describe the first two join strategies: nested loop and merge. (For more information about join strategies, see Itzik Ben-Gan and Kalen Delaney, "Advanced JOIN Techniques," December 1999, InstantDoc ID 7443.) In each method, the tables you want to join (or subsets of tables that you've restricted with conditions in the WHERE clause) are the join inputs. If your query involves WHERE conditions in addition to the join, SQL Server might apply the WHERE condition before trying to find the matching rows in the second table. Consider example Query 1 from the Northwind database:
-- Query 1:
SELECT LastName, FirstName, OrderID, OrderDate
FROM Employees e JOIN Orders o
ON e.EmployeeID = o.EmployeeID
This join returns all the orders from the Orders table; for each order, it also returns the last name and first name of the employee who took the order and the order date. For Query 1, SQL Server examines all the rows in the Employees and the Orders tables, and the join inputs are the entire tables. However, if you qualify this SELECT statement with two WHERE conditions, as Query 2 shows, the join inputs are no longer complete tables:
-- Query 2:
SELECT LastName, FirstName, OrderID, OrderDate
FROM Employees e JOIN Orders o
ON e.EmployeeID = o.EmployeeID
WHERE OrderDate < '1996-12-01'
AND LastName < 'D'
For Query 2, the join inputs are much smaller than in Query 1. Instead of joining the 9 rows in Employees with 830 rows in Orders, SQL Server must join only 2 rows from Employees with 121 rows from Orders. With a small number of input rows, the query optimizer frequently chooses a different join strategy than it would if the inputs were larger, and it might even choose to join the tables in a different order. As far as the optimizer's join strategy decisions are concerned, size does matter.
Nested Loops
Even if your query joins more than two tables, SQL Server performs the joins by joining only two inputs at a time, and each join in a query might use a different join strategy. The most straightforward type of joinand the type of join most people think about when planning join operationsis the nested-loop join. You can imagine SQL Server operating on the two inputs as if they were arrays in a development language like C or Basic. SQL Server compares each row in one of the inputs with all the rows in the other input to find the matches between the rows. Query 1 tries to find the rows that match on the EmployeeID column. So with the nested-loop join strategy, SQL Server must compare all the Employee ID values in one table with all the EmployeeID values in the other.
The worst-case scenario for a nested-loop join happens when no existing indexes can help SQL Server find the matching rows between the inputs and no indexes can help find the rows that satisfy any WHERE conditions. In this case, the inputs are the entire tables. The query optimizer chooses one table to be the outer table and accesses its rows first. Let's assume the outer table has P1 pages and R1 rows. The second table, the inner table, has P2 pages. SQL Server must read all the pages in the outer table; for each qualifying row on each page, it must then read all the pages from the inner table. To find the number of pages that SQL Server needs to read to produce the results, you can use the following formula:
P1 + R1 * P2
Even if the tables are relatively small, the resulting number of pages to be read gets large very quickly. For an outer table with only 200 pages and 4000 rows (i.e., 20 rows per page) and an inner table with 100 pages, the result is quite large. Tables with 100 or 200 pages aren't unusually big tables, but to process the join, SQL Server would need to access more than 400,000 pages if the tables have no usable indexes.
Indexes can help improve the performance of a nested-loop join in several ways. The biggest benefit often comes when you have a clustered index on the joining column in one of the tables. The presence of a clustered index on a join column frequently determines which table SQL Server chooses as the inner table. If the inner table has a clustered index, SQL Server doesn't have to look through every row of that table. The clustered index leads SQL Server directly to the rows in the inner table that have a join column value that matches the current row in the outer table. So in the formula, instead of the R1 * P2 term, which implies that SQL Server accesses all P2 pages, you can replace P2 with two or three page accesses, depending on how many levels the clustered index has. So the example with 200 pages and 4000 rows in the outer table and 100 pages in the inner table results in 200 + 4000 * 3, or 12,200, page readsa big improvement over 400,000.
Prev. page  
[1]
2
3
next page