The difference can dramatically affect query performance
As a follow-on to last month's discussion of joins ("Data Retrieval with Joins," June 1999), this month I examine subqueries. You will see how, in some cases, you can replace subqueries with joins for more efficient code. Also, I examine simple and correlated subqueries and explain why a correlated subquery can drastically affect query performance.
What Is a Subquery?
A subquery is a query that SQL Server must evaluate before it can process the main query. Consider an example from the Pubs database. If someone asks, "Which books cost more than the average price?" the first step in finding the answer is to determine the average price of all the books. Only then can you look through the Titles table and find the books that cost more than the average. Screen 1 shows this query as a SQL statement and the Results Grid.
You can use a subquery anywhere you can use an expression. Always enclose a subquery in parentheses, so that the query parser can tell where the subquery begins and ends. In the SQL statement in Screen 1, the subquery is
(SELECT AVG(price) FROM titles)
The parser evaluates the subquery and substitutes the result into the main query. Running the subquery for this example gives you a value of $14.77. Therefore, the main query is asking
SELECT title, price
FROM titles
WHERE price > $14.77
In this simple subquery, the query optimizer evaluates the subquery once and uses the result to evaluate the outer query. In other words, the result is one value. For a slightly more complex example, say you want a list of books sold on a particular date. The Sales table refers only to the title_id, so you first have to get the list of title_ids for books sold on that date and then relate those title_ids to the titles in the Titles table. Screen 2 shows the SQL statement and the Results Grid for this query.
If you run the subquery
(SELECT DISTINCT title_id FROM SALES WHERE ord_date = 'Sep 14 1994')
separately, you get three title_id valuesBU1032, MC3021, and PS2091for books sold on the date specified. In other words, those books have entries in the Sales table for that date. So now the main query evaluates the SQL statement as if it reads:
SELECT title_id, title
FROM titles
WHERE title_id IN ('BU1032', 'MC3021', 'PS2091')
Although the subquery returns three values, they are all from the title_id column. This happens because a SELECT statement within another SELECT statement cannot return values to the outer SELECT statement from more than one column. It must return either a single value, a list of values from a single column, or a true or false (which is itself a single value). Running this query as a JOIN statement might be more efficient. This JOIN statement would be
SELECT DISTINCT titles.title_id, title
FROM titles JOIN sales ON titles.title_id = sales.title_id
WHERE ord_date = 'Sep 14 1994'
Screen 3 shows the SQL statement and the Execution Plan output for the subquery. The Execution Plan shows how the optimizer processes the query. If you compare the Execution Plan for the subquery and JOIN statements, you will see that the optimizer processes these queries in an identical manner. Keep in mind that these examples are simple queries against a very small database. The optimizer might handle the queries differently when dealing with a large database.
Correlated Subqueries
From a programmer's perspective, think of a subquery as a subroutine that the optimizer must run before the main routine can complete. A correlated subquery is one that depends on a value in the outer query. In programming terms, you pass the subroutine an argument, then the subroutine evaluates the query and returns a result. You move on to the next record and repeat the process. The net effect is that the subquery runs once for every row in the main query; this situation is inefficient.
A possible alternative is to rewrite the correlated subquery as a join. However, some situations require a subquery.
Suppose you want a list of all the authors in the Pubs database who have written books. Because you're not concerned with how many books each author has written or what the titles are, you can use the WHERE EXISTS command to see whether the author has an entry in the Titleauthor table. Screen 4 shows this SQL statement, which returns a list of 19 of 23 authors who have written books. Notice that in the subquery, the SELECT statement uses the asterisk, which ordinarily means return all columns. But the convention with EXISTS is that you use an asterisk, rather than specifying a column, because you do not expect SQL Server to return any data valuesEXISTS returns only Boolean values (i.e., true or false). What makes this SELECT statement a correlated subquery is that the optimizer evaluates it for each au_id in the Authors table. The optimizer passes each au_id value in the outer query to the inner query, where the optimizer uses that value to run the SELECT statement. The correlated subquery will not run alone, unlike the simple subqueries you looked at earlier. If you try to run a correlated subquery alone, SQL Server tells you that you have referenced the authors.au_id column without specifying the Authors table in the query.