Executive Summary:
SQL Server 2005 supports three set operations, called UNION, INTERSECT, and EXCEPT. Learn how to circumvent unsupported logical phases, work around SQL Server 2005’s lack of INTERSECT ALL and EXCEPT ALL, and take advantage of the INTERSECT and EXCEPT operations’ treatment of NULLs.
|
Set operations are fundamental operations in
set theory that let you unify sets, figure out the
intersection of sets, and determine which elements
appear in one set but not the other. SQL Server
2005 supports three set operations, called UNION,
INTERSECT, and EXCEPT. A set operation operates
on the result sets of two queries (call them query1 and
query2), comparing complete rows. The general form
of a set operation is
query1
set operation: {UNION | INTERSECT |
EXCEPT}
query2
[ORDER BY ...]
The set operation you use and the result of the row
comparisons determines whether a source row appears
in the result of the set operation. UNION will return
a row if it appears in either input set, INTERSECT
will return a row if it appears in both input sets, and
EXCEPT will return a row if it appears in the first set
but not the second.
In this article I provide a few set operation tips that
you might not be aware of, including circumventing
unsupported logical phases, INTERSECT ALL and
EXCEPT ALL, and treatment of NULLs. I assume
you’re already familiar with the fundamentals of set
operations; for more information about set operations,
see the Learning Path, page 28.
Circumventing Unsupported
Logical Phases
As you can see from the general form of a set
operation that I provided, only one logical query
processing phase is allowed on the result of a set
operation—ORDER BY. The syntax of a set operation
doesn’t allow applying any other logical phases to the
result (e.g., table operators, WHERE, GROUP BY,
HAVING, TOP). However, you can use a simple trick
to get around this restriction—define a table expression
(e.g., common table expression—CTE—or derived
table) based on the query with the set operation, then
have the outer query against the table expression
apply any logical phases/query clauses that you need.
For example, if you run the query that Web Listing 1 shows in
the Northwind database, it returns customer-employee
pairs that had order activity in both 1996 and 1997.
Suppose you need to group the result by employee,
and count the number of customers each employee
handled. This can be achieved easily with a table
expression such as the one that Web Listing 2 shows.
In a similar manner, you can apply any other logical
phase to the table expression C.
Going back to the general form of the set operation,
query1 and query2 aren’t allowed to have ORDER BY
clauses because by definition a set operation operates
on unordered sets, and an ORDER BY
clause would impose order. This restriction
applies even when you use the TOP option.
Suppose you need to apply a set operation
to the result sets of two queries defined with
a TOP option and an ORDER BY clause. To accomplish
this task, you must define two table expressions
based on the TOP queries, and have the set operation
applied to two queries against the table expression,
as Web Listing 3 shows. This query returns shipping
destinations (country, region, city) that were in the top
five shipping destinations in 1997 (based on number of
orders) but weren’t in the top five in 1996.
INTERSECT ALL and
EXCEPT ALL
ANSI SQL supports two versions of each set operation—
a DISTINCT version and an ALL version. SQL
Server 2005 implements both versions of the UNION
set operation, but only the DISTINCT version of
INTERSECT and EXCEPT. In this section I explain what the two versions mean with each set operation,
and I present alternatives you can use for the missing
ALL versions.
The DISTINCT version of INTERSECT means
that as long as a row appears at least once in each set,
the row will appear once in the output. For example,
the following query returns one occurrence of each
location (country, region, city) that has at least one
customer and one employee:
SELECT Country, Region, City FROM
dbo.Customers
INTERSECT
SELECT Country, Region, City FROM
dbo.Employees;
This query returns three rows in the output, where each
qualifying location appears only once.
ANSI SQL also supports an INTERSECT ALL set
operation that wasn’t implemented in SQL Server 2005.
INTERSECT ALL returns as many occurrences of a
row as the minimum number of times that it appears
in any set. If row R appears m times in the result set
of query1, and n times in the result set of query2, R
will appear a minimum number of times (m, n) in the
result of the set operation. For example, if R appears
three times in one set and five times in the other, R will
appear three times in the result of the set operation.
You typically need to use INTERSECT ALL if you
want to pair each instance of a row from one set with
a different instance of the row in the other. Using our
last query as an example, say that each employee is
supposed to handle only one customer from the same
location, and you need to check how many customers
can be matched with an employee from the same location.
In this case, you’d need INTERSECT ALL, as in
the following code (don’t run this query):
SELECT Country, Region, City FROM
dbo.Customers
INTERSECT ALL
SELECT Country, Region, City FROM
dbo.Employees;
Unfortunately, SQL Server 2005 doesn’t support this
standard INTERSECT ALL syntax. A simple alternative
is to assign row numbers to number the occurrences
of each location, and issue the INTERSECT
operation between the two sets that include the row
numbers, as Web Listing 4 shows.
As an example, the location UK, NULL, London
appears six times in Customers and four times in
Employees. The query against Customers assigns
row numbers 1 through 6 to the occurrences of this
location, and the query against Employees assigns
row numbers 1 through 4 to the occurrences of this
location. Four occurrences of this location (with row
numbers 1 through 4) intersect; therefore this location
will appear four times in the output.
Similarly, ANSI SQL defines a set operation called
EXCEPT ALL that isn’t implemented in SQL Server
2005. If row R appears m times in the result of query1
and n times in the result of query2, R will appear a
maximum number of times (m-n, 0) in the result of
query1 EXCEPT ALL query2. For example, a location
L that appears six times in the result of query1 and four
times in the result of query2 will appear two times in
the result of query1 EXCEPT ALL query2, and zero
times in the result of query2 EXCEPT ALL query1.
You typically need EXCEPT ALL when you want
to figure out how many instances of a row from one
set can’t be paired with a different instance of the row
in the other set. Using our last query as an example,
say that each employee is supposed to handle only one
customer from the same location, and you need to
check how many customers can’t be matched with an
employee from the same location.
To achieve the logical equivalent of the missing
EXCEPT ALL operation, you can use the same trick
with the row numbers I showed earlier, only this time
you use the EXCEPT operation instead of INTERSECT,
as Web Listing 5 shows.
As I mentioned earlier, the location UK, NULL,
London appears six times in Customers and four times
in Employees. The occurrences of this location are
numbered 1 through 6 in the first set and 1 through 4
in the second. Only occurrences 5 and 6 from the first
set don’t find a match in the second; therefore in the 85
output rows you’ll find only two occurrences of UK,
NULL, London.
Continued on page 2