• subscribe
March 28, 2008 12:00 AM

Set Operations

SQL Server 2005 Tips
SQL Server Pro
InstantDoc ID #98159
Downloads
98159.zip

Executive Summary:

SQL Server 2005 supports three set operations, called UNION, INTERSECT, and EXCEPT. Learn how to circumvent unsupported logical phases, work around SQL Server 2005’s lack of INTERSECT ALL and EXCEPT ALL, and take advantage of the INTERSECT and EXCEPT operations’ treatment of NULLs.


Set operations are fundamental operations in set theory that let you unify sets, figure out the intersection of sets, and determine which elements appear in one set but not the other. SQL Server 2005 supports three set operations, called UNION, INTERSECT, and EXCEPT. A set operation operates on the result sets of two queries (call them query1 and query2), comparing complete rows. The general form of a set operation is

 query1
  set operation: {UNION | INTERSECT |
  EXCEPT}
  query2
  [ORDER BY ...]

The set operation you use and the result of the row comparisons determines whether a source row appears in the result of the set operation. UNION will return a row if it appears in either input set, INTERSECT will return a row if it appears in both input sets, and EXCEPT will return a row if it appears in the first set but not the second.

In this article I provide a few set operation tips that you might not be aware of, including circumventing unsupported logical phases, INTERSECT ALL and EXCEPT ALL, and treatment of NULLs. I assume you’re already familiar with the fundamentals of set operations; for more information about set operations, see the Learning Path, page 28.

Circumventing Unsupported Logical Phases
As you can see from the general form of a set operation that I provided, only one logical query processing phase is allowed on the result of a set operation—ORDER BY. The syntax of a set operation doesn’t allow applying any other logical phases to the result (e.g., table operators, WHERE, GROUP BY, HAVING, TOP). However, you can use a simple trick to get around this restriction—define a table expression (e.g., common table expression—CTE—or derived table) based on the query with the set operation, then have the outer query against the table expression apply any logical phases/query clauses that you need. For example, if you run the query that Web Listing 1 shows in the Northwind database, it returns customer-employee pairs that had order activity in both 1996 and 1997.

Suppose you need to group the result by employee, and count the number of customers each employee handled. This can be achieved easily with a table expression such as the one that Web Listing 2 shows. In a similar manner, you can apply any other logical phase to the table expression C.

Going back to the general form of the set operation, query1 and query2 aren’t allowed to have ORDER BY clauses because by definition a set operation operates on unordered sets, and an ORDER BY clause would impose order. This restriction applies even when you use the TOP option. Suppose you need to apply a set operation to the result sets of two queries defined with a TOP option and an ORDER BY clause. To accomplish this task, you must define two table expressions based on the TOP queries, and have the set operation applied to two queries against the table expression, as Web Listing 3 shows. This query returns shipping destinations (country, region, city) that were in the top five shipping destinations in 1997 (based on number of orders) but weren’t in the top five in 1996.

INTERSECT ALL and EXCEPT ALL
ANSI SQL supports two versions of each set operation— a DISTINCT version and an ALL version. SQL Server 2005 implements both versions of the UNION set operation, but only the DISTINCT version of INTERSECT and EXCEPT. In this section I explain what the two versions mean with each set operation, and I present alternatives you can use for the missing ALL versions.

The DISTINCT version of INTERSECT means that as long as a row appears at least once in each set, the row will appear once in the output. For example, the following query returns one occurrence of each location (country, region, city) that has at least one customer and one employee:

 SELECT Country, Region, City FROM
  dbo.Customers
  INTERSECT
  SELECT Country, Region, City FROM
  dbo.Employees;

This query returns three rows in the output, where each qualifying location appears only once.

ANSI SQL also supports an INTERSECT ALL set operation that wasn’t implemented in SQL Server 2005. INTERSECT ALL returns as many occurrences of a row as the minimum number of times that it appears in any set. If row R appears m times in the result set of query1, and n times in the result set of query2, R will appear a minimum number of times (m, n) in the result of the set operation. For example, if R appears three times in one set and five times in the other, R will appear three times in the result of the set operation. You typically need to use INTERSECT ALL if you want to pair each instance of a row from one set with a different instance of the row in the other. Using our last query as an example, say that each employee is supposed to handle only one customer from the same location, and you need to check how many customers can be matched with an employee from the same location. In this case, you’d need INTERSECT ALL, as in the following code (don’t run this query):

 SELECT Country, Region, City FROM
  dbo.Customers
  INTERSECT ALL
  SELECT Country, Region, City FROM
  dbo.Employees;

Unfortunately, SQL Server 2005 doesn’t support this standard INTERSECT ALL syntax. A simple alternative is to assign row numbers to number the occurrences of each location, and issue the INTERSECT operation between the two sets that include the row numbers, as Web Listing 4 shows.

As an example, the location UK, NULL, London appears six times in Customers and four times in Employees. The query against Customers assigns row numbers 1 through 6 to the occurrences of this location, and the query against Employees assigns row numbers 1 through 4 to the occurrences of this location. Four occurrences of this location (with row numbers 1 through 4) intersect; therefore this location will appear four times in the output.

Similarly, ANSI SQL defines a set operation called EXCEPT ALL that isn’t implemented in SQL Server 2005. If row R appears m times in the result of query1 and n times in the result of query2, R will appear a maximum number of times (m-n, 0) in the result of query1 EXCEPT ALL query2. For example, a location L that appears six times in the result of query1 and four times in the result of query2 will appear two times in the result of query1 EXCEPT ALL query2, and zero times in the result of query2 EXCEPT ALL query1.

You typically need EXCEPT ALL when you want to figure out how many instances of a row from one set can’t be paired with a different instance of the row in the other set. Using our last query as an example, say that each employee is supposed to handle only one customer from the same location, and you need to check how many customers can’t be matched with an employee from the same location.

To achieve the logical equivalent of the missing EXCEPT ALL operation, you can use the same trick with the row numbers I showed earlier, only this time you use the EXCEPT operation instead of INTERSECT, as Web Listing 5 shows.

As I mentioned earlier, the location UK, NULL, London appears six times in Customers and four times in Employees. The occurrences of this location are numbered 1 through 6 in the first set and 1 through 4 in the second. Only occurrences 5 and 6 from the first set don’t find a match in the second; therefore in the 85 output rows you’ll find only two occurrences of UK, NULL, London.

Continued on page 2



ARTICLE TOOLS

Comments
  • Swati
    4 years ago
    Apr 28, 2008

    excellent. useful.

  • Rho Trading
    4 years ago
    Apr 07, 2008

    Exceptionally useful, even beyond Itzik's usual excellence!

  • meganbearly
    4 years ago
    Apr 07, 2008

    Hi gitega,
    Thanks so much for your feedback. Below is Itzik Ben-Gan's response to your questions.

    “When the comparison columns are all that you need to return from a query, then yes, you can consider EXCEPT as an alternative to outer joins (or the NOT EXISTS predicate).
    Just remember what I explained in the article in terms of the difference between EXCEPT and the alternatives in terms of NULL comparisons.

    If you need to return also columns that you’re not comparing, a join gives you this option while a set operation doesn’t.

    Cheers,
    Itzik”

    Please feel free to contact me if you have additional questions.

    Megan Bearly
    Associate Editor, SQL Server Magazine
    mbearly@penton.com

  • Lorrin
    4 years ago
    Apr 04, 2008

    Great article!

    How would the outputs of EXCEPT statement differ from a left join when looking for rows that need to be inserted into target table ? OR Stated another way, is the EXCEPT statement an alternative to a LEFT JOIN in this use case (identify rows to insert to a target table)?

You must log on before posting a comment.

Are you a new visitor? Register Here