Executive Summary:
Some data modification tasks involve merging source data into a target table. Before Microsoft SQL Server 2008 you had to carry out such tasks with multiple modification statements, requiring you to access both the source and the target multiple times, and causing complexities in terms of handling the modifications as an atomic unit. Microsoft SQL Server 2008 adds the standard MERGE statement to simplify and optimize these tasks.
|
Some data modification tasks involve merging
source data into a target table. Such tasks are
common in both OLTP and data warehouse
environments. A simple example from an OLTP environment
is modifying a target Customers table with more
recent customer information from a source file. You’d
need to overwrite the attributes of existing target customers
with the newer information, as well as insert into
the target any customers in the source that don’t already
exist in the target.
Examples from a
data warehouse
environment
include handling
slowly changing
dimensions and merging fact additions and changes
into a target fact table as part of your daily extraction,
transformation, and loading (ETL) process.
Before SQL Server 2008 you had to carry out such
tasks with multiple modification statements, requiring
you to access both the source and the target multiple
times, and causing complexities in terms of handling
the modifications as an atomic unit. SQL Server 2008
adds the standard MERGE statement to simplify and
optimize these tasks.
MERGE Fundamentals
I’ll use a simple example to help familiarize you with
the basic syntax and concepts of the MERGE statement.
Suppose you have a target Customers table in
your OLTP environment, and you want to merge more
recent customer info from a staging table called CustomersStage
into the Customers table. Run the code
in Web Listing 1 (www.sqlmag.com, InstantDoc ID
97963) to create the Customers and CustomersStage
tables and populate them with sample data. This code
creates the tables in the tempdb database, for demonstration
purposes only. (Under normal circumstances,
these tables are in a user database.)
Table 1 and Table 2 show the current contents of
the Customers and CustomersStage tables. I’ll refer
to the Customers table as the target and the CustomersStage
table as the source. Updating the Customers
table with info from the CustomersStage table involves
two actions:
- Identifying customers that appear in both the
source and the target (based on a match between
the source and target custid values), then updating
target customers by overwriting all their attributes
(except custid values) with the corresponding source
customer attributes.
- Identifying customers in the source that don’t exist
in the target (based on the same condition), and
inserting those customer rows into the target.
To achieve this task in SQL Server versions prior to
2008, you’d probably use an UPDATE statement based
on an INNER JOIN followed by an INSERT statement
with a NOT EXISTS predicate or an OUTER
JOIN filtering only outer rows. Such a solution would
require you to access the tables twice. Also, such a
solution would complicate things in terms of handling
the operation as an atomic one—more specifically, in
avoiding conflicts with other INSERT operations to
the target that take place between your UPDATE and
INSERT operations.
In SQL Server 2008 you can use the new standard
MERGE statement, as the code in Listing 1 shows, to
achieve this task simply and efficiently. The semantics
of the MERGE statement are very similar to those of
an OUTER JOIN, with sections defining the target,
the source, and the action to take when a match occurs
or doesn’t occur.
The MERGE INTO clause defines the target table
(e.g., Customers) and optionally assigns it an alias (e.g.,
TGT). The USING clause defines the source table
(e.g., CustomersStage) and optionally assigns it an alias
(e.g., SRC). The ON clause defines the MERGE
condition.
In the WHEN MATCHED THEN clause, you
define an action to take against the target when the
source row has a matching target row based on the ON
predicate. In our example, the action is UPDATE—
that is, update existing target customers (2, 3, and 5)
with the more recent customer info from the source.
Notice the syntax of the UPDATE statement is
very similar to that of a plain UPDATE statement,
with one difference: You omit the name of
the target table because you already defined it in
the MERGE INTO clause. You can think of the
action in this clause as single row operation—one
UPDATE per each source row that finds a match
in the target based on the ON clause. However,
in practice, the MERGE statement is applied as a
single set-based operation for all rows.
In the WHEN NOT MATCHED THEN
clause, you define an action to take against
the target when the source row doesn’t find a
matching target row based on the ON predicate.
Going back to JOIN semantics, you can think of
the operation as source LEFT OUTER JOIN target.
The WHEN MATCHED THEN clause defines the
action to take in case of matches—namely, with inner
rows; the WHEN NOT MATCHED THEN clause
defines the action to take in case of non-matches—
namely, with outer rows. In our example, the action
is INSERT—that is, insert into the target source customer
rows that don’t exist in the target (6 and 7). The
statement is similar to a plain INSERT statement, with
one difference: Similar to the WHEN MATCHED
THEN clause, you don’t need to specify the target table
name because you already defined it in the MERGE
INTO clause.
In order to always start with the same sample data,
run the MERGE statement in each example in the
context of an explicit transaction, then roll it back
when finished (after querying the Customers table to
see the changes). For example, run MERGE statements
as follows:
BEGIN TRAN;
MERGE ...;
SELECT * FROM dbo.Customers;
ROLLBACK TRAN;
Continue on Page 2