SQL Server 2008 has a new feature called Change
Data Capture (CDC), a component of the relational
database engine that provides change data
for consuming applications. CDC populates change
tables by asynchronously reading a given database’s
transaction log. CDC’s most notable use is for datawarehousing
apps in which you need to detect and
process changed data on a recurring basis in extraction,
transformation, and loading (ETL) processes. With
CDC, you can detect changed data in source OLTP
systems that run on SQL Server 2008 without needing
to implement expensive custom solutions such as timestamps,
flags, triggers, and other one-off solutions for
detecting changed data.
To begin learning about CDC, let’s enable CDC for
a particular database. We’ll then designate the source
tables to have their information captured, and thereafter
any changed data will be recorded accordingly. To
use CDC, you must have SQL Server 2008 Enterprise,
Developer, or Evaluation editions.
What CDC Offers
When you enable a source table for CDC, you’re effectively
creating a capture instance. You can have as many
as two capture instances per source table. The actual capture
process is implemented with SQL Server agent jobs
calling other routines. Any change data detected is then
inserted into a separate change table per table enabled.
Once the change data has been made available for consumption
purposes, a consumer (typically an ETL process)
will query the change data using system-generated
functions by either log sequence number (LSN) or time
ranges. The generated CDC functions provide the change
data in an easy-to-consume query format.
You might be wondering when the change tables
are purged. After all, if the change tables are never
purged, they can grow to huge proportions. By default,
CDC purges data from the change data tables after
it’s three days old. Every change table defaults to this three-day purging strategy, which you might not
always consider optimal. If you need to deviate from
the default purging strategy, system procedures are
available for you to do so. If you back up and restore a
CDC-enabled database, the restored database doesn’t
have CDC enabled by default.
Finally, CDC offers the ability to capture and
propagate Data Definition Language (DDL) changes.
By default, CDC doesn’t propagate source-table DDL
changes to its corresponding change table. The only
exception to this rule is when you alter the underlying
data type of a particular column in the source table; the
change table will automatically have its corresponding
column type changed. Removed columns simply get
NULL values in their change table, and added columns
are ignored in the change table.
To enable true DDL propagation, you need to
create a second capture instance of the source table
after the DDL statements are issued. This procedure
will begin capturing change data using the new table
schema. You will then need to update any consuming
applications to use the new capture instance.
Configure CDC
To configure CDC, you use T-SQL with a SQL Server
2008 instance. We’ll be using the AdventureWorks 2008
OLTP sample database for our configuration section.
If you don’t have a copy of the AdventureWorks OLTP
sample database, you can download a copy at http://www.codeplex.com/MSFTDBProdSamples. Also,
you’ll need SQL Server Agent running.
The first step is to enable a given database for CDC.
To do so, you use the
sys.sp_cdc_enable_db
system procedure.
This procedure creates
all of CDC’s
related objects, jobs, and so on in the current database,
as follows:
USE [AdventureWorks2008];
GO
EXEC sys.sp_cdc_enable_db;
To determine whether a database has CDC enabled, we
can interrogate the sys.databases catalog view. A new
column called is_cdc_enabled in the sys.databases view
is either a 0 (not enabled) or a 1 (is enabled). The following
query determines whether the AdventureWorks
sample database has been enabled for CDC:
USE AdventureWorks2008;
GO
SELECT [is_cdc_enabled] FROM sys.databases
WHERE [name] = ‘AdventureWorks2008’
Next, we need to create our capture instance(s) by
using the sys.sp_cdc_enable_table_change_data_capture
system procedure. This system stored procedure enables
CDC on a given source table. The procedure that Web
Listing 1
shows creates the associated change tables and systemgenerated
change-data query functions.
To determine whether a given table is tracked by
CDC, you can interrogate the sys.tables catalog view. A
new column called is_tracked_by_cdc in the sys.tables
view contains either a 0 (not tracked) or a 1 (is tracked).
The following query determines whether CDC has
tracked the Sales.SpecialOffer table:
USE AdventureWorks2008;
GO
SELECT [is_tracked_by_cdc] FROM sys.tables
WHERE [name] = ‘SpecialOffer’
Create the CDC-Consuming
SSIS Package
Once you’ve configured CDC, you can create a CDCconsuming consuming
SSIS package (i.e., one that consumes the
change data). I disagree with the current SQL Server
Books Online (BOL) documentation, which states that
CDC is a new SSIS technology. CDC is
a new feature of the relational database
engine: Any tool that can make a connection
to SQL Server 2008 can use the
exposed change data.
Continue to page 2
First, let’s fire up Business Intelligence
Development Studio (BIDS),
which is accessible via Start, All Programs,
Microsoft SQL Server 2008,
SQL Server Business Intelligence Development
Studio. Select File, New, Project
from the BIDS menu. Next, choose the
Integration Services Project and type ssis_CDC_Sample in the Name textbox,
as Figure 1 shows. Click OK to create
the new SSIS project.
Next, we need to set up our package
before adding Control Flow components.
Select View, Other Windows,
Variables to display the Variables window. Use the Add
Variable toolbar command to create the variables that
you see in Web Table 1. Enter the following T-SQL
code for the value of the SqlDataQuery variable:
SELECT [SpecialOfferID],\[Description],
[DiscountPct],\[Type],\[Category],
[StartDate],\[EndDate],\[MinQty],
[MaxQty],\[CDC_OPERATION] FROM
udf_Sales_SpecialOffer(null,null)
We need to create the package’s connection managers.
Right-click in the Connection Managers pane
(located beneath the Control Flow designer's surface),
and select the New OLEDB Connection context-menu
option. Create two new local OLEDB connections; the
first should reference the AdventureWorks 2008 database,
and the second should reference the Adventure-
Works DW 2008 database. Name the new connection
managers AdventureWorks and AdventureWorksDW,
respectively.
Now that the package is set up, we can delve into
the actual Control Flow elements. First, copy an
instance of the Execute SQL Task onto the Control
Flow designer's surface and rename the task Calculate
Interval. This task will serve as the entry step toward
obtaining a datetime range that the system will use
to determine and process any change data within the
range. Right-click and select the Edit option on the
task. Configure the Calculate Interval task as Web
Table 2 shows. Configure the result sets for the Calculate
Interval task as Web Table 3 shows.
Click OK to complete the Calculate Interval task.
Next, we need to place an instance of the For Loop
container onto the Control Flow surface. Rename the For Loop container to Wait for Change Data.
This For Loop container will serve as the looping
mechanism when change data isn’t yet ready. CDC
is an asynchronous process; sometimes, no change
data will be available to be captured. Also, note
that in my design we use a date interval of six
minutes ago to one minute ago, so all change data
we attempt to process will be at least a full minute
old. Edit the Wait for Change Data container as Web Table 4 shows.
Click OK to complete the Wait for Change Data container. At this point, create a default success
constraint between the Calculate Interval and the Wait for Change data tasks. Your package’s Control
Flow should now look like Figure 2.
Next, drag an instance of the Execute SQL task
inside the Wait for Change Data container. Rename the
task Check for Data. This task will poll the cdc
.lsn_time_mapping table to determine whether
change data is ready. Configure the Check for
Data task as Web Table 5 shows. Enter Web
Listing 2’s T-SQL code for the SQLStatement
property, and refer to Web Table 6 for the task’s
parameter mappings.
Click OK to complete the Check for Data task. Next, drag an instance of the Script Task
to the Control Flow designer's surface inside
the Wait for Change Data container. Rename
the task to Delay in Seconds. This is the task
used for delaying package execution for the
case when the change data isn’t yet ready for
the calculated interval. Configure the Delay in
Seconds task as Web Table 7 shows, using the
code in Web Listing 3.
Next, we need to create a Control Flow constraint
between the Check for Data and Delay in
Seconds tasks. Click the Check for Data task and connect
the task constraint to the Delay in Seconds script
task. Now, double-click the constraint and configure
it as Web Table 8 shows. Click OK to complete the
constraint's configuration. Your package should now
look like Figure 3.
Drag another instance of the Script Task to the
Control Flow surface inside the For Loop container.
Rename the Script Task Log Extract Error. This task
serves as the point in the package’s Control Flow at
which we log an error and stop package execution. This
task will execute when no change data is present for
our calculated interval. One possible design alternative
would be to wrap this entire package’s functionality
into a second For Loop container, then simply continue
polling for change data instead of failing the package.
You could insert a Script Task or Execute SQL Task to
delay package execution for five minutes per outer iteration.
Configure the Log Extract Error script task as Web Table 9 shows, using the script in Web Listing 4.
Next, we need to create a Control Flow constraint
between the Check for Data and Log Extract Error tasks. Click the Check for Data task and connect the
task constraint to the Log Extract Error script task.
Now, double-click the constraint and configure it as Web Table 10 shows. Click OK to complete the constraint's
configuration. We’ve now completed the work
that needs to occur inside the Check for Data For Loop
container.
Open SQL Server Management Studio (SSMS) and
create a new query window connected to the Adventure-
Works OLTP sample database. Execute Web Listing 5’s
T-SQL code to create our custom user-defined function
(UDF); this function gets called from the next task in
our SSIS Package. This function returns our example’s
change data by using the Net Changes function for a
given time period. Notice that the function converts the
input parameters for the time to their equivalent LSNs.
Confirm that the function has been created,
and clear the query window’s contents by pressing
Ctrl+Shift+Delete. (We’ll need to use SSMS again a
bit later.) Now, go back to our SSIS package in BIDS
and drag another instance of the Execute SQL Task
onto the Control Flow. Rename the new Execute SQL
task Prepare Query, Use UDF. This task will create the actual T-SQL query for obtaining the change data
using our calculated datetime intervals. Configure
the Prepare Query, Use UDF task as Web Table 11 shows, and enter Web Listing 6’s T-SQL code for the
SQLStatement property. Configure Prepare Query,
Use UDF parameters as Web Table 12 shows. Finally,
configure the Prepare Query, Use UDF result set as Web Table 13 shows..
Next, create a success constraint between the Check
for Data For Loop container and the Prepare Query, Use
UDF task. Finally, drag an instance of the Data Flow
task onto the Control Flow designer's surface. Rename the task Extract and Load Change Data. Create a basic
success constraint between the Prepare Query task and
the Extract and Load Change Data task. With the package’s
Control Flow complete, it should look like Figure
4. Next, we’ll configure the Data Flow contained within
the Extract and Load Change Data task.
Continue to page 3
Create the Data Flow
Right-click the Extract and Load Change Data Data
Flow task, and select Edit. Doing so will open up the Data Flow task’s corresponding Data Flow
designer's surface. Drag an instance of the OLEDB
source from the toolbox to the Data Flow designer's
surface. Rename the source component Change Data.
Configure the Change Data source component as Web
Table 14 shows.
Click OK to complete the OLEDB source component’s
configuration. Next, copy an instance of
the conditional split transformation onto the Data
Flow designer's surface. Rename the conditional
split transformation Filter Based on Operation. This
transformation will redirect each row based on the
CDC_OPERATION column that our custom UDF
returns. Configure the conditional split transformation
as Web Table 15 shows.
Click OK to complete the Filter Based on Operation configuration. Next, drag an instance of the OLEDB
Destination component to the Data Flow designer's
surface and connect the Data Flow from the Filter
Based on Operation transformation to the new OLEDB
destination. The source Data Flow should be Inserted
Records. Rename the component Process Inserts.
Configure the Process Inserts destination component
as Web Table 16 shows. Web Table 17 shows the task's
column mappings. (Destination columns that don't
have any inputs from the Data Flow should be set to
ignore for their corresponding input column.)
Click OK to complete the Process Inserts component.
Next, drag an instance of the OLEDB command
transformation to the Data Flow designer's surface.
Rename the new OLEDB command component Process Updates. Connect the Data
Flow path from the Filter Based on Operation
transformation to the new Process Updates OLEDB command using the Updated
Records output. Configure Process Updates as Web Table 18 shows, and enter Web Listing
7’s T-SQL code for the SqlCommand property.
Web Table 19 shows the Process Updates
column mappings.
Click OK to complete the Process
Updates OLEDB command. Finally, drag
over one more instance of the OLEDB
command transformation to the Data
Flow’s surface. Connect the Data Flow
from the conditional split transformation to
the new OLEDB command using the Deleted Records output. Next, rename the new OLEDB command Process Deletes. Configure the new Process Deletes
OLEDB command as Web Table 20 shows. Enter the
Web Listing 8's T-SQL code for the SqlCommand
property. Configure Process Deletes column mappings
as Web Table 21 shows.
Click OK to complete the Process Deletes OLEDB
command configuration. The package’s Data Flow
should now look like Figure 5.
Test the CDC-Consuming
SSIS Package
First, build the SSIS package in BIDS. Go to Build,
ssis_CDC_Sample. To test our new CDC-consuming
package, we’ll submit a series of UPDATE statements
to the Sales.SpecialOffer table. These T-SQL
statements emulate a front-end application, letting
users update the company’s various special offers.
Web Listing 9’s T-SQL code is to be submitted
against the AdventureWorks OLTP 2008 database
inside your original SSMS query window used in the
above tasks. (You can clear its old content by pressing
Ctrl+Shift+Delete.)
Next, execute the package. If change data was
found, the package’s Control Flow status should look
like Figure 6. If you query the DimPromotion table
in the AdventureWorksDW database, you’ll find that
the corresponding record’s DiscountPct is updated as
well. This confirms that we captured and processed our
change data from the AdventureWorks OLTP 2008
database.
Looking to the Future
I’m impressed with the new CDC feature built into
the SQL Server 2008 relational database engine. Overall, the feature is good; it replaces the need to
create custom CDC solutions and provides increased
performance over those custom solutions because it’s
asynchronous. If your organization has SQL Server
OLTP environments, CDC is a great candidate for
your future incremental ETL solutions!