DOWNLOAD THE CODE:
Download the Code 93114.zip

You probably know that you can use SQL Server 2005 Integration Services (SSIS) to build an Extract, Transform, and Load (ETL) system to populate a data warehouse. But how do you do it? What does an ETL application look like? SSIS boasts so many features that newcomers typically feel as if they've upended a jigsaw puzzle on their desktop. It's hard to put the puzzle together without seeing the big picture, especially when pieces from other puzzles are mixed in or when some pieces are missing! Our goal in this article is to provide that big picture: We'll cover basic SSIS package design and construction, and in doing so we'll provide a foundation for later studying incremental load techniques for dimensions and facts, as well as variables, scripting, and basic process auditing.

Establishing a Source
Before you start building an ETL system to populate your data warehouse, you should have developed the relational dimensional model to meet the business requirements of your user community. (For some foundational information, see "Dimensional Modeling Basics," April 2006, InstantDoc ID 49365.) You should have the source-to-target mapping that specifies where each column in the target data warehouse comes from, including a brief outline of the transformation rules. Also, you should have thought through some design and architectural concerns so that your ETL system is logical and consistent.

In this article, we use the familiar AdventureWorks database as the source system. We've already designed an abbreviated target dimensional model to hold orders data. This target model, which Figure 1 shows, has only four dimensions, including Date, which plays multiple roles. The model is simple (e.g., it has no Customer dimension), but it serves as a reasonable design pattern for building an ETL system. It illustrates dimensional-design best practices, including surrogate keys and several ways of handling changes to attributes' values.

We typically build two ETL systems: The first system loads historical data into the data warehouse, and the second system handles incremental loads. The historical load has to process more data and deal with historical incongruities, whereas the incremental load identifies new and changed rows and processes those changes. Some of the logic in the two systems will be identical, but you should plan to create two systems.

Before you start developing SSIS packages, you should do some planning. For each table in the data warehouse, think through—and document—all the data sources and transformations. The detailed system specifications you write could easily be dozens of pages long. At the very least, create a high-level map, as Figure 2 shows. In this map, the target data-warehouse tables are at the bottom, and the sources for each target table are at the top. Between the sources and targets, you should document the kinds of transformations that need to occur. (In this case, those transformations are simple.) Always note whether the dimension includes only Type 1 attributes (which are updated in place), Type 2 attributes (for which you track history), or both. Specify where you'll be capturing process metadata and performing data-quality checks.

Modularize Your Design
Although it's theoretically possible to write a single SSIS package that would populate all the tables in your data warehouse database, we don't recommend that you do so. Keep it simple by writing one package to populate each table. You should create a master package to kick off the processing of each table-specific child package. Master packages consist primarily of Execute Package tasks.

The first step is to create a project in SQL Server Business Intelligence Development Studio (BIDS) to hold the ETL application. We typically create one solution to hold all the SSIS packages, then create two projects for the historical and incremental loads. Open BIDS, and choose File, New Project. Highlight the SSIS project template. Name the project AWOrders_Hist, name the solution AWOrders, and select the Create directory for solution check box. Save the solution in a convenient location.

The next step is to create shared data sources to connect to the source databases and the target data warehouse database. By using shared data sources, all your packages will be able to share the same connection information. In our example, we're using AdventureWorks as the primary source database, and the new database AWOrders is the target. To create a shared data source, right-click Data Sources in the BIDS Solution Explorer pane and choose New Data Source.

Next, create a new empty package called Promotion, then create a package-specific connection for the source database AdventureWorks and for the target database AWOrders. Right-click the Connection Manager section at the bottom of the design space and choose New Connection From Data Source. The only Control Flow task you'll need for the first draft of the Promotion package is a Data Flow task. Find it in the Control Flow toolbox, and drag it onto the main design surface. You can edit the Data Flow task by double-clicking it or by choosing the Data Flow tab of the design surface.

Setting Up the Data Flow
The Control Flow and Data Flow design panes in BIDS look similar, but they're quite different, and their names help clarify their purposes. The Control Flow is where you control the package's logical flow. You can direct the package to take different paths under different conditions. For example, in the Control Flow, you can halt package execution if the extracted data fails a test such as a minimum number of rows.

The Data Flow pane is where data is extracted, transformed, and written to a table or file. The Data Flow efficiently processes batches of data (e.g., 10,000 rows) at a time. You can't halt package execution from within the Data Flow.

   Prev. page   [1] 2 3     next page
 
 

ADS BY GOOGLE