DOWNLOAD THE CODE:
Download the Code 93114.zip

Extracting and Archiving Data
Data Flow tasks can be complicated, so we'll start with a simple one. In Figure 2, you can see that the Promotion dimension is sourced from only one table and requires little transformation. On the Data Flow tab, pull out the OLE DB Source from the toolbox pane and place it on the design space. Configuring the OLE DB Source is trivial: Simply point it to the package connection for AdventureWorks, and enter a source query. The default is to source from a table or view, but we recommend that you always write an explicit source query, as Listing 1 shows.

One reason for the explicit query is that it lets you rename columns and perform trivial transformations such as type casts. You have many opportunities within the Data Flow pipeline to do the renaming and casting, but we recommend doing it as early as possible. The renaming and casting puts negligible additional load on the source database. It's vital that your source queries minimize the load on the source system. Be particularly cautious about joining tables, and work closely with your source system DBAs to ensure that they approve of your extract queries.

Be sure to rename the objects in your package. In this case, rename the OLE DB Source as Source from Sales SpecialOffer. By doing so, you ensure that you'll be able to make sense of what your package does when you look at it a few months later.

Your next step is to save a copy of the extracted data before you perform any additional transformation. Archiving the extract is a best practice—not a technical requirement. Your internal audit department will be happy to see that your ETL system automatically archives extracted data for an established time frame (typically a month or two).Archived extracts can be invaluable in situations in which you need to modify the ETL system and re-run a month of loads, for example.

To archive the extracted data, add the Multicast transform onto the design surface and hook it up to the source. Multicast makes two or more identical copies of the data stream. Add a Raw File Destination to the design surface, and hook it up to the Multicast. The Raw File Destination is a file format that's unique to SSIS. It's easy to configure and writes data very fast. However, the only way to read a raw file is from within an SSIS package, so if you want a person to look at the output file, it's best to use the Flat File Destination. To set up a Raw File Destination, simply specify the destination file path and name, then go to the Input Columns tab to specify the columns you want to store.

Transforming the Data
Whereas archiving the extracted data is an important peripheral task, your primary goal is to transform and write the data to the target table. In Figure 2, you can see that the transformation involves fixing up a NULL value in the MaxQty column, which identifies the maximum quantity for which a marketing promotion is valid. If there is no maximum quantity (i.e., if the promotion applies no matter how many units are purchased), the source system puts NULL in MaxQty. You'll want to replace that NULL value with the largest possible integer that will fit in the column—that is, 2,147,483,647.

Although you can use SQL Server to perform this transformation in the extract query, we recommend moving all nontrivial transformations into SSIS. Although this transformation barely qualifies as nontrivial, it makes a good example. Pull out a Derived Column transform, and hook it up to the Multicast. Replace the contents of the MaxQty column with an expression written in the SSIS expression language, as Figure 3 shows. (No doubt Microsoft had good reasons for coming up with a whole new language for SSIS expressions.) As always, rename the transform—in this case, to Replace NULLs.

Next, drag out a second Derived Column transform and hook it up to the Replace NULLs transform. Create two new columns, InsertAuditKey and UpdateAuditKey, and set them both to -1. Adding audit keys to all tables in your data warehouse is a good idea, and we'll discuss that further in a future article. Setting the keys to -1 will work for now.

You might wonder why we told you to create two Derived Column transforms in a row. You can perform multiple transformations in one Derived Column transform, so what's the value of separating them? The value is in readability and repeatability. The first transform has to do with real data, and the second one has to do with auditing metadata. You can include the auditing metadata transform as a standard component of your packages because it's always configured the same way. Sure, it's marginally less efficient when the package runs, but the difference is tiny.

Loading the Data
Finally, it's time to write the data. Drag out an OLE DB Destination, and configure it to point to the Dim-Promotion table in AWOrders. Go to the OLE DB Destination Editor's Mappings pane and ensure that the correct source column feeds all target columns. Because you changed column names in the source query, everything matches up correctly, so you shouldn't need to edit anything to achieve the mappings that you see in Figure 4.

Whenever you write data to a table, you should explicitly handle errors. Go to the Error Output tab of the Destination adapter, and change it so that error rows are redirected. If for any reason you can't insert a row into the target table, the default is to fail the transform, and hence the package. For now, simply add a Raw File Destination to the error flow to dump any bad rows into a file, and add a Data Viewer to the error flow by right-clicking the red arrow and choosing DataViewers, Add, OK. There are several kinds of data viewers available, but we've found the Grid to be most useful.

Figure 5 illustrates the completed Data Flow. You can test the package by right-clicking the package name in the Solution Explorer pane and choosing Execute Package. If everything works, you should see 16 rows added to the DimPromotion table. The error flow Data Viewer appears, but there are no error rows. The Data Flow in Figure 5 is a template for all dimension-table historical-load packages: extract, archive the extract, transform, and load.

If you re-execute the package, you'll notice that it adds 16 more rows to the dimension table. That's because the dimension's primary key is a surrogate key, and we haven't added any checks to ensure that we're not adding the same row twice. It's common for historical-load packages to delete any existing rows from the dimension table before processing begins. In the Control Flow, add an Execute SQL task to the design space, and set it up to truncate the dimension table. Connect the two tasks with a precedence constraint.

Prev. page     1 [2] 3     next page



You must log on before posting a comment.

If you don't have a username & password, please register now.

Reader Comments

The author provides little or no detail about how to do many important steps. I wouldnt call this a "how-to" article because it will leave beginner SSIS users stuck.

vette67

Article Rating 3 out of 5

To vette67 - can you be more specific about the important steps you want detailed? We may have covered or will look at covering it in future articles.

DianaMay

Article Rating 4 out of 5

 
 

ADS BY GOOGLE