DOWNLOAD THE CODE:
Download the Code 21587.zip

Tasks 12 and 13: Cleansing the data. The FoodMart Access database suffers from data-quality problems. For this exercise, let's look at the snowflake schema for the sales subject area, whose key values and structure Figure 19, page 55, shows. The sales_fact 1997 table holds foreign key references to FoodMart's key dimensions: products, time, customers, promotions, and store geography. I structured the products and store dimensions in a snowflake pattern to reflect the hierarchies for each dimension; for example, each product has a product family, department, category, subcategory, and product brand. Several rows duplicate between sales_fact_1997 (8 rows) and sales_fact_1998 (29 rows). If you want to apply star indexes and referential integrity to the star schema, you have to purge the duplicated data. This challenge is nothing new to developers with data warehouse experience; typically 80 percent of total project time is spent on data cleansing. The ETL developer has to decide whether to use set-based processing or row-based processing for the data-cleansing phase of the project. For this example, I used set-based processing. To cleanse the sales_fact_1997 table, you can run the SQL code that Listing 6 shows.

The first step in cleansing the data is to find all the rows that contain duplicate entries and create a spot to store them; in this example, the code stores the results in a temporary table. Next, it deletes the duplicate entries from the fact table. Then, the code deletes the table that it used to store the duplicates. Note that in using set-based processing to cleanse information before inserting it into the star or snowflake schema, you introduce data loss because you don't re-insert the distinct duplicate rows that contain identical key values into the table. I decided to use set-based processing in this example because I don't know enough about the underlying data to determine which of the duplicate rows is the correct one. In a real project, you place these duplicate rows in a permanent table in a data warehouse or data mart metadata database that you establish to store rows that fail the data-cleansing process. You can then examine these failed rows to determine what exception processing should occur. The data mart database also stores additional information about package execution, source data, and other key information that describes and documents the ETL processes over time. (This subject is worthy of one or more separate articles.) After cleansing the fact tables, you can create star indexes and add referential integrity.

Task 14: Creating star indexes. Task 14 creates a primary key for both the sales_fact_1997 and sales_fact_1998 tables. The primary key, which is also called a star index, is a clustered index that includes each of the fact tables' foreign keys that reference a dimension's primary key. You can realize several benefits from creating primary keys; one significant benefit is that the query optimizer can use this primary key for a clustered index seek rather than a table scan when it builds its access plan. The query optimizer takes advantage of the star index in the code example that Listing 7 shows. Note that the query-access patterns demonstrate how much the star index can speed up your queries; for example, the execution time in the query that Listing 7 shows plummeted by two-thirds when I added the star index. Using a star index in queries for very large databases (VLDBs) carries another important benefit: The query optimizer might decide to implement a "star join," which unions the smaller dimensions together before going against the fact table. Usually, you want to avoid unions for the sake of efficient database optimization. However, a star join is a valid and clever optimization technique when you consider that the fact table might be orders of magnitude larger than its dimensions.

Task 15: Adding referential integrity. The last major task in this package adds referential integrity, which links all the star schema's foreign keys to their associated dimensions' primary keys. As a general rule, adding referential integrity is beneficial because it ensures that the integrity of the data mart is uncompromised during the load phase. Administrators for large data warehouses might choose not to implement this step because of the extra overhead of enforcing referential integrity within the database engine. Cleanup, the final task, uses an ActiveX script to invoke DBCC ShrinkDatabase and to clean up the connection that the global variables are storing. A production-quality DTS package includes additional tasks, such as a mail task that sends the status of the package execution to the DBA team.

Change Is Good
Sometimes little things make a big difference. This maxim is certainly true for SQL Server 2000's DTS enhancements. The Create Foodmart 2000 package showcases two new tasks in particular: the Dynamic Properties and Execute Package tasks, which help DTS programmers implement production-quality packages. And when Microsoft added I/O capabilities to the Execute SQL task, the company established global variables as the hub of activity within a DTS package.

End of Article

Prev. page     1 2 3 [4]     next page -->



You must log on before posting a comment.

If you don't have a username & password, please register now.

Reader Comments

I'm getting a 'task failed during execution', where step 11. Copy tables has 24 times a message: 1:foodmart copy tables) substep 'DTSSTep_DTSDataPumpTask_24' failed with the following error: Login failed for user '<not displayable>'. Looks to me the access user login but can't find more...

Ben van Zanten

google has a better search

faiz1

Article Rating 1 out of 5

 
 

ADS BY GOOGLE