• subscribe
October 26, 2011 05:48 PM

SSIS Novices’ Guide to Data Warehouses: Flattening While Staging the Data

Flattening is challenging but vital
SQL Server Pro
InstantDoc ID #136269
Downloads
136269.zip

In "SSIS Novices' Guide to Data Warehouses: Moving Data Into the Data Warehouse", I showed you the basic structure of a data warehouse whose databases contain sets of tables that store raw, staged, and dimensionally modeled data. These tables are referred to as the Raw tables, Stage tables, and Dimensional tables, respectively. I also showed you how to create a SQL Server Integration Services (SSIS) package called the Raw package. This package is used to move a near exact copy of your source data from an external location (probably the transactional database and server) to the Raw tables.

The next step in building a data warehouse is moving the data from the Raw tables to the Stage tables, which is referred to as staging the data. During the staging step, many changes are made to both the data's structure and content. The changes are driven by business rules and dimensional architectural needs.

When explaining the kind of changes that take place in the staging step, I like to refer to something called the Five F Words. Although some aspects of data warehousing can be frustrating, none of these F words will get you in trouble at work. In fact, only one has four letters, and that word is Flag.

As Figure 1 shows, the Five F Words are Flatten, Fix, Flag, Filter, and Figure. In this article, I'll cover Flatten. I like to start with this F Word because flattening data is often the hardest concept for new warehouse designers to grasp, especially if they come from a transactional database design background.

Figure 1: The Five F Words
Figure 1: The Five F Words

Flattening 101

Designing a transactional database is all about getting it into third normal form (3NF). Normalization becomes second nature to database developers, which is why data warehouse design often feels very foreign to database developers who are building their first data warehouse. Many tables within warehouses are denormalized or flattened when compared to their transactional system counterparts. Transactional systems need to load new data quickly and can return report data slowly, as reporting is a secondary, less crucial activity. Warehouses can load data slowly (often in the wee hours of the morning), but should return report and ad-hoc query data quickly, as reporting is the primary function of a data warehouse. Denormalization or flattening of data enables fast reporting.

To denormalize a database, you combine one or more tables into a single table. In a standard, no-frills dimensional database design, you flatten the various tables of the transactional system so that relationships between the tables form a star pattern. This is referred to as a star schema. (Other schemas exist, but I recommend that new designers always start with the star schema.)

As a general rule, transactions and events from the transactional database become facts, whereas lookup tables and profile data become dimensions. The fact table becomes the center of the dimensional model; each dimension hangs off the fact table, just one key relationship away from the center of the model.

Knowing which tables need to be flattened is the essence of dimensional database design. New data warehouse developers shouldn't expect to get it right the first time, every time. But take heart in the idea that it will get easier the more you do it.

Flattening StateUBags' Transactional System

In "SSIS Novices' Guide to Data Warehouses: Moving Data Into the Data Warehouse," I presented a sample scenario in which the StateUBags website sells backpacks and messenger bags to college students. As Figure 2 shows, StateUBags' transactional database system uses an Order table to track sales. The Order table rows have child records in the Order Detail table.

Figure 2: Transactional database model
Figure 2: Transactional database model

To move from a fully normalized model to a denormalized dimensional model, you typically need to combine, add, and remove tables. For this example, you need to combine the Brand, Line, Product, and Manufacturer tables into a single denormalized table, as Figure 3 shows. You then link this new dimension to the Fact Orders table through the Product ID. Similarly, you need to combine the Color and Color Group tables into a single denormalized table, which links to the Fact Orders table through the Color ID.

Figure 3: Dimensional database model
Figure 3: Dimensional database model

To demonstrate how this will create a denormalized structure, suppose that product "A" is always the color "Daffodil" and the Color Group for "Daffodil" products is always "Happy Colors." In a normalized structure, product "A" would be defined in the product table with a list of attributes. One of these attributes would be a link to the "Daffodil" value in the Color table. A link in the Color table would then note that "Daffodil" is in the Color Group named "Happy Colors."

If you have 1,000 transactions in which product "A" was sold, all 1,000 rows would point back to a single row in the Color table, which would in turn point to a single row in the Color Group table. The word "Daffodil" would be stored in the database just once in that single row in the Color table and the label "Happy Colors" would also appear just once in a single row in the Color Group table. When reports were run from the normalized database for product "A", the word "Daffodil" and the label "Happy Colors" would be assigned back to product "A" through a lookup process.



ARTICLE TOOLS

Comments
  • dclavell
    7 months ago
    Oct 27, 2011

    Not seeing the instantdoc id text box.
    if I search for 136269 it just brings up this article.

You must log on before posting a comment.

Are you a new visitor? Register Here