A Type Two change answers the sales team dilemma. With a Type Two change, you don't need to make structural changes to the Sales_Person_Dimension table, but you add a record. Suppose you have a record that looks like Table 1. After you have implemented the Type Two Change, two records appear, as Table 2 illustrates. Each record is related to appropriate facts, which are related to specific points in time in the Time Dimension. This capability illustrates a key benefit of the dimensional model. Although dimensional star schema models are denormalized by OLTP standards, they are partially normalized. One dimensional record may relate to thousands or even millions of fact records. Using a dimensional model drastically reduces—but doesn't eliminate—data redundancy, reduces the database size accordingly, and makes joins more efficient.

This example illustrates two ancillary but important design features. First, the primary key for the Sales_Person_Dimension table is meaningless or generalized, and is typically just a sequential number. You need to avoid designating a primary key that has business value, such as the SALES_PERSON_ID. When you design an OLTP database, you generally don't want to use a field that has meaning as a primary key. This rule is more important in a dimensional model, and even more important when you're implementing Type Two changes. If, for example, you use the SALES_PERSON_ID as the primary key and you assume this business key will always remain unique (which is a dangerous assumption), a Type Two implementation needs to accommodate multiple records per person. In the example of the Type Two change above, the SALES_PERSON_ID never changes. The person is still the same person, but his or her information changes over time. Attempting to use the SALES_PERSON_ID field would cause a primary key integrity violation when you try to insert the second row of data for that same person.

Suppose the source system for the database uses a meaningless key that is guaranteed to never change. Consider implementing the primary key as a composite key that comprises the primary key from the source system and a sequence key identifying the number of the change record in relation to the first record for that entity. Table 3 illustrates the use of a composite key.

Although this composite key can help you move data from the source, it increases the size of the key in the table and in its related fact tables. You need to weigh the efficiencies you'll gain from using the source system key against the related increase in key size. Again, you take a risk if you assume that the source system keys won't change or will always be unique. Assuming that the source system keys won't change forces the data warehouse to depend on the referential integrity of the source system. Any major change to the source system could significantly affect the data warehouse. If the data warehouse has a separate, meaningless key, the data warehouse will more likely be encapsulated well enough that a major source system change would affect the warehouse only internally and would not affect the interfaces to the warehouse.

The second design rule that this example illustrates is the lack of effective dates that identify exactly when the change in the SALES_PERSON_DIMENSION occurred. You don't need effective dates because the TIME_DIMENSION will effectively partition the data over time when it's related to the fact table. The effectiveness of this partitioning depends on the grain of the fact table. If a company slices the facts every day, the system will record the date of the change, but not the hour and minute of the change. The system will record the value the attribute had at the end of the time period.

Although Type Two changes add some complexity, they are manageable in the context of this simple dimensional model. You can design the simplest data marts with a single fact table and a few dimension tables. Multiple fact tables or a snowflake design can dramatically increase the complexity. However, before we look at these issues, let's address implementing slowly changing dimensions in a normalized model. The same issues that arise in multiple fact tables and snowflakes also arise in a normalized design.

Implementing the Change Types in a Normalized Model
In a normalized model (at the third normal form), fact tables and the time dimension might disappear, but the problems in implementing slowly changing dimensions are the same. You can implement any of the three change types, but their implementation becomes more difficult. At first, a normalized model might not seem worth considering in the context of a data warehouse. But normalized models have their place in data warehousing. You encounter dimensional models, especially ones with a single fact table, more often in departmental data marts because you can reduce a department's data requirements to a single set of facts. Enterprise data warehouses (which span the enterprise and have multiple subject areas and data sources), operational data stores (which offer realtime or near realtime data), and data staging areas often take on a more normalized style or hybrid implementation. The ultimate benefit of a dimensional model is the query efficiency for the end users. Enterprise data warehouses, operational data stores, and data staging areas need the capability to load in large amounts of data very fast. A relational design serves this need well.

You implement Type One and Type Three changes in a normalized model the same way you do in a dimensional model, but because a normalized model has no time dimension, implementing Type Two changes is different. Figure 3 illustrates the previous dimensional model translated into a normalized style. The database system uses effective start and end dates instead of the time dimension to store the period for which each record is valid. The previous sales team example before the Type Two change occurred would look like Table 4. After you implemented the Type Two Change, the example would contain two records, as Table 5 illustrates.

A Type Two change would cause an insert, as in the dimensional model, but also an update in the old record. In this normalized model, the SALES_PERSON table's primary key is a composite key of a sequential number and the record's effective start date. An update of the existing record doesn't update its primary key, but by adding a value to the effective end date, the update bounds the period for which this record is valid. The new record has a new effective start date and a NULL for the effective end date, showing that the record remains current.

For querying and reporting purposes, the most current record has an indicator flag set to Y, which means that for the given entity, this record is the most current one in the table. You'll find this marker useful when you query the system if it's an operational data store or when you load data into a dimensional data mart if the table is part of an enterprise data warehouse.

Detecting Changes When Loading Data
Determining which change types to implement is your first step in managing slowly changing dimensions. Detecting a change is the next step. How and whether you need to detect changes depends on the data from the source system. Describing the many variations of accessing the source data is beyond the scope of this article, but suffice it to say that the data load programs' complexity primarily depends on the format of the source data. An example of a complex process would be one in which the data warehouse receives a transactional feed from the source system or only a copy of the source system itself, meaning you have to do the compare snapshots of the source data. You need to compare how the incoming data currently looks to how it looked the last time the data was loaded. An example of a process that's easier to manage is one in which the extraction from the source is only a change file with an addition, update, or delete flag appended to each record.

Note that if you use a Type One and Type Two hybrid, the change you're implementing depends on the specific attributes you're changing. For example, in the Person Dimension in Figure 4, you might determine that the attribute Hobbies is worthy of only a Type One change, but an attribute such as City or State might warrant a Type Two change. An advantage of using this technique is that you can create Type Two (additional) records when you need them, which saves database space and helps query performance. A disadvantage of using this technique is that it requires an extra step to determine whether a Type Two attribute was changed. You can easily detect these changes by comparing the source attribute to the target attribute. But this comparison can become lengthy and unwieldy if you need to compare many attributes. Every attribute addition to the database would require a code change to this compare routine.

As an alternative, you can build an internal table to store the attributes that signal Type Two changes. This table requires only two fields, TABLE_NAME and ATTRIBUTE_NAME. You can then populate the table with all the attributes in the database that signal a Type Two change. The load program (bulk copy program—bcp—won't work here) then needs to query this table, find the appropriate attributes for the table it's loading, and use the result set to build a list of Type Two attributes to check for. This approach allows the designation of Type Two attributes to be dynamic, readily accommodating changes to the table structure.

Cascading Type Two Changes
In a normalized model or a snowflake dimensional model, a Type Two change to a primary key of one table can—and often does—cause the need for an update to the foreign keys of records in related tables. You need to deal with this consequence programmatically with stored procedures or triggers. In SQL Server 2000, look forward to cascading declarative referential integrity (DRI) to significantly ease this burden.

Figure 5 shows a dimensional model with a snowflake design. In this design, you can store more detailed information about the sales region without adding undue overhead to every SALES_PERSON_DIMENSION record. However, the snowflake design of the SALES_PERSON_DIMENSION and REGION table further complicates implementing Type Two changes that you need to resolve with programming logic. If you implement a Type Two change to a record in the SALES_PERSON_DIMENSION, you need to preserve only the SALES_PERSON_REGION_KEY attribute on the record. However, suppose you implement a Type Two change on the Region table. At minimum, you need to take every record in the SALES_PERSON_DIMENSION that includes the old foreign key and update it with the new foreign key for the Type Two change. And suppose a change to the SALES_PERSON_REGION_KEY is a Type Two change in the SALES_PERSON_DIMENSION table. Then for every SALES_PERSON_DIMENSION record, you need to create a Type Two change linking the record to the new Type Two change in the Region table.

Aggregations and Type One Changes
Implementing aggregate dimensions in a database can significantly improve performance for certain queries. However, the performance benefits come at a price. Because Type One changes can overwrite history, they can also change the aggregated history in any aggregated tables. Therefore, you need to implement a process to accurately and efficiently update the aggregate tables. In general, you can solve this problem in one of three ways:

Recalculate the aggregate tables. You can rebuild the entire physical aggregate dimension from scratch. This method doesn't require extra code, but the processing times might be long, depending on the data volumes.

Use virtual aggregates. With this method, you don't build an aggregate dimension. Instead, you implement the aggregated dimensions with views by using the GROUP BY parameter in your SELECT statement. This method doesn't require extra updating logic, but the query times might be long and it serves only the most basic aggregations.

Use corrected transactions. With this option, the aggregate table physically exists, and you need to write a custom program to detect changes in the detail dimension and create a transaction that updates the associated aggregate records.

Designing for Change
If you design or use a data warehouse, you can't avoid slowly changing dimensions and their considerable design and development overhead. The challenges of handling slowly changing dimensions go far beyond the examples in this article. For example, implementing slowly changing dimensions in OLAP Services presents complex challenges (for more information, see "Data Warehousing Step By Step," page 25). If you design a data warehouse with a clear understanding of the types of changes, the business need to preserve historical information, and how to properly implement history, the data warehouse will meet users' expectations well into the future.

End of Article

Prev. page     1 [2]     next page -->



You must log on before posting a comment.

If you don't have a username & password, please register now.

Reader Comments

A very good explanation of slowly changing dimensions, much better than what you would find in the textbooks. I printed it and will keep it for new team members. Thanks!

Steven Barger

Excellent article. Even amatures can easily understand!!

Shree

It is a very detail & integrated article about Implementing Slowly Changing Dimensions. I share it with my classmates.

Katherine Yeung

really helpful..

debyani

This is an excellent article to describe the DW dimension data loading design considerations.

Sumanta Gayen

Excellent article! Clear, concise, informative

Anonymous User

Article Rating 5 out of 5

This article is excellent! It is so informative that even a novice could understand. Great work!!

Anonymous User

Article Rating 5 out of 5

An excellent article. Clear, concise and informative, for both explanation and considerations required.

Anonymous User

Article Rating 5 out of 5

Very good

Phil Hartland

Article Rating 4 out of 5

Good article

Anonymous User

excellent information

Anonymous User

Article Rating 5 out of 5

excellent information

Anonymous User

Very clearly written - very helpful.

Anonymous User

Article Rating 5 out of 5

very informatice article, thanks!

Anonymous User

Article Rating 4 out of 5

This is a great article with lot of information for even the bigners

Anonymous User

Article Rating 5 out of 5

See More Comments  1   2 
 
 

ADS BY GOOGLE