A well-designed data warehouse accommodates change
Handling changes to dimensional data across time is one of the trickiest aspects of designing a data warehouse. Dimensional attributes rarely remain static. An individual's address can change, sales reps come and go, and companies introduce new products and phase out old ones. Changing dimensional data can present far-ranging implications when you view the changes over time. For example, if a company reassigns a sales territory to a new sales representative, how can you record the change without making it appear as if the new sales rep had always held that territory? If a customer's name changes, how can you record the change and preserve the old version of the name? Designing a database that accurately and efficiently handles changes is a critical consideration when you're building a data warehouse. In this article, I'll define slowly changing dimensions, discuss the types of changes that a data warehouse needs to accommodate, and show you how to design and maintain a data warehouse that effectively handles change.
Slowly Changing Dimensions Defined
In a dimensional model, you're most interested in reporting on and analyzing facts and measures. However, facts and measures are relevant only when you define them in the context of their dimensions. Dimensions and their attributes, such as those in the examples above, are relatively constant, but they do change over time.
The term slowly changing dimensions is the variation in dimensional attributes over time. The word slowly in this context might seem incorrect. For example, a salesperson or store dimension, as Figure 1 shows, might change rapidly if a company reorganizes. But in general, when compared to a measure in a fact table, changes to dimensional data occur slowly.
The Types of Slowly Changing Dimensions
Within a data warehouse, you can handle slowly changing dimensions in several ways. These methods fall into various categories based on the company's need to preserve an accurate history of the dimensional changes. Ralph Kimball, author of The Data Warehouse Toolkit: Practical Techniques for Building Dimensional Data Warehouses (John Wiley & Sons, 1996), categorized dimensional changes into three typeschanges that overwrite history (Type One), preserve history (Type Two), and preserve a version of history (Type Three). The method you use to handle change depends on your company's need to maintain an accurate history of the dimension and the frequency of changes you expect. Here are definitions for each dimensional change type.
Overwriting history (Type One). A Type One change overwrites an existing dimensional attribute with new information. In the customer name-change example, the new name overwrites the old name, and the value for the old version is lost. A Type One change updates only the attribute, doesn't insert new records, and affects no keys.
Preserving history (Type Two). A Type Two change writes a record with the new attribute information and preserves a record of the old dimensional data. Type Two changes let you preserve historical data. Implementing Type Two changes within a data warehouse might require significant analysis and development. Type Two changes accurately partition history across time more effectively than other types. However, because Type Two changes add records, they can significantly increase the database's size.
Preserving a version of history (Type Three). You usually implement Type Three changes only if you have a limited need to preserve and accurately describe history, such as when someone gets married and you need to retain the previous name. Instead of creating a new dimensional record to hold the attribute change, a Type Three change places a value for the change in the original dimensional record. You can create multiple fields to hold distinct values for separate points in time. In the case of a name change, you could create an OLD_NAME and NEW_NAME field and a NAME_CHANGE_EFF_DATE field to record when the change occurs. This method preserves the change. But how would you handle a second name change, or a third, and so on? The side effects of this method are increased table size and, more important, increased complexity of the queries that analyze historical values from these old fields. After more than a couple of iterations, queries become impossibly complex, and ultimately you're constrained by the maximum number of attributes allowed on a table.
Because most business requirements include tracking changes over time, data warehouse architects commonly implement Type Two changes. A data warehouse might use Type Two changes for all attributes in all tables. As an alternative, you can implement a mix of Type One and Type Two changes at an attribute level by implementing Type 2 changes for only attributes whose historical values are important when you're slicing and dicing. For example, users might not need to know an individual's previous name if a name change occurs, so a Type One change would suffice. Users might want the system to show only the person's current name. However, if the company reassigns sales territories, users might need to track who sold what, at what time, and in what territory, necessitating a Type Two change.
Although most data warehouses include Type Two changes, you need to seriously examine the business need to record historical data. Implementing Type Two changes might be necessary, but those changes will increase the database size, degrade performance, and lengthen the development time. You need to carefully evaluate using a Type Two implementation, a Type One implementation, or a hybrid implementation.
Implementing the Types of Slowly Changing Dimensions
Although the three slowly changing dimension types are simple in concept, implementing them isn't trivial. The key to successfully implementing the various types of slowly changing dimensions is in the database design. The design of dimensional models accommodates changes to dimensions. Figure 1 illustrates a basic dimensional model with a single fact table and multiple dimensional tables, including a time dimension. Implementing the various change types is relatively easy in this simple design. For a Type One change, you find and update the appropriate attributes on a specific dimensional record. For example, to update a record in the SALES_PERSON_ DIMENSION to show a change to an individual's SALES_PERSON_NAME field, you simply update one record in the SALES_PERSON_DIMENSION table. This action would update or correct that record for all fact records across time. In a dimensional model, facts have no meaning until you link them with their dimensions. If you change a dimensional attribute without appropriately accounting for the time dimension, the change becomes global across all fact records.
Updating the SALES_PERSON_NAME field once and seeing the change across all fact records is efficient and makes good business sense if the update corrects a misspelling, for example. But suppose a salesperson transfers to a new sales team. Updating the salesperson's dimensional record would update all previous facts so that the salesperson would appear to have always belonged to the new sales team. If you want to preserve an accurate history of who was on which sales team, a Type One change might work for the Sales Name field, but not for the Sales Team field.
To solve the sales team problem, let's look at it as a Type Three change (because of the complexity, I'll look at Type Two changes last in these examples). To implement the Type Three change, you need to alter the dimension structure slightly so it looks like Figure 2. In Figure 2, a Type Three change adds two new attributesold and new Sales Team fieldsand renames one attribute to record the date of the change. Although this method might solve the sales team problem, a Type Three implementation has three disadvantages:
- You can preserve only one change per attributeold and new or first and last.
- Each Type Three change requires a minimum of one additional field per attribute and two additional fields if you want to record the date of the change.
- Although the dimension's structure contains all the data needed, the SQL code required to extract the information can be complex. Extracting a specific value isn't difficult, but if you want to obtain a value for a specific point in time or multiple attributes with separate old and new values, the SQL statements become long and have multiple conditions. Overall, Type Three changes can store the data of a change, but they can't accommodate multiple changes, nor can they adequately serve the need for summary reporting.
Prev. page  
[1]
2
next page