Executive Summary:
With slowly changing dimensions you can respond in one of three ways. You can overwrite the old data with the new data, which works best when the old data has no significance and can be discarded. You can create a new dimension record for the time period that defines the change, which works best with changes that neatly partition history. You can also create an "old value" column in the slowly changing dimension to store the previous value, which works best with soft changes.
|
Change is inevitable; nothing stays the same for
long. That’s why such careful precautions are
taken to minimize the impact of change in
transactional database design. We choose primary key
candidates, which have the least possibility of being modified,
because we know the complexity that’s involved in
changing a primary key value in a typical OLTP database.
Change even happens in data warehouses, especially in
data warehouses whose sole purpose is to archive historical
data. In fact, historical archiving mandates that you
have a scheme in place to capture changes that will happen
to people, places, and things as time passes.
In data warehousing, we’re faced with developing
templates or standard ways of defining and addressing
content changes in the data warehouse. You’ve probably
heard of slowly changing dimensions, which are dimensions
whose content changes over time. I’m not talking
about just adding records to the dimension; I’m talking
about changing existing records. Although this seems
counterintuitive in a data warehouse environment, it’s not
an unusual situation, and it’s often resolved by modeling
the change on a sequence of templates. Which template
you use depends on the kind of slow change your data
warehouse dimension content is experiencing.
Your Response to Change
When change happens, how will you react? With slowly
changing dimensions, you can respond in one of three
ways: you can overwrite old data with new data (I call
this a type 1 response), you can create a new dimension
record for the time period that defines the change (I call
this a type 2 response), or you can create an “old value”
column in the dimension record to store the previous
value (I call this a type 3 response). The response type you
choose will depend on your organization’s requirements
for maintaining and archiving data.
Type 1 Response: Overwrite the
Old Data Value
This response type involves overwriting the old data value
when the old value has no significance and can be discarded,
such as when you’re correcting an error. Typically,
you’ve written the extraction, transformation, and loading
(ETL) routines correctly, so there shouldn’t be any errors,
but despite our best efforts, mistakes happen sometimes.
Before you decide to respond to a slowly changing
dimension by overwriting the old data value, make
sure that you can discard a previous value. Some
organizations, such as financial and audit organizations,
are so bound by compliance requirements
that overwriting the old data value isn’t an option.
Type 2 Response: Create a New
Dimension Record
This response type is useful when you have audit trails and
other changes that neatly partition history. For example,
prior to 12:14:55 a.m. the flow converter was running at 95
percent capacity; after this time it was operating at only 50 percent capacity. Adding a record to capture this change in
flow rate is the technique most often used to track measurable
values, such as employee job performance, customer
status, and product availability. To effectively partition
time so that there’s no overlap in the periods measured,
add an EffectiveDate (i.e., the date/time this change first
appeared) column and an EndDate (after which date/time
this value is no longer valid) column to each record, as
shown in Table 1, page 27.
Note that in Table 1, the current condition (i.e., the
current job description) is represented by the row in which
the end date is NULL. As you recall from transactional
database design theory, NULL can mean that at the time
the measurement was taken there was no value; that there
might have been a value, but at the time, no one knew what
that value was; or that the measurement wasn’t applicable
to this record. In this case, there’s no value for the end date
on record number four because that record represents the
current situation.
Before using a type 2 response to change, make sure
that the data mining software your company is using
can understand and compensate for NULL—not every
product can. If you determine that you can’t use NULL,
leave the end date as an empty string (which SQL Server
2005/2000 will convert to a value of 1/1/1900 if you’re
using a datetime data type) and add another column called
CurrentRecord that can be flagged on or off.
Don’t confuse the dates in this slowly changing dimension
with the dates in any associated fact table. The dimension
dates are used to partition sets of records into time
slices. It’s not appropriate to join these dimension dates to
the dates in a fact table.
When you’re querying a type 2 dimension and the query
keys on an identifier that represents just a single time slice,
then you can expect only a single record to be returned. If
you query attributes that aren’t part of the slowly changing
dimension criteria, multiple records will be returned.
When employing a type 2 response, you must use a
surrogate (e.g., the SQL Server IDENTITY property) for
the primary key because you’ll have multiple versions of
the same record as that record changes over time. I don’t
recommend trying to use a meaningful natural key concatenated
to a date/time value as a dimensional primary key.
Type 3 Response: Create an Old
Value Column
“Soft” changes (i.e., tentative changes or changes resulting
because you want to track a single attribute value for a
limited time—examples of soft changes might be sales
regions or geographic areas) lend themselves to a type 3
response. In this design, you create an additional column
in the slowly changing dimension for the old value. When
change happens, just copy the current value into the old
value column (e.g., Old Discount Percent in Web Table 1, InstantDoc ID 97409) and insert the
new value into the current value column. There’s no need
to create a new dimension record. The type 3 response is
best used when the time slices induced by change aren’t
clear-cut (in other words, the change was gradual) or
where the time slices overlap.
The benefit to using a type 3 response is that all
production applications will continue to return the most
current results, with no need to modify any queries. You
can access the previous value by creating a new query that
refers to the old value column. If you want to incorporate
a limited history, add another column called Change Date
with a default value of getdate() or current_timestamp, as
shown in Web Table 1. Then you can not only access the
previous value, but also see when the change occurred.
The problem with using a type 3 response is that when
you have more than one soft change per dimension, the
table schema begins to get very messy, especially if you’re
tracking when each change happened. The row size in bytes
can quickly grow, adversely affecting file I/O and adding
to the length of time necessary for query responses and
joins. If you have multiple soft changes in one row of the
slowly changing dimension, then you must strictly adhere
to strong naming standards for the old value and Change
Date columns. Failure to do so could result in confusion
and erroneous data on your reports. This response type is
best used with a single soft change for each dimension.
Rates of Change
This discussion of slowly changing dimensions wouldn’t
be complete without mentioning the rates of change.
Although these dimensions are called slowly changing
dimensions, they don’t always change slowly. Even more
important, slow change to one organization could be rapid
change to another.
When you want to capture change in a dimension, and
the change is in disjointed time slices, then it’s usually best
to create a new column to record the change for dimensions
that are a “normal” size, whether they’re changing
slowly or rapidly. I say “usually” because it’s very hard (if
not impossible) to set absolute rules for designing data
warehouses; sometimes you simply have to look at the
specific situation and make a judgment call.
What constitutes a “normal” size for a dimension?
Typically about 300 to 100,000 records. For example,
a company might have 100,000 employees in its HR
warehouse, and a value in each employee record might
be modified once a month. Even if the employee record
is long (e.g., 2000 bytes), it’s still a good candidate for the
type 2 response mentioned above.
Rapidly changing dimensions aren’t too much of a
problem when the dimensions are small, but how do you
handle rapidly changing monster dimensions? Rapidly
changing monster dimensions warrant their own article
because each design technique is moderately complicated
and has drawbacks. Next month, I’ll discuss how to
manage rapidly changing monster dimensions.
End of Article