Executive Summary:
You'll need to plan carefully when dealing with monster dimensions because they consume large amounts of maintenance time and CPU cycles. You can manage rapidly changing monster dimensions by splitting the stable and hot attributes and then linking them through a fact table. You can also manage rapidly changing monster dimensions by banding the hot attributes, which is where you create a set of non-overlapping value ranges for each rapidly changing attribute. |
Monster dimensions require special handling
and advanced planning, especially when
you’re dealing with rapidly changing monster
dimensions. Rapidly changing dimensions (i.e., dimensions
whose content changes as a result of input errors
or changes over time that you’re recording in place) aren’t
too much of a challenge when the dimensions are small.
However, monster dimensions consume large amounts of
maintenance time and CPU cycles because of their size, so
you need to plan carefully when you’re dealing with them.
Let’s explore two different methods for managing rapidly
changing monster dimensions: first, splitting the dimension,
and second, creating a set of non-overlapping value
ranges for each rapidly changing attribute and then creating
rows that include all possible range combinations.
Split the Dimension
You can manage monster dimensions by separating the
stable attributes from the attributes that change frequently
(i.e., the “hot” attributes), and then link the split dimensions
through a fact table. Doing so will shorten the length
of the records that are being changed as the hot attributes
are modified. This vertical partitioning offers great
flexibility. You get better performance because
you’re not reading and writing all the customers’
data all the time. One way to manage huge files,
such as monster dimensions, is to define multiple
data files per file group and to spread the data files
across multiple disks. By doing so, you’ll exploit
Windows’ native parallel I/O, which optimizes disk
reads and writes.
Figure 1 shows a split monster dimension.
At one time, all the attributes were in one large
Customer dimension. But the number of rows
grew so large that updates and retrievals became
cumbersome, so the dimension was split into the
Customer and CustDemographics dimensions.
Customer contains stable and slowly changing
data (customers’ names and addresses). Note that each customer record includes a current address and a
previous address, which is an example of creating an “old
value” column in the dimension record to store the previous
value. (For more information about creating an old
value column in the dimension, see “Data Warehousing:
Slowly Changing Dimensions,” January 2008, InstantDoc
ID 97409.) Here, this approach lets you deal with the
slowly changing address information.
In this case, we’re not interested in maintaining an
address history, but rather in how a change of address might
affect the customers’ buying habits. CustDemographics
contains all the rapidly changing attributes (i.e., income,
education, family size and status, credit, and purchases) that
correspond to factors influencing customer buying habits.
As with all dimension tables, Customer and CustDemographics
will each have a surrogate primary key. Unlike
in transactional database design, however, there’s no direct
relationship between Customer and CustDemographics.
Instead, Customer is related to CustDemographics
through a fact table, as shown in Figure 1. When you relate
the two dimensions through a fact table, you’re adhering
to a single-level star schema, which is crucial when dealing with very large data warehouses.
I don’t recommend snowflaking
the design or creating a direct
one-to-many (1:M) relationship
between two dimensions because
inter-dimensional relationships
complicate the scenario and make
it more difficult for business users
to traverse the schema.
Band Hot
Attributes
One downside to splitting a dimension
is that if you’re adding a new record to CustDemographics
whenever change occurs, you’ll have to edit and
update both CustDemographics and the fact table. Each
person in the Customer dimension will have exactly one
corresponding row in the CustDemographics dimension,
so the relationship between each dimension and the fact
table will be 1:1, instead of 1:M. You’ll also be wrestling
with not one but two monster-sized tables.
There are two ways around this problem: You can overwrite
the values in CustDemographics as change occurs, or
you can “band” the values in CustDemographics. (Note:
Use this approach only if you don’t care about the history
of change.) Banding an attribute means creating a set of
non-overlapping value ranges for each rapidly changing
attribute, and then creating rows in the CustDemographics
dimension that include all possible range combinations.
For example, if the values for Cust_Income (in Figure 1)
were banded rather than discreet, you would see boundary
values, such as $20,000 or $40,000, instead of discreet numbers
(e.g., $75,357). You pre-populate CustDemographics
with all possible combinations of these banded values, but
you can predict how many rows will be in the dimension.
If you have six attributes and each attribute is limited to 10 possible values, then the maximum
number of rows in the dimension will be
106, or 1,000,000 rows. Now the relationship
between each dimension and the fact table is
1:M, and you can capture change as it happens,
within the limits defined by the banded
values. When a customer’s income increases
or decreases enough to move it from one
income band to another, then the change is
recorded in the fact table.
Figure 2 shows banding. Note that the
boundary values (e.g., $20,000, $40,000)
are “range right” (i.e., a value equal to
$20,000 would be in the $20,000–$40,000
range). Banding causes CustDemographics
to become a static list of values. Now, instead
of having to edit or update both CustDemographics
and the fact table, you have to edit or
update only the fact table.
Also, banding reduces the fact table’s
size. In the unbanded case, the relationship between CustDemographics and the fact table is 1:1, so for every row
in CustDemographics there’s a corresponding row in the
fact table. With banding, although the CustDemographics
dimension is larger (because it contains all possible combinations
of banded boundary values), not every combination
will be used, and not every change in a customer
demographics attribute will be recorded in the fact table.
The fact table contains an entry for each customer
during the first recorded time period—in this case, the
year 2005. However, you can expect some customers
to not have yearly entries in the fact table after the first
recorded time period. For example, you can see in Figure
2 that CustKEY 1 (Abby) has an entry for 2005 and 2006,
but not for 2007. The implication is that the changes that
happened to Abby in 2007 weren’t enough to change her
demographics from 2006.
One drawback to banding is that by restricting fact table
sampling to ranges instead of discreet values, you can miss
fine details. Also, once you’ve chosen the boundary values,
you’re stuck with them. You can combine two adjacent values
(i.e., make a wider band), but you can’t create smaller bands
and stay consistent with data you’ve already collected.
Be wary of creating too many bands because the total
number of rows in the banded dimension is the number
of boundary values raised to the power of the number of
attributes you’re banding. Consider splitting CustDemographics
even further. Dimensions such as income/education
and family demographics offer even more flexibility
than the original CustDemographics dimension.
If you’ve been following the discussion thus far, you’re
probably already asking “Isn’t this demographic fact table
going to be monstrously large?” The short answer is yes,
and next month, I’ll discuss how to partition monstrously
large fact tables.