Executive Summary:
Occasionally, there are miscellaneous attributes, such as yes/no attributes or comment attributes, that don’t fit into tight star schemas. Rather than discarding flag fields and yes/no attributes, place them in a junk dimension. In addition, you can handle comment and open-ended text attributes by creating a text-based junk dimension. |
When you’re designing a data warehouse,
you’ll occasionally run into attributes from
the source database(s) that don’t fit into
neat, tight star schemas. We’ve all seen OLTP tables
that are full of flag fields and yes/no attributes, many
of which are used for operational support and have no
documentation except for the column names and the
memory banks of the person who created them. So
how should you handle open-ended text and comment
attributes, many of which are badly designed in the
OLTP schemas? Not only do those types of attributes
not integrate easily into conventional dimensions such
as Customer, Vendor, Time, Location, and Product,
but you also don’t want to carry bad design into the
data warehouse. However, some of the miscellaneous
attributes will contain data that has significant business
value, so you have to do something with them.
There are three conventional ways to deal with
these attributes: discard all of the miscellaneous attributes,
eliminating them from the dimensional design;
incorporate the miscellaneous attributes into the fact
table; or make each miscellaneous attribute a separate
dimension. However, all of these options are less than
ideal. Discarding the data can be dangerous because
the miscellaneous values, flags, and yes/no fields might
contain valuable business data. Including the miscellaneous
attributes in the fact table could cause the fact
table to swell to alarming proportions, especially if you
have more than just a few miscellaneous attributes.
The increased size of the fact table could cause serious
performance problems because of the reduced number
of records per physical I/O. Even if you tried to index
these fields to minimize the performance problems, you
still wouldn’t gain anything because so many of the
miscellaneous fields contain flag values such as 0 and
1; Y and N; or open, pending, and closed. (For more
information about indexing, see “Indexing Dos and
Don’ts,” January 2003, InstantDoc ID 27334.) And
if you make each miscellaneous attribute a separate
dimension, it will most likely result in a complicated
dimensional design. For example, a star schema for
a shop floor or manufacturing activity would include
the standard dimensions of people (e.g., customers,
vendors, employees), time, location, and inventory or
product, which is a tight, straightforward design for
a four-dimensional cube. But once you start incorporating
10 or 20 additional dimensions, each of which
is a yes/no attribute, a status field, or an open-ended
comment, you’re looking at a much more complicated
star schema and associated cube.
Determining the Value of
Miscellaneous Attributes
You can determine which miscellaneous attributes
have business value through a process called discovery.
You have to ask the right people the right questions to determine if these miscellaneous flag and text attributes
are truly useful, and you can’t be satisfied with only
one person’s opinion. For example, someone in the
production department might consider the open-ended
comments to be truly important, while someone in the
sales department might not think so. Your discovery
process has to be all-encompassing and thorough.
You must understand how the data in transactional
databases is used and for whom it has value before you
can determine if the miscellaneous attributes should
be retained or discarded. Here are my suggestions for
how to handle flag or comment attributes that have
business value.
Handling Flag Fields and
Yes/No Attributes
I’m sure you’ve seen many examples of flag fields and
yes/no data that have no documentation regarding how
they’re used. This scenario is especially common in
legacy systems and databases that were created without
solid, underlying design principles. Column names such
as Completed, Packed, Shipped, Received, Delivered,
and Returned (each with yes/no data values) are very
common, and they do have business value. Instead of
discarding flag fields and yes/no attributes, I suggest
placing them all into a junk dimension that’s organized
as shown in Figure 1.
The junk dimension shown in Figure 1 represents
an order-fulfillment system; the column headers
show some of the possible statuses an item that has
been ordered can have. Row 1 indicates that the item
ordered has been picked out of the warehouse, packed
for shipment, shipped, delivered, received, returned
for a refund, and restocked in the warehouse. Row
9 shows an item on order that’s waiting to begin
the order-fulfillment process. The rows in between indicate items that are in various stages of the orderfulfillment
process. This example is very simple
because the process is so linear and sequential, and
NULL conditions aren’t allowed in this transactional
database. As in any dimensional design, each of the
rows in the fact table will be associated with a row in
this junk dimension. Even if your set of miscellaneous
attributes isn’t as sequential or you have to create a
dimension that contains all possible combinations
of yes and no, there will still be only 256 records in
the entire dimension. As with all dimensional design,
don’t forget to add an identity column to the junk
dimension as a primary key and to include the junk
key column in the fact table as a foreign key.
Handling Comment and
Open-ended Text Attributes
You can handle comment data and free-form, openended
text fields by creating a special text-based junk
dimension. If the verbiage in these fields is potentially
valuable, create the text-based junk dimension with
two columns, the key column and the text column, as
shown in Figure 2. Include the text junk key column
in the fact table as a foreign key, and don’t forget to
add a “no comment” record to the text-based junk
dimension for those facts that have no associated text.
Most likely, these text fields in the source systems
will be used only sporadically, so the text-based junk
dimension will be much smaller than its associated
fact table. Figure 2 also shows how the yes/no attribute
junk dimension described earlier will relate to
the fact table.
Keeping Your Data Warehouse
Design Simple
You want to keep the data warehouse design as simple
and straightforward as possible, so that users will be
able to access data easily. Miscellaneous attributes
that contain business value are a challenge to include
in your data warehouse design because they don’t fit
neatly into conventional dimensions, and if improperly
handled, can cause the data warehouse to swell in size
and perform suboptimally. By placing miscellaneous
attributes into junk dimensions, you can circumvent
both of these problems.