Executive Summary:
|
Knowing the basics of data warehousing and dimensions helps you design a better data warehouse that fits your reporting needs. This primer on data warehousing dimensions explains the importance of dimensions and dimension granularity and stresses the importance of flattening hierarchies—with the goal being to make data more accessible and useful to users.
|
I covered the basics of the fact table and how it
fits into the dimensional model in "Data Warehousing:
Measuring the Facts," September 2007,
InstantDoc ID 96336. Now let's address the fundamental
design of the dimension - the "arms" of
the star schema. Dimensions are tables that contain
textual explanations of the coded and key values
that are present in the star schema's fact table. The
topic of dimensional modeling encompasses a large
body of work, so if I miss something that you think
should have been included, it's probably for lack of
space and not for lack of interest or importance. You
can contact me directly or join the forum discussion
(go to http://www.sqlmag.com/go/dbdesign). That
way the topic will get on my editorial calendar.
Defining Dimensions
Dimensions are a common way of analyzing data.
In the article above and in "Discover the Star
Schema," July 2007, InstantDoc ID 96112, I implied
that the dimensional model is user-driven; now I'm
going to say it loud and clear: The dimensional
model is laid out for ease of use and to be logical and
understandable to business users. Dimensions
define the dimensional model and spell out
the who, what, where, when, why, and how of
the situation. They add context and meaning
to the measures contained in the associated
fact table.
In a dimension table, there is one row per
product, one row per customer, one row per facility,
one row per time unit, or one row per geographic region, and the granularity or level of detail of
these rows depends on how finely detailed the business
user wants the data. For instance, is it enough
to know the average dollar value of each sale? Or
do you need to know the average dollar value of a
sale for each day of the year? Do you need to know
how much of the sale value was for perishable itemsversus nonperishable items? Do you need to know
purchases down to the line item level?
It's very important to understand what the business
reporting needs are before you start designing a
data warehouse. You simply cannot be a cowboy and
start slinging data around with data warehousing
and dimensional design - there's too much at stake:
Too much data is involved in a data warehouse, the
project visibility is too high, and it's too costly to
undo careless mistakes. You must plan. You must
know how granular the level of detail needs to be.
Structuring Dimensions
At this point, if you're thinking "cube," think about
the flat surfaces of the cube - those are the dimensions.
Dimensions control query filtering - for
example, "...where region like West" in T-SQL or
"WHERE (Region.West)" in MDX - and supply
almost all labeling in the output result sets that are
turned into reports. Dimensions provide the "by"
words for a report - "sales by month by product,"
"inventory by category by region." When you think
about it, dimensions are the natural way a business
user would talk about business.
Each dimension contains data from a single
domain, such as the time, product, or geographic
area domain. Dimensions act like decoders when
you "flatten" the lookup table hierarchy, as Figure
1 shows. If the three tables on the left were part
of an entity relationship diagram (ERD) for an
online transaction processing (OLTP) database,
Product_Category and Product_Subcategory would
be lookup tables in a hierarchy, with Product_Category
higher in the hierarchy than Product_Subcategory.
The purpose of a lookup table is twofold: to
enforce domain integrity by limiting the list of codes
that can be used in the modified table, and to help
reduce data redundancy by adhering to a production
level of third normal form.
In a dimensional model, you'd
consider flattening the hierarchy of
products, categories, and subcategories,
as you can see in the Product_
With_Category table. Each row
in the Product_With_Category
dimension contains all the category
content that's relevant for that
row, in addition to all the product
information. You don't need the
one-to-many relationships of an
ERD because you shouldn't have
to worry about domain integrity.
The data has already gone through
the extraction, transformation, and
loading (ETL) process and has
(presumably) been scrubbed and
validated before loading into the
data warehouse. But what about
redundancy? With a data warehouse,
the goal is not to reduce redundancy - data
warehousing involves a different mindset from the
transactional, operational database. If you're going
to be building multidimensional cubes, you're better
off flattening the hierarchies than not.
Flattening hierarchies also makes the join paths
between the dimensions and their associated fact
tables much simpler, resulting in better performance
and faster query results (we hope). The one-to-many
relationships between the dimension and its associated
fact table, illustrated in Figure 2, page 34, aren't meant
to protect data integrity. These relationships are meant
to associate keys in the fact table with the expandeddefinitions, which are found in the dimensions.
In any database there are hierarchies. Geographic
hierarchies occur where zip codes aggregate into
counties, counties into states, states into countries.
With temporal hierarchies, hours aggregate into
days, days into weeks and months, months into
quarters, and quarters into years. In spatial hierarchies,
rooms aggregate into buildings, and buildings
into campuses or city blocks, and so on.
Hierarchies such as these and the product
category-subcategory hierarchy in Figures 1 and
2 are used in reporting and summarizing results.
Depending on what information users require, they
might use the hierarchies differently from other
users. One of the most common scenarios is the
dichotomy between marketing and manufacturing:
What manufacturing considers one product might
be many products to marketing, as when hospital
scrubs were co-opted by young adults as the latest
fashion statement. When generating reports, manufacturing
might not want to use product subcategories,
whereas marketing would.
The flattening example in Figure 1 is the type of denormalization you'll often see in a dimensional
model. We denormalize to avoid an excessive
number of join operations. In future columns I'll
examine dimensional denormalization more closely,
but for now, just remember to denormalize in such
a way that you lose nothing except the OLTP structural
complexity.
Dimensional Granularity
In my September article cited above, I explained
that fact tables have "grain," or levels of detail. It
turns out that dimensions also have grain, and the granularity of a dimension depends on the reporting
requirements - just like the fact table's granularity.
Unlike any other database modeling scheme, dimensional
modeling is truly customer-driven.
How fine-grained do you want your dimensions
to be? Obviously, the level of detail of the dimension
table has to be consistent with the level of detail of
the fact table that it modifies. If you have a fact table
that tells you that customer A bought product B at
store C in region F on date D and paid for the purchase
using E, and that the sale was credited to sales
associate G, then all those dimensions, A through
G, had better be able to define and describe the A
through G identifiers in the fact table.
One school of thought holds that a data warehouse
should contain only summary data and that all
available detail data should be stored in the associated
OLTP source databases. That approach might work
in some cases. However, I believe that fact tables and
dimensions are an archive of historical activity and
should contain the finest level of detail that you'll ever
need for a report. One of the largest data warehouses
in the world belongs to Wal-Mart - it's measured in
petabytes (thousands of terabytes) and holds one
and a half years of data at the individual sale level.
If you need to be able to analyze individual sales
in your organization, you'll need the finest level of
granularity. It's better to create a data warehouse that stores data with a fine level of detail and not need that
detail, than to start with summarized data and then
realize you need more detail.
Understandability and
Performance
The twin properties of a data warehouse are understandability
and performance. By flattening hierarchies and not losing any data in the process, you make
the dimension more understandable to the people
who use the data. What can you do to enhance performance?
You guessed it - flatten the hierarchies!
One of the most common user activities in a
dimensional model is drilling down to get more
details. Drilling down means adding more row
headers to the result set. Drilling up simply takes
away those row headers. You could drill down
through the product category into a specific product
subcategory until you found all the products in that
subcategory, and never leave the Product_With_Category
dimension.
The Sales_Territory dimension in Figure 2 is
an excellent example of a dimension suited for
drilldown. You can start with the Americas (zone),
move on to the United States, Western region, Colorado,
Denver County, 80208, and do all that with a
single table, meaning no joins (at least not until you
want some sales facts that relate to zip code 80208).
Sales_Territory is a richly defined geographic dimension
with a flattened hierarchy.
Flattening hierarchies is the same as eliminating
snowflaking in the star schema. Hierarchies
expressed as separate tables (such as the three tables
on the left side of Figure 1) create snowflakes,
which complicate the user presentation and often
intimidate users, causing them to shy away from
your data warehouse. Snowflaking (normalizing the
hierarchy) causes most data warehouse applications
to run relatively slowly because of all the joins in the
background, lowering the quality of the user experience.
The savings on disk space that you achieve
with snowflaking isn't worth it. For more discussion
about snowflaking and disk space, see the sidebar
"Are You Really Saving Anything By Normalizing?"
page 33.) And finally, snowflaking defeats
the purpose of bitmap indexing, which SQL Server
Analysis Services uses internally to optimize query
response. To learn more about bitmap indexing and
when it's used, see the Web sidebar "What's a Bitmap
Index?" (http://sqlmag.com/Article/ArticleID/96848/sql_server_96848.html).
Dimensions Illuminate Facts
When you analyze data in a data warehouse, you
start from a dimension. Dimensions add meaning
and context to the fact table in a dimensional
model and make the data accessible in an intuitive
way. Because there's so much more to say about
dimensions, I'll continue this discussion in a future
article. Until then, you can go to the Forum link at
the beginning of this article to communicate with
me and other readers about the fascinating topic of
dimensions and data warehousing.