• subscribe
June 16, 2009 12:00 AM

Lessons Learned from Poor Data Warehouse Design

These business and design best practices can prevent costly data warehouse mistakes
SQL Server Pro
InstantDoc ID #102119

To avoid these islands of influence, get your strategy map in place. A strategy map links the long-term goals and objectives of an enterprise with its operational activities, illustrating the cause-and-effect relationship between different Key Performance Indicators (KPIs) that you’ll see on a balanced scorecard and possibly on dashboards. Then, make reports as visual as possible—a picture is truly worth 1000 words.

Reduce Operational Costs
It’s important to make your data warehouse environment easy and cost-effective to support and maintain. Design implementation costs can quickly get out of control. It’s not unusual to have an 80 percent/20 percent division in cost between the back-end ETL and table building processes and the front-end applications and reports. Reduce design implementation costs by creating a set of reusable designs and repeatable ETL processes so that you don’t have to start each data warehouse/BI project with a clean slate. Simplify the design by conforming dimensions and standardizing the granularity of the fact tables wherever possible; do not over-engineer the tables. Overly complex designs can quickly drive costs to the point where the entire project fails.

Make Your Data Warehouse Scalable
Your business is going to change over time, so you need to make sure your data warehouse/BI environment can, too. According to the Gartner Group (Gartner RAS Core Research Note G00161231, James Richardson & Bill Hostmann, 12 September 2008), during the first year of a data warehouse/BI implementation, you should expect users to request changes that will affect 35 percent to 50 percent of the application functions. Anticipating and managing change will be one of your biggest challenges. A good change-enhancement request system should be at the top of your wish-list so that you can better manage the changes that are inevitable with a data warehouse/BI implementation.

In addition, hardware and software costs need to be balanced with scalability. The enterprise will change and so will your data warehouse/BI environment. If the data warehouse isn’t a core competency or the primary line of business, then you should seriously consider running your data warehouse in the cloud. At the time of this writing, Microsoft doesn’t have a viable cloud computing solution for large systems, but there are other options available, including GoGrid, Amazon/Pentaho, and VMware’s newly-released vSphere 4 cloud OS. With the right choice of cloud vendor, you can end up hiring expertise in business continuity, system availability and security, data storage and archiving, and scaling the data warehouse/BI environment as needed, all for a pay-as-you-go price that generally equates to one or two magnitudes decrease in implementation and maintenance costs compared with doing it yourself. Your time-to-terabyte will be measured in days rather than weeks or months.

Comply with Internal and External Standards
Having an enterprise overview of how and where data is used makes figuring out if you’re in compliance much easier. Compliance is another instance in which MDM becomes a “must-have” component in your organization. Although complying with both internal and external standards seems like a big headache prior to a security breach or event, you’ll be glad you did in the event of an attack. Every organization has de-centralized data management, so I recommend using compliance regulations such as Sarbanes-Oxley to level the playing field. An outside influencer can often do more to predicate cooperation than internal directives. Measuring compliance can be a tough and time-consuming job, so you might want to look into risk management software such as RiskWatch, which was originally built to track HIPAA compliance but now includes features that evaluate total enterprise risk.

Data Warehouse Design Tips
It’s not possible to describe or even list all the technical design issues that one could encounter when building a data warehouse. However, the following are some data warehouse design tips that can help you minimize problems in your data warehouse:

  • Don’t confuse measures with context. Measurements (measures) are usually numeric, can be aggregated, and are stored in fact tables, whereas context is descriptive and open-ended. Divide the context into logical groupings such as customers, store locations, and time, and store them as dimensions. (For more information about measures, see "Data Warehousing: Measuring the Facts," September 2007, InstantDoc ID 96336.)
  • Define the grain of the fact table. To do so, start by determining exactly what a fact table represents: For example, is it an individual insurance policy transaction or a weekly inventory count? Define the grain of a fact table in terms of a clear business objective and a set of business rules. Don’t be afraid to store too fine a level of detail if you need to because although you can always aggregate the details, you can’t extrapolate details from summarized data.
  • Don’t snowflake when a simple star schema will do. The purpose of data warehouse design isn’t to normalize the data (as you attempt to do when you snowflake) but rather to organize the data in such a way to be intuitively meaningful to end users. That often means you need to de-normalize the dimensions. (You can learn more about dimensions by reading "Data Warehousing: Dimension Basics," October 2007, InstantDoc ID 96846.)
  • Partition fact tables on a date key. Partitioning fact tables on a date key makes data maintenance much easier. If you need to delete or archive historical data you can “retire” the oldest partitions. Index maintenance is faster if the data is partitioned, as is data retrieval because SQL Server has to look only in relevant partitions for data. In addition, always align the partitioned indexes by using the same partition schema that you used to partition the base table. If you partitioned the base table on a date column, then index the date column and use the same partition scheme for the index. Keep empty partitions at both ends of the table so that it’s easy to add and remove partitions, and choose the partition grain carefully because there’s a maximum of 1000 partitions in SQL Server 2008 and 2005. (To find out more about partitioning fact tables, see “Data Warehousing: Horizontally Partitioning the Fact Table,” April 2008, InstantDoc ID 98007.)
  • Cluster fact tables on a date key. Doing so lets you efficiently retrieve historical slices of data and supports queries that are used to populate cubes—most of which have a major time component. (For more information, see “Indexing the Data Warehouse,” August 2008, InstantDoc ID 99330.)
  • Query the star schema using WHERE on the partitioning key. For instance, if the fact table (Sales) has been partitioned by date (dateKEY), your query would be look like
    SELECT… WHERE Sales.dateKEY between ‘2009-01-01’ and ‘2009-01-31’

    Because SQL Server 2008 and 2005 Developer Edition and Enterprise Edition are partition-aware, you’ll want to use literals, not variables, to point the query analyzer to the correct partition and to optimize SQL Server 2005’s idiosyncrasies regarding multi-partition query scans.

Data Warehouse Design Best Practices
There are so many lessons still to be learned about designing data warehouses. From these lessons we glean both business and technical best practices. As you discover new, and possibly better, ways to perform these tasks, spread the wealth by sharing your solutions with your peers.



ARTICLE TOOLS

Comments
  • Johan
    3 years ago
    Aug 18, 2009

    When reading this article i disagree about the data-quality. A datawarehouse should capture 'as-is' data which relates and is auditable to the source.

    The quality system, in my opinion, is a subsystem which relates back to every record inserted in the DWH. I tend to use the following in practice:
    - use KPI's on reports which indicate the quality
    - statistics about anomaly detection and other things reside in the subsystem and relate back to the DWH data at granular level: you're able to alert people on quality issues but never block it but correct afterwards instead via the source system

    In this way DWH and source remain synchronized.

    I think the way it is described in this article promotes the DWH as a cleaner system which bumps bad data: because people aren't wired in a natural way to the DWH , using the DWH in such a way is much too ambituous, and a true career killer for the DWH Administrator.

You must log on before posting a comment.

Are you a new visitor? Register Here