Leverage repository technology to give users the best business information possible
Editor's Note: "Using the Microsoft Repository" and "The Open Information Model," page 42, by Patrick Cross and Saeed Rahimi, and "Incremental Updates in OLAP Cubes," page 47, by Claude Seidman, continue our series on data warehousing and Incremental Updates.
In recent years, data warehousing best practices have included the use of a repository to store information about the warehouse. The repository information helps people assess the impact of changes, aids in tracking down problems, and helps warehouse users better understand the data they use to make key business decisions. The Microsoft Repository is an important part of Microsoft's data warehousing strategy. By following these examples, you'll understand and use the repository more effectively.
What Is a Repository?
A repository is a storage place for meta data, or data about data. Several vendors have created repository products in the past 20 years, starting with data dictionaries in the 1960s. More recent products have evolved from these simple dictionaries into complex systems for storing data from hundreds of applications and environments. Most recently, Microsoft included its repository technology as part of its data-warehousing framework (a collection of tools provided with SQL Server), in recognition of the important role repositories play in warehousing. Microsoft uses Extensible Markup Language (XML) to integrate support for exchanging information in the repository. XML provides a simple way for tools to interchange meta data. Interest in repositories has recently surged, primarily because of the emergence of data warehousing, knowledge management, and enterprise application integration.
What Is Meta Data?
According to Bill Inmon, meta data is "the description of the structure, content, keys, indexes, etc. of data," (Managing the Data Warehouse, John Wiley & Sons, 1996). More specifically, in a data-warehouse environment, meta data can be information about warehouse data, information about how to get a piece of data out of the warehouse, or information about the quality of data in the warehouse. Meta data can even give information about how to run warehouse tools to perform different tasks. Meta data about a warehouse includes information about systems, processes, source and target databases, data transformations, data cleansing, data access, data marts, and OLAP tools.
For example, a warehouse's meta data can tell you how an OLAP tool calculated a report's Total Sales column. Without knowing anything else about the warehouse, a user looking at a regional sales report might think that Total Sales includes all discounts and shipping and handling charges. But Total Sales might not include any or all of these elements. To help the end user, the warehouse needs meta data that ties the reports' columns to data transformations, data queries, field calculations, and source database tables and columns. The warehouse can provide this help if it has meta data about the source databases (such as tables and columns), what information was extracted, how the information was transformed, the target (or warehouse) database, data warehouse reporting tools, and database design and modeling tools.
Many warehouse implementations today use a simple spreadsheet to capture the source and target mappings and conversions. But beyond the project's initial requirements phase, this information quickly becomes outdated and inaccessible to most users who interact with the warehouse.
Why Is Meta Data Important?
Meta data acts as a road map to the information in the warehouse. It helps users find out what type of data is in the warehouse, what a piece of data means, how to access the data, who owns the data, and who created it and when and how. Without meta data, data warehouse users can access data but not information in context that helps them make business decisions with confidence. Currently, the best practices for warehouse and data-mart development mandate having a meta data strategy that makes the warehouse easy to update and use.
Meta data can provide complementary information about the warehouse's contents. For example, an analyst can use meta data to better understand what information is available and how it is calculated. Meta data provides a detailed analysis of where the information came from and can give a confidence factor for describing the data's validity.
Meta data can also maintain a history of activity before the current processes. This history is important in understanding how the structure and processes change over time. If a user wants to compare sales by product category during the past 2 years, but the product category was added this year, the results of the comparison are meaningless.
Also, meta data provides a common frame of reference, acting as the organization's central knowledge base. For example, everyone who looks at the regional annual sales report can arrive at the same understanding of what Total Sales means.
The ability to analyze the impact of changes in the warehouse can drastically reduce the cost of ongoing maintenance. If you need to change a source table, you can simply look at the table's other columns to determine which parts of the warehouse use those columns. With this usage information, you can make a quick list of items that you'll need to modify to accommodate the source system changes. Likewise, you can more easily fulfill requests for additional information in the warehouse if you can see where information is already coming from. Before adding new data to a table, review all the data sources for the table.
Meta Data Services
Repository technology is the core of the services that SQL Server provides for managing meta data. These services' features bring significant value to information consumers and people who maintain the repository data. The most important features are information models, the Microsoft Repository engine, XML interchange, and extensibility.
Information models. Information models define the meta data in terms of object types and their relationships. The information model is thus the language for describing the information the repository will store. Microsoft's data-warehousing consortium (for more information, see "The Open Information Model," page 42) chose the Unified Modeling Language (UML) for documenting and communicating the information model the Microsoft Repository uses. Although users can define their own information models to describe meta data, Microsoft created the Open Information Model (OIM) as a common specification for storing information about systems, warehouses, etc.
Repository engine. The Microsoft Repository engine handles meta data storage and retrieval. The engine uses the information model and stores meta data instances that the information model describes in SQL Server tables and columns. And the engine uses caching to optimize access to the meta data. Versioning is an important feature of the Microsoft Repository, and is fairly new to repositories generally. Versioning means that as updates happen to objects in the repository, those changes are captured, and the repository maintains a history of how the repository data changed. By letting you view objects as they're updated, the repository can answer questions about changes over time. For example, if new columns are added to the warehouse, you can track and query this addition later. This ability helps you solve problems related to warehouse alterations that cause sudden changes to data the users receive.
XML interchange. Besides describing the data types that the repository can store, the information model includes an XML format for interchanging meta data between the repository and other tools. And users can easily import information into the repository from custom sources or other tools. One example in "Sample Repository Applications" (which you can download at the link to this article at http://www.sqlmag.com) contains a sample XML file that you can use to load data into the repository.