DOWNLOAD THE CODE:
Download the Code 8030.zip

Here's how to incrementally update cubes—manually, programmatically, and with DTS

The capacity for an OLAP cube to store and organize colossal quantities of data is the attribute that makes OLAP more valuable for reporting than a standard relational database. Queries that take 20 minutes to return results in a relational database can return the same results in less than 1 minute when the system uses previously calculated aggregate tables. But if you have a cube containing more than a year's worth of sales measures, which are represented by more than 20 million records, adding new records to the sales fact table and processing the data cube can take several hours. It seems strangely inefficient to have to reprocess huge stores of old data that haven't changed just so you can incorporate a few records to the cube. You need a way to add those records to the existing cube in the few minutes it takes to process only the new records.

To make matters more complicated, let's say you need to add daily sales records to the cube automatically in a batch program every night. You need to incrementally and automatically update the cubes. This capability is critical to the successful implementation of an enterprise-level cube, yet it is one of the most under-documented and most easily misunderstood processes of OLAP Services. Using the example in which daily sales records are added to the cube, I'll show you how to set up the relational tables to facilitate incremental updates. Then I'll show how to manually update cubes incrementally and how to programmatically update cubes through the Decision Support Objects (DSO) COM interface. Finally, I'll explain how to do incremental updates with Data Transformation Systems (DTS).

Setting Up the Relational Tables
OLAP Services' ability to identify records in the relational tables that haven't yet been entered into the cube makes the incremental update possible. Without this ability, OLAP Services couldn't differentiate the new data from the old data. When you apply an incremental update without this step, all the cube's measures are counted twice because OLAP Services sums the new measures with existing ones.

To successfully implement incremental updates, OLAP Services needs to compare elements in the fact tables to data in the OLAP cube in a way that excludes data already in the cube. To demonstrate this method, I'll use a unique incrementing field in the fact table that will be aggregated with the MAX() column function. By using MAX(), you can write a simple MDX query that will return the largest ID number stored inside the cube. Any record containing a larger ID number in the underlying relational tables is a candidate for inclusion the next time the cube is processed. If your fact tables are populated with batch programs that can supply a unique batch number for every set of inserted records, then you could use that batch number instead. To use the following techniques, you just need to ensure that the records in the fact table contain a sequential numeric column that can be stored in the cube as a measure.

Adding a measure in the cube only because you need to manage incremental updates internally might not seem to be a best-practices strategy. But I find that if you don't add a measure, the relational tables will continually need to keep track of which records in the fact table were used to populate the cube. I explored two methods of using only the relational tables to differentiate the new records in the fact table from the ones already used to populate the cube. You could establish one fact table per partition, and write lengthy procedures that ensure the fact tables are populated correctly, but then you would need to do a full refresh on the individual partitions. Another, perhaps better, option is to represent the fact table with a view that contains a WHERE clause that limits the rows returned to those rows that are not present in the cube. You could set up this view by maintaining a separate ID tracking table that contains the last ID sent to the cube. Each time users updated the cube, they would also have to update the table to reflect the last ID sent to the cube. The view's results would be based in part on the last ID on that table. The problem with this method is that it's nearly impossible to ensure that the cube update succeeded before you update the ID management table on the relational database. Another problem is that the cube could get updated properly but the ID management table update could fail, which means that OLAP Services might repopulate the cube with existing data. I've found no acceptable way to wrap two, separate processes around a common transaction when one occurs in SQL Server and the other within OLAP Services. To ensure that even interrupted transactions leave both the OLAP cube and the underlying relational tables in a consistent state, you need to let the cube store the information about the records it contains compared to the records in the fact tables. You can best accomplish that by using the previously mentioned identity key as the measure.

Incrementally Updating a Cube Through the OLAP Manager
A cube is a structure that contains a hierarchy of smaller structures called partitions, which contain portions of aggregated data. Incremental updates need to interact with the cube at the partition level. Partitions are often described as slices of the cube because they represent parallel subsets of data. For example, you can divide a sales cube into partitions, each containing sales data for a quarter of the year. But because these partitions are part of the same cube, client queries don't have to account for this configuration. A cube comprises at least one partition, and might comprise tens or hundreds of partitions to optimize data access and processing. The key to successfully implementing multiple partitioned cubes is ensuring that each partition contains only the data pertaining to that partition. Otherwise, the cube might return incorrect data.

You can use one of two methods to ensure that multiple partitions never contain overlapping data. One method is to associate a partition with a separate fact table or a view that contains only the subset of data needed to populate that partition. This is a good method when you're creating a new partition for new data and you're designating the current partition to hold only archived data. With this method, the cube needs to reprocess only the new partition to account for any new data coming in. The second method is to place a filter on the partition, which means that the OLAP engine will process only the data it was meant to contain even though the engine uses the same fact table each time it processes the data. The latter method is most useful when you're doing incremental updates.

For the daily sales update example, you need to place a filter that returns only those records that have an ID value greater than the largest ID value in the cube. You can find this value in the cube's data browser. Because in this example, OLAP Services uses the MAX() function to aggregate the ID field, you can find the highest ID value by taking the highest-level aggregate of all the dimensions, which is the default level shown when you open the OLAP Manager Cube Editor, as Screen 1 shows. To place a filter on the current partition, select the cube you will be using and expand the tree under it to expose several subfolders, including the Partitions subfolder. When you click on Partitions, you'll see a list of the cube's partitions. Right-click on the partition that, by default, has the same name as the cube. From the dialog box, select Edit, and type in the WHERE clause, as Screen 2 shows.

From the Process a Cube dialog box, select Incremental update, as Screen 3 shows. When you select this option, OLAP Services will add to the partition only the records that match the filter. Note that if a user selects a full reprocess or a refresh while the filter is still in place, then the partition will be emptied of all records and OLAP Services will add to the cube only the records that correspond to the filter. So if you increment data manually, make sure you remove the filter from the partition after OLAP Services processes the cube.

Automatic Incremental Updates
Although you can use manual cube updates for occasional data refreshes, cubes often need to reflect daily or hourly data, which requires the ability to automate incremental updates. This process isn't difficult if you understand DSO and the details of cube structure. Here's some basic information: DSO is a COM interface that comes with OLAP Manager. In theory, you can access DSO from almost any scripting language you choose, such as VBScript, JavaScript, or Perl. But Microsoft developed DSO by using VB interfaces that require strongly typed variables, so using widely available scripting languages is difficult, requiring you to use many undocumented method calls and properties that Microsoft might not support in the future. For this reason, I recommend simply using VB to create an ActiveX DLL that encapsulates the functionality you need to manipulate the cube. Then, you can use scripting to make calls to this DLL.

You can learn more about DSO's basic structure by walking through portions of VB code, described below, that you need to accomplish each step of the tasks. Note that for simplicity, I've omitted error handling and hard-coded values in these examples.

The topmost object is the server, which you instantiate in the following manner :

Dim dsoServer as DSO.Server
Set dsoServer = new DSO.Server
   Prev. page   [1] 2 3     next page



You must log on before posting a comment.

If you don't have a username & password, please register now.

Reader Comments

I want to commend the author on the quality of his writing and the usefulness of the article.

John Harrison

Great article. I just didn't understand at what poing we are updating BATCHDATE field in SALESID table when creating DTS package. Thanks a lot. Marina. mmaziya@partners.org

Anonymous User

Article Rating 5 out of 5

The article is excellent. Thank you very much for publishing it.I have implemented this and it works perfect. Marina - I have put getdate() default value for BATCHDATE field and it always updates the filed with the date and time when the cube has been processed for the last time.

Anonymous User

Article Rating 5 out of 5

Extreme useful article.

spalding

Article Rating 5 out of 5

 
 

ADS BY GOOGLE