The file system's organization is naturally hierarchical, and I copied this structure in the FileDir dimension table. The structure is unbalanced because not all branches of the hierarchy contain the same number of levels. Because of this imbalance, I used a parent-child dimension type for the Files dimension. Parent-child dimension tables use a self-referencing foreign key to describe the relationships between parent and child dimension records. In nonparent-child dimension tables, columns represent dimension levels. The Extensions dimension doesn't have a natural hierarchy, so I used a flat, one-level dimension with a second-level All member for it.
Next, I needed to decide how to load the desired measures (i.e., Age, Size, and Count) from the fact table. This process can be tricky because multiple records might represent only one file in the fact table. Although each of the multiple records represents the same file, the size and age might differ between records because each record is from a different snapshot. To simplify the process, try imagining how the aggregations in the time hierarchy need to work. Figure 3 shows a simple representation. If the fact table contains a snapshot of a file's age and size on each day of the week, how do you determine totals for the week? One answer is to use the age and size values for the last snapshot in the time period. Similarly, in inventory analysis, you always roll up values (e.g., units, dollars) from the last day of a time period.
Analysis Services offers no native aggregation types that support this "inventory"-style aggregation, so I set up a custom aggregation. I could have used calculated members or calculated cells to solve this custom aggregation problem, but I decided that using custom rollup formulas was the most straightforward approach. Custom rollup formulas let you control how Analysis Services aggregates values for individual dimension levels. In this case, I wanted to override the Time dimension's aggregations for all but the lowest hierarchy level. The formula I wanted to use is Time.CurrentMember.LastChild.
The Age measure requires some extra work because it's aggregated as an average, for which Analysis Services also offers no native aggregation type. To determine the average, you can combine Analysis Services' built-in Sum and Count aggregation types. I already needed a Count measure, so half of my problem was solved. But I needed to create in Analysis Manager another measurea Sum of the Age values. The average Age measure then becomes a calculated measure that divides the Age Sum by the Count. Because the Age Sum alone doesn't make sense, I set its Visibility property to false, which hides it from client applications. Hiding measures that store intermediate calculations is a common technique in Analysis Services applications.
Now that I'd designed the star schema and the cube, my final step was to design a process to load the star schema with records. Because this application was a side project rather than a production application, I tried the easiest method I knewloading the list of files into Microsoft Excel for manipulation. However, the first snapshot exceeded Excel's capacity. I was amazed to discover that I have 90,000 files on my C partition. I ended up writing a small program that scans the hard disk and creates a set of Comma Separated Value (CSV) files that match the star schema format that Figure 2 shows. When you run the program a second time, it reads in the comma-delimited dimension tables and adds new fact table records that reference previously created dimension table records.
After the program runs, you can use the comma-delimited files to build the cube. I imported the comma-delimited files into Access tables, then built the cube from the tables. As an alternative, you can obtain an OLE DB driver that supports text files. If you have this driver, you can build the cube directly from the CSV files.
If you're interested in using this application on your hard disk, you can download the .zip file that's available online at http://www.sqlmag.com (see "More on the Web" for download instructions). The diskdata.exe program will create the CSV files in the same directory to which you download the .zip file.
Gauging Usefulness
Was the resulting cube useful? While browsing the cube, I learned several things about my hard disk that I never knew before. For example, I discovered that a third (roughly 30,000) of all the files on my hard disk resided in the Internet Cache directory. (I now regularly flush that browser cache.) Although I haven't seen enough snapshots to spot a trend, I expect to see a continual growth path in the \system and \system32 directories.
The number of possible analysis applications is limitless. Business analysis and other analysis applications probably have a bigger potential market than Excel does. Whether analysis technologies grow to meet a fraction of the number of possible applications in the next 5 to 10 years is anyone's guess. I hope this file-analysis application leads you to imagine other possible OLAP applications that can give you a deeper understanding of your activities and help your business be more successful. For another example of a file-analysis application that you can build, see "August MDX Puzzle." (For the answer to last month's puzzle, see "July MDX Puzzle Solution Revealed.")