DOWNLOAD THE CODE:
Download the Code 21500.zip

Use this analysis application to determine how you filled your hard disk

Most users who know about analysis applications recognize the power and value of analysis but don't realize the breadth of situations to which analysis applications are applicable. Probably 80 percent of OLAP product demonstrations use sales data. The sample database that SQL Server 2000 Analysis Services offers, FoodMart 2000, is a case in point. Demonstrations use sales information frequently because most people can relate to sales as a common business problem—not because of any OLAP limitation. In my July 2001 column, "Tracking Performance," and in this column, I discuss two distinct types of analysis applications that demonstrate the vastly different problems that you can use OLAP technology to analyze. Last month, I wrote about an analysis application for software development. This month, I describe an application that's relevant to anyone who uses a computer—monitoring disk utilization.

If you're like me, your computer's hard disk is nearly full. Occasionally, I clean out directories that contain unused files and uninstall infrequently used programs, but each time I perform such maintenance, I free up less disk space than the time before. To help solve my problem, I wrote a small program that scans my hard disk, collects information about files and subdirectories, and stores this information in a set of comma-delimited tables in a star schema format. Next, I used Microsoft Access to import this data to Access tables. Then, I used Analysis Services' Analysis Manager to create an OLAP cube. Figure 1 shows this process.

Design and Build
Before you start any design process, you need to identify your objectives. An easy way to lay out an analysis application's objectives is to list a sampling of the questions that the analyst wants to answer. With such a list in mind, you can work backward to determine a flexible cube structure that will answer these questions. Then, you can take the next step, which is to determine the structure of a star schema that supports the building of this cube. In my disk-utilization problem, I wanted to answer the following questions:

  1. Which directories utilize the most disk space?
  2. What's the average age of the files in a directory?
  3. How many files reside in a directory (and all levels below the directory)?
  4. What's a particular directory's growth trend over time?
  5. Which file types (i.e., extensions) consume the most space?

To answer these questions, the cube needs three measures:

  1. Size—consumed disk space
  2. Age—average file age, or number of days since modification
  3. Count—number of files

I identified four useful dimensions (including the Measures dimension):

  1. Files—a dimension containing the disk's file structure
  2. Extensions—a dimension containing all the disk's possible file extensions
  3. Time—a dimension containing all the time periods during which the software took snapshots of the file system
  4. Measures—a dimension containing the cube's measures

The two most important dimensions in a cube are Time and Measures. These dimensions are essential because many of the functions in Analysis Services' MDX query language use these dimensions. If you don't structure these dimensions correctly, you'll eliminate at least half of MDX's usefulness. Of course, poorly selected measures will make a cube useless because measures determine every numeric value in a cube.

For this hard disk application, you can set up the Time dimension in several ways. The most obvious method is to use either the file's creation date or its modification date. By using one of these dates, you can answer questions such as, How much space do files created (or modified) during this period consume? That question is mildly interesting but not as useful as the ability to see growth trends over time. Growth trends over time show you at which locations disk activity occurred during a certain time period. For example, you might discover that a particular directory tripled in size last May. You could then use the Extensions dimension to determine which types of files grew the most in that directory. Comparing the various measures would tell you whether the growth occurred because of files increasing in number or in size.

Now that you've determined the cube's structure, you can design a star schema from which to load it. The star schema must contain all the files and directories with their sizes and file extensions. To let you compare multiple snapshots of the file system over time, the star schema must also contain multiple snapshots of the files. Figure 2 shows the structure I chose to support these requirements. The fact table (FileFact) contains a record for each file and directory, specific to each snapshot date. The structure includes dimension tables for time (FileTime), files (FileDir), and file extensions (FileExt).

   Prev. page   [1] 2     next page



You must log on before posting a comment.

If you don't have a username & password, please register now.

Reader Comments

An interesting article, but not everybody stores their data on the c: drive! The Diskdata.exe program would have been much more useful had it includeded the ability to chose the drive you wanted to analyze. While not adverse to doing some work to take advantage of interesting techniques, I was disappointed not to be able to get any immediate benefit from the article.

NCharles