Optimize data-mining model training with Analysis Services

Data mining helps users find hidden patterns and useful business information in large data sets. Corporations can use these patterns and rules to improve their marketing, sales, and customer-support operations by better understanding their customers. The most resource-intensive task in data mining is model training. A business analyst might need to mine millions of cases with hundreds of attributes and hundreds of states (values) to see the patterns that lead to effective revenue prediction or customer segmentation. Fortunately, you can optimize the performance of SQL Server 2000's data-mining model training.

In "Put Data Mining to Work," November 2001, we introduced some data-mining terms and examined how Microsoft implemented data mining in SQL Server 2000 Analysis Services. We also discussed the Microsoft Decision Trees (MDT) and Microsoft Clustering algorithms in some detail and demonstrated how you can use Analysis Services to build data-mining models to solve sample business problems. In this article, we summarize the results of our testing of the MDT algorithm and suggest some parameters that will get you the most consistent, satisfactory model-training performance.

Microsoft added new functionality to SQL Server 2000 to perform data-mining tasks on both relational data (in relational tables) and multidimensional data (in OLAP cubes). In addition to the data-mining algorithms, this functionality includes extensions to T-SQL and the OLE DB extension OLE DB for Data Mining, which is the result of a collaboration between Microsoft and other data-mining providers. Through the use of OLE DB's Data Shaping Service and a special column content type called a table column, OLE DB for Data Mining lets you use both nested cases and non-nested (simple) cases for training and prediction. (For more information about column content types, see the Web sidebar "Column Content Types" at http://www.sqlmag.com, InstantDoc ID 23092.)
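
To make the distinction between nested and non-nested cases concrete, the following Python sketch models a case as a record whose table column may hold zero or more nested rows. It only illustrates the shape of the data; it is not the OLE DB for Data Mining API, and the class names, column names, and sample values are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Purchase:
    """One row of the nested table (hypothetical columns)."""
    product: str
    quantity: int


@dataclass
class CustomerCase:
    """One training case: scalar attributes plus an optional nested table."""
    customer_id: int
    age: int
    gender: str
    purchases: List[Purchase] = field(default_factory=list)


def flatten(case: CustomerCase) -> List[Tuple[int, int, str, str, int]]:
    """Join the case to its nested rows, as a plain relational join would."""
    return [
        (case.customer_id, case.age, case.gender, p.product, p.quantity)
        for p in case.purchases
    ]


# A non-nested (simple) case: scalar attributes only.
simple_case = CustomerCase(customer_id=1, age=42, gender="F")

# A nested case: the same scalar attributes plus a table column.
nested_case = CustomerCase(
    customer_id=2,
    age=35,
    gender="M",
    purchases=[
        Purchase("checking", 1),
        Purchase("mortgage", 1),
        Purchase("credit card", 2),
    ],
)

print(len(flatten(nested_case)))  # one flat row per nested row
```

In this sketch the nested case remains a single case for training even though a relational join would spread it across three rows, which is the situation a table column is meant to handle.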

In testing SQL Server 2000's data-mining model-training performance, we conducted hundreds of experiments, focusing on various factors in the training data that affect the scalability of performance. These factors are the training data set's parameter configurations. For example, we varied the number of input attributes, the sample size of the training cases, the number of states for each input attribute, and the number of predictable (output) attributes. With each test, we varied the parameters for the test factor across a range of typical values while holding the other factors constant so that we could isolate the effect on performance of the parameter being tested. We included information about the training times for both non-nested and nested cases. You can find more details about this study in the white paper "Performance Study of Microsoft Data Mining Algorithms" (http://www.unisys.com/windows2000/default-07.asp).

About the Performance Tests
After you create the data-mining model, the next step is training. The model-creation process creates only the model's structure; training adds the model's contents. First, you use sample (training) data to populate the model. Then, you use a data-mining algorithm to evaluate that data and find patterns. Training is usually the most time-consuming step in the data-mining process because the algorithm might have to iterate over the training data set several times to find the hidden patterns. After you train a model, you can query it. Of course, when any input attribute, behavior, or environment changes, or when new data becomes available for training, you need to retrain the model to reflect the changes.

We performed hundreds of experiments that looked at model-training performance based on different parameter configurations of the training data set. We divided the tests into two studies: one for the MDT algorithm and one for the Microsoft Clustering algorithm. We further divided each study into a group of experiments for non-nested cases and a group for nested cases. Each experiment consisted of several tests in which we measured the training time as a function of the test factor as we varied that factor across a range of discrete values. Meanwhile, we held all the other factors constant. For example, we measured the effect of the number of input attributes while holding constant the sample size, number of states of each input attribute, and number of predictable attributes. Each measurement produced a data point in the result graph.
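
The one-factor-at-a-time design described above can be sketched in a few lines of Python. This is an illustration of the method only, not the harness used in the study: the baseline values, the value ranges, and the `train` callable passed to `time_training` are all hypothetical placeholders.

```python
import time
from typing import Callable, Dict, List, Tuple

# Hypothetical baseline: the value each factor keeps while it is held constant.
BASELINE: Dict[str, int] = {
    "input_attributes": 20,
    "cases": 100_000,
    "states_per_attribute": 5,
    "predictable_attributes": 1,
}

# Hypothetical discrete values to sweep for each test factor.
RANGES: Dict[str, List[int]] = {
    "input_attributes": [10, 20, 50, 100, 200],
    "cases": [10_000, 100_000, 1_000_000],
    "states_per_attribute": [2, 5, 10, 25],
    "predictable_attributes": [1, 2, 4, 8],
}


def one_factor_plan(
    baseline: Dict[str, int], ranges: Dict[str, List[int]]
) -> List[Tuple[str, Dict[str, int]]]:
    """Build one test per (factor, value), leaving every other factor at baseline."""
    plan = []
    for factor, values in ranges.items():
        for value in values:
            config = dict(baseline)   # copy so the baseline stays untouched
            config[factor] = value    # vary only the factor under test
            plan.append((factor, config))
    return plan


def time_training(train: Callable[[Dict[str, int]], None], config: Dict[str, int]) -> float:
    """Return the elapsed seconds for one training run (one data point on a graph)."""
    start = time.perf_counter()
    train(config)
    return time.perf_counter() - start


plan = one_factor_plan(BASELINE, RANGES)
print(len(plan), "tests planned")
```

Each entry in `plan` corresponds to one measurement; grouping the timings by the first element of each tuple gives one curve per test factor.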

Many factors can influence a case, and the number of cases in most data-mining samples can be in the millions. As the scale of the data set expands, identifying which factors have the greatest influence on performance becomes more difficult. When resource constraints rather than business considerations influence the selection of cases, the algorithm can have even more difficulty identifying the predictable attributes in the data. For example, because of time and resource constraints, someone might use sampling techniques to identify only a subset of input cases when a more thorough selection technique would be a better choice. If the input data isn't homogeneous enough, using a sampling technique might leave behind valuable data.
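
The risk of sampling from data that isn't homogeneous can be illustrated with a small Python sketch. It contrasts a simple random sample with a stratified sample, a technique chosen here only as one example of a more careful selection method; the article doesn't prescribe it. The segment names, case counts, and function names are hypothetical.

```python
import random
from collections import defaultdict
from typing import Callable, Dict, Hashable, List


def simple_random_sample(cases: List[dict], k: int, seed: int = 0) -> List[dict]:
    """Draw k cases uniformly at random; rare groups can be missed entirely."""
    return random.Random(seed).sample(cases, k)


def stratified_sample(
    cases: List[dict], k: int, key: Callable[[dict], Hashable], seed: int = 0
) -> List[dict]:
    """Sample roughly in proportion to group size, keeping at least one case per group.

    Because of rounding and the one-case minimum, the result can be slightly
    larger or smaller than k.
    """
    rng = random.Random(seed)
    strata: Dict[Hashable, List[dict]] = defaultdict(list)
    for case in cases:
        strata[key(case)].append(case)
    sample: List[dict] = []
    for group in strata.values():
        share = max(1, round(k * len(group) / len(cases)))
        sample.extend(rng.sample(group, min(share, len(group))))
    return sample


# Hypothetical, deliberately skewed data: 990 retail cases, 10 private-banking cases.
cases = [{"id": i, "segment": "retail"} for i in range(990)]
cases += [{"id": 990 + i, "segment": "private_banking"} for i in range(10)]

random_segments = {c["segment"] for c in simple_random_sample(cases, 50)}
stratified_segments = {c["segment"] for c in stratified_sample(cases, 50, lambda c: c["segment"])}

print("simple random sample covers:", sorted(random_segments))
print("stratified sample covers:  ", sorted(stratified_segments))
```

The stratified sample always includes the small segment by construction, whereas the simple random sample includes it only by chance.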

We ran our tests on a Unisys e-@ction Aquanta ES5045R server with four Pentium III Xeon 550MHz CPUs. On this server, we ran Windows 2000 Advanced Server and SQL Server 2000 Enterprise Edition with Analysis Services. We configured the server with a 512MB cache, 4GB of RAM, and an OSM7700 Fibre Channel data storage unit with five 9GB disks in each RAID 5 disk array. The server was connected to a 100Mbps Ethernet network.

We generated the test data from the typical retail banking scenarios that Table 1 shows. We used what we felt were typical value ranges for each of the factors that influence performance. Decision trees are useful for classifying data and predicting outcomes, especially across a few broad categories. Our performance study for the MDT algorithm consisted of seven experiments: four for non-nested cases and three for nested cases. Each experiment measured the effect that one factor had on the scalability of performance over a typical range of values. For each of the seven experiments, Table 2 shows the factor we tested, the case type (non-nested or nested), and the range of values we used for each test factor. For each case, the MDT algorithm considered hundreds of attributes and millions of records to come up with the decision tree that best described the rules for prediction. (Because of space constraints, we haven't included the details of our Clustering algorithm experiments. See the sidebar "Cluster Analysis Experiment," page 56, for more information.)
