Number of nested-table states. For this experiment, we varied the number of unique banking products, which is the number of states of the Purchases table's ProductID column. The main case table (Customer) contained 200,000 customers (cases). From the main case table, we used five input attributes. Each customer purchased from 0 to 50 different products, which are stored in the nested table, Purchases. We varied the number of available products from 100 to 1000. Because the rate at which each customer bought products during each visit remains constant while the number of products grows, the Purchases table's nested key (ProductID) becomes sparse (i.e., contains many nulls).
Graph 5 shows the training times for increasing numbers of states in the nested table. The training took significantly more time than similar situations that didn't use a nested-table model. When the number of states of nested-table keys was less than 200, training time increased. Beyond 255 states, the training time started to decrease because when the number of nested-table keys is more than 255, the MDT algorithm uses feature-selection techniques to pick the most important keys (ProductIDs). It uses a marginal (single-node) model, that is, basic statistics gathered from the data set, to predict the remaining products. By default, 255 is the maximum number of trees a mining model can have. When the number of nested-table key states is more than 255, the nested-table key becomes sparser (i.e., each product is purchased less frequently) because the customer purchase rate remains the same. Thus, with few related patterns for each of the keys, the trees become smaller and training time decreases.
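To make the sparsity argument concrete, here is a rough sketch of how the share of customers who buy any one product shrinks as the product count grows. It isn't part of our test harness. The function name is hypothetical, the average of 25 purchases per customer is simply the midpoint of the 0-to-50 range, and the sketch assumes purchases are spread evenly across products, which real data rarely does.

```python
def purchase_rate_per_product(avg_purchases_per_customer: float,
                              num_products: int) -> float:
    """Expected fraction of customers who buy a given product.

    Assumption (not from the experiment): each customer's distinct
    purchases are spread uniformly over all available products.
    """
    if num_products <= 0:
        raise ValueError("num_products must be positive")
    return min(1.0, avg_purchases_per_customer / num_products)


if __name__ == "__main__":
    # 25 is an assumed average (midpoint of 0 to 50 purchases per customer).
    for n in (100, 255, 500, 1000):
        rate = purchase_rate_per_product(25, n)
        print(f"{n:>5} products: {rate:.1%} of customers buy a given product")
```

Under these assumptions, the per-product purchase rate falls in inverse proportion to the number of products, which is consistent with the smaller trees we observed at higher state counts.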
Number of products purchased per customer. For this experiment, we fixed the sample size at 200,000 and the number of different products in the nested table at 1000. Then, we varied the average number of products each customer bought from 10 to 100. Graph 6, page 58, shows that the MDT training times scale better than linearly with the number of products purchased per customer. This result is reasonable because training time increases roughly linearly with the number of input attributes.
Sample size of master case table. In this experiment, we fixed the average number of customer purchases at 25 and the number of states in the nested table at 200, and we varied the number of cases (customers) from 10,000 to 200,000. Because the key ratio (the number of purchases per customer) is fixed, the number of rows in the nested table will increase when the number of customers increases. Graph 7, page 58, shows that training time increases linearly with the sample size. Training a model with 200,000 customers in the main case table and 10 million rows in the nested table takes about 4 hours; training a model with 100,000 customers takes about 2 hours. Note that training 10 million cases without nested tables takes only 100 minutes, as Graph 2 shows.
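Because the relationship is linear, you can get a rough estimate for other sample sizes from the two timings quoted above. The following sketch is only a back-of-the-envelope aid. The function name is hypothetical, and the estimate applies only to this experiment's configuration (25 purchases per customer, 200 nested-table states, our test hardware) within the tested range of 10,000 to 200,000 customers.

```python
def estimate_training_hours(num_customers: int) -> float:
    """Linearly interpolate nested-table training time from two timings.

    The two reference points come from the text: about 2 hours for
    100,000 customers and about 4 hours for 200,000 customers.
    """
    if not 10_000 <= num_customers <= 200_000:
        # Outside the tested range the linear trend is unverified.
        raise ValueError("estimate is valid only for 10,000 to 200,000 customers")
    x1, y1 = 100_000, 2.0
    x2, y2 = 200_000, 4.0
    slope = (y2 - y1) / (x2 - x1)  # hours per additional customer
    return y1 + slope * (num_customers - x1)


if __name__ == "__main__":
    for customers in (50_000, 100_000, 150_000, 200_000):
        print(f"{customers:>7} customers: about "
              f"{estimate_training_hours(customers):.1f} hours")
```

Treat the result as a planning figure, not a prediction: different hardware, attribute counts, or purchase rates will shift both the slope and the starting point.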
Summary of MDT Results
Our results show that the MDT algorithm is fast and scalable within certain parameter configurations. Specifically, you get the best performance when the number of input attributes doesn't exceed 100, the number of cases doesn't exceed 10 million, the number of input attribute states doesn't exceed 20, and the number of predictable attributes doesn't exceed the number of CPUs available. Data-mining model-training performance is highly predictable and acceptable within this "sweet spot." As you might expect, our results also showed that performance for non-nested cases is considerably better than for nested cases, although nested cases bring you valuable functionality.
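If you want to check a planned model against these limits before you train it, a small helper like the following can do it. This is a sketch only. The class, field, and function names are hypothetical and aren't part of Analysis Services or OLE DB for Data Mining. The thresholds are the four sweet-spot limits listed above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MiningConfig:
    """Hypothetical description of a planned MDT training run."""
    input_attributes: int
    cases: int
    max_attribute_states: int
    predictable_attributes: int
    cpus: int


def sweet_spot_violations(cfg: MiningConfig) -> List[str]:
    """Return the sweet-spot limits that the configuration exceeds."""
    violations = []
    if cfg.input_attributes > 100:
        violations.append("more than 100 input attributes")
    if cfg.cases > 10_000_000:
        violations.append("more than 10 million cases")
    if cfg.max_attribute_states > 20:
        violations.append("more than 20 states per input attribute")
    if cfg.predictable_attributes > cfg.cpus:
        violations.append("more predictable attributes than CPUs")
    return violations


if __name__ == "__main__":
    plan = MiningConfig(input_attributes=120, cases=2_000_000,
                        max_attribute_states=15,
                        predictable_attributes=2, cpus=4)
    problems = sweet_spot_violations(plan)
    print("Within sweet spot" if not problems else "; ".join(problems))
```

An empty result means only that the configuration falls inside the range we tested. It doesn't guarantee a particular training time on your hardware.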
Our test results revealed that, within certain parameters, the training performance you can expect from the two Microsoft algorithms (MDT and Clustering) is both efficient and scalable. These factors become increasingly important when you're dealing with the larger, more complex data sets that e-business and customer relationship management (CRM) applications produce.
With Analysis Services, data mining is no longer a domain reserved for statisticians. The complexity of the data-mining algorithms is invisible to the user. Through Microsoft's OLE DB for Data Mining, much of the theoretical complexity of data mining is also invisible to the database developer. So, what does this mean for SQL Server developers and DBAs? Without having to get a degree in statistics or master special-purpose technology, SQL Server professionals can quickly learn to create and train data-mining models and embed these advanced features into their business intelligence (BI) applications. Applying the knowledge and techniques we presented in this article can help you predict the resource consumption of the data-mining part of Analysis Services so that you can better plan for its impact on your environment and its deployment in your BI and data-warehousing applications.