• subscribe
April 19, 2007 12:00 AM

SQL Server 2005 Data Mining Add-ins for Office 2007

Ferret out useful information hidden in your Excel data
SQL Server Pro
InstantDoc ID #95451

For our example, let's work with the Classification Matrix command. Click Classification Matrix in the Data Mining tab to bring up the Classification Matrix Wizard. Click Next three times, then, in the Select Source Data step, make sure the ‘Testing Data'!'Testing Data' table is selected. Click Next again, then click Finish; a worksheet named Classification Matrix will be created. According to this matrix, the model is 89.13 percent accurate in terms of predicting both bike buyers and non-bike buyers. You might be dismayed to discover the model correctly identifies only 16.12 percent of the actual bike buyers. However, keep in mind that the model correctly identifies non-bike buyers 97.37 percent of the time. If you recall, when we defined our business problem, we needed to minimize the cost of a direct mail campaign. This model identifies only a small subset of the likely buyers, but it eliminates most of the likely non-buyers.

I like to think of the accuracy this way: According to the test results, the model identified 120 likely bike buyers (2.63 percent of the non-bike buyers and 16.12 percent of the bike buyers). Of those 120, 49 were bike buyers, meaning the model has an accuracy of 41 percent. That percentage is much better than a random-guess approach (e.g., flipping a coin), which would be accurate about 10 percent of the time (the Training Data worksheet consists of 3,000 rows, of which approximately 10 percent are bike buyers).

On a side note, you could use another technique to make your model more aggressive. This technique, called oversampling, involves manipulating the source data to increase the frequency of rarely occurring data. The add-in's partitioning wizard supports oversampling, and you can learn more about this technique at the Microsoft Web site (see "Chapter 24 - Effective Strategies for Data Mining," in the SQL Server 2000 Resource Kit at http://www.microsoft.com/technet/prodtechnol/sql/2000/reskit/part6/c2461.mspx?mfr=true). With that said, let's stick with the current model and look at a profit chart for additional verification.

Merely for the purposes of this training example, let's make some assumptions about the size of our potential customer list, the variable and fixed costs, and gross revenue. Click the Profit Chart command in the Data Mining tab, click Next twice, then, in the Specify Profit Chart Parameters step, enter the following values:

  • Mining column to predict: BikeBuyer
  • Value to predict: Yes
  • Target population: 35000
  • Fixed cost: 5000
  • Individual cost: 5
  • Revenue per individual: 35

Click Next and make sure the ‘Testing Data'!'Testing Data' table is selected. Click Next again, then click Finish; a new worksheet will be created. Figure 4 shows the profit chart that's part of this worksheet.

According to the chart, you maximize profit by targeting 11 percent of the target population. The worksheet also indicates the optimal "probability threshold" is 15.14 percent. In other words, when the datamining model makes a prediction, the model also computes a probability for the prediction. The worksheet indicates we should target customers with a probability (to purchase a bike) of 15.14 percent or higher. You'll see this probability value in the final step.

Deploy and update model. In this example, Excel is your client tool, so no deployment is necessary. Let's go ahead and run your model against a small set of potential customers. In the Data Mining tab, click Query, then click Next twice. In the Select Source Data step, make sure the ‘New Customers'!'Table 17' table is selected. Click Next again and check that all the relationships (except BikeBuyer, of course) are properly mapped from the worksheet to your model attributes.

Click Next again, then click Add Output. In the Name field, enter ProbabilityToBuy. Select BikeBuyer from the Columns list, PredictProbability from the Column Functions list, and Yes from the Function Parameters list. Click OK to close the dialog box. Complete the Wizard by clicking Next and then Finish. A new column, ProbabilityToBuy, is added to the New Customers table. Using the probability threshold of 15.14 percent from the Profit Chart worksheet as a guide (and rounding your new column to the nearest hundredth), you should target 19 of the 78 potential customers in this set. You can now run a query against all of the potential customers and send the final results to the marketing director. Congratulations—you're finished!

More uses for the Data Mining Client for excel
We used the Data Mining Client for Excel add-in to walk through the entire datamining process. You can also use the add-in to browse or query an existing model. For example, if you've installed the AdventureWorks Analysis Services database sample, you can establish a connection to the database (using the Connection command at the Data Mining tab), then browse or query any of the models. A practical idea is to use Excel data as input to a data-mining query. With this approach, the model is ultimately built and maintained by IT folks (and might be processed using very large volumes of historical data) but is available to end users for validation, browsing, and querying.



ARTICLE TOOLS

Comments
    There are no comments to display. Be the first one!
You must log on before posting a comment.

Are you a new visitor? Register Here