For our example, let's work with the
Classification Matrix command. Click Classification Matrix in the Data Mining tab to
bring up the Classification Matrix Wizard.
Click Next three times, then, in the Select
Source Data step, make sure the ‘Testing
Data'!'Testing Data' table is selected. Click
Next again, then click Finish; a worksheet
named Classification Matrix will be created.
According to this matrix, the model is 89.13
percent accurate in terms of predicting both
bike buyers and non-bike buyers. You might
be dismayed to discover the model correctly
identifies only 16.12 percent of the actual
bike buyers. However, keep in mind that the
model correctly identifies non-bike buyers 97.37 percent of the time. If you recall,
when we defined our business problem, we
needed to minimize the cost of a direct mail
campaign. This model identifies only a small
subset of the likely buyers, but it eliminates
most of the likely non-buyers.
I like to think of the accuracy this way:
According to the test results, the model identified 120 likely bike buyers (2.63 percent of
the non-bike buyers and 16.12 percent of
the bike buyers). Of those 120, 49 were bike
buyers, meaning the model has an accuracy
of 41 percent. That percentage is much better
than a random-guess approach (e.g., flipping
a coin), which would be accurate about
10 percent of the time (the Training Data
worksheet consists of 3,000 rows, of which
approximately 10 percent are bike buyers).
On a side note, you could use another
technique to make your model more
aggressive. This technique, called oversampling, involves manipulating the source data
to increase the frequency of rarely occurring
data. The add-in's partitioning wizard supports oversampling, and you can learn more
about this technique at the Microsoft Web
site (see "Chapter 24 - Effective Strategies
for Data Mining," in the SQL Server 2000
Resource Kit at http://www.microsoft.com/technet/prodtechnol/sql/2000/reskit/part6/c2461.mspx?mfr=true). With that said,
let's stick with the current model and look at
a profit chart for additional verification.
Merely for the purposes of this training
example, let's make some assumptions about
the size of our potential customer list, the
variable and fixed costs, and gross revenue.
Click the Profit Chart command in the
Data Mining tab, click Next twice, then,
in the Specify Profit Chart Parameters step,
enter the following values:
- Mining column to predict: BikeBuyer
- Value to predict: Yes
- Target population: 35000
- Fixed cost: 5000
- Individual cost: 5
- Revenue per individual: 35
Click Next and make sure the ‘Testing
Data'!'Testing Data' table is selected. Click
Next again, then click Finish; a new worksheet will be created. Figure 4 shows the
profit chart that's part of this worksheet.
According to the chart, you maximize
profit by targeting 11 percent of the target
population. The worksheet also indicates
the optimal "probability threshold" is 15.14
percent. In other words, when the datamining model makes a
prediction, the model
also computes a probability for the prediction. The worksheet
indicates we should
target customers with
a probability (to purchase a bike) of 15.14
percent or higher.
You'll see this probability value in the
final step.
Deploy and update
model. In this example,
Excel is your client tool, so no deployment is necessary. Let's go
ahead and run your model against a small set
of potential customers. In the Data Mining
tab, click Query, then click Next twice. In
the Select Source Data step, make sure the
‘New Customers'!'Table 17' table is selected.
Click Next again and check that all the
relationships (except BikeBuyer, of course)
are properly mapped from the worksheet to
your model attributes.
Click Next again, then click Add
Output. In the Name field, enter ProbabilityToBuy. Select BikeBuyer from the
Columns list, PredictProbability from the
Column Functions list, and Yes from the
Function Parameters list. Click OK to
close the dialog box. Complete the Wizard
by clicking Next and then Finish. A new
column, ProbabilityToBuy, is added to the
New Customers table. Using the probability
threshold of 15.14 percent from the Profit
Chart worksheet as a guide (and rounding
your new column to the nearest hundredth),
you should target 19 of the 78 potential
customers in this set. You can now run a
query against all of the potential customers
and send the final results to the marketing
director. Congratulations—you're finished!
More uses for the Data Mining
Client for excel
We used the Data Mining Client for Excel
add-in to walk through the entire datamining process. You can also use the add-in
to browse or query an existing model. For
example, if you've installed the AdventureWorks Analysis Services database sample,
you can establish a connection to the database (using the Connection command at
the Data Mining tab), then browse or query
any of the models. A practical idea is to use
Excel data as input to a data-mining query.
With this approach, the model is ultimately
built and maintained by IT folks (and might
be processed using very large volumes of
historical data) but is available to end users
for validation, browsing, and querying.