Creating Mining Models
It’s time to create and process the mining models. SQL Server 2005 offers
many new data-mining algorithms. I chose Decision Trees and Microsoft Clustering
algorithms, partly because their graphical mining-model viewers are superior
to those of other algorithms, such as Linear Regression, but also because these
two algorithms are available in all editions of SQL Server Analysis Services.
Decision Trees is a Classification-type algorithm that can predict values based
on other attributes. Microsoft Clustering is a Segmentation-type algorithm that
can find natural groupings in data and predict outcomes based on those groupings.
The best way to create the initial Mining Structure that will contain the mining
models based on the two algorithms is to walk through the Data Mining Wizard
by right-clicking the Mining Structures folder in Solution Explorer and selecting
New Mining Structure. The Data Mining Wizard guides you through several screens
to gather input. The first question it asks is whether the structure will use
an existing relational database or warehouse or be based on an existing cube.
I’ll choose an existing relational database because I intend to use the
lottery_load table. The next question asks for the type of algorithm (or technique,
in the wizard’s terminology). I’ll have two models in one structure,
but I’ll start with the Decision Trees algorithm for the wizard. On the
next screen, I’ll select the WinBig data source view that I created and
use the lottery_load table derived from the Named Query. It’s the only
table available to select, and I’ll leave it set as a Case table instead
of a Nested table.
On the next wizard screen, I’ll select and define the data that will
train the model. As you can see in Figure 2,
all the fields from the Data SourceView are listed. You can define each field
as Key, Input, or Predictable. I’ll select the DrawDate field as the
Key field because I know each drawing will have a unique date value. For Input,
I’ll choose every field except DrawDate, and for the Predictable fields,
I’ll choose N1 through N6. After clicking Next, I’ll specify content
and data types for the fields.
There are several different content types, and each plays a significant role
in how the model will work with the input data. The three content types that
I have worked with while testing the output of the lottery-prediction models
are Discrete, Continuous, and Discretized.
Discrete values contain a finite number of items that shouldn’t be used
as values in calculations. In other words, even in my case—with fields
N1 through N6 containing numerical data—the preference would be to make
these fields Discrete and, further, to make their data type Text so that they
won’t be additive to one another. The possible lottery numbers range
from 1 to 53—a finite set of values. Think of the gender attributes,
which have two values, Male or Female. If I had a gender value to use in my
predictions, I wouldn’t want to add Male to Female, just as I don’t
want to add lottery numbers 1 to 12 to 43 to 6, and so on.
Continuous content types are measurements of numerical data and are additive.
I’ve noticed that the results of the model change when I alter the content
types. The Discrete types seem to work best for this project. Discretized value
types, which place Continuous numerical values in discrete groups,can be used
with algorithms,such as Naïve Bayes, that don’t otherwise permit Continuous
values. I could let the wizard detect the content types, but as Figure
3 shows, I set content types manually as Discrete or Key and used Text or
Date for the data types. Now, I can finish the wizard and see the new structure.
Processing
Next, I’ll process the mining model on the Mining Model tab in BIDS by
clicking the Process button in the upper left corner. When the Process window
opens, I’ll simply click Run, and when the processing is finished, I’ll
close the window and view the results.On the Mining Model Viewer,I’ll
see the derived Decision Tree. This is a simple result because the Discrete
and Text values have prevented the algorithm from performing many calculations
with the numbers. My simple goal for the model was to deliver six numbers. In
Figure 4, you can see the tree view for the
predictable value N1 and the highest to lowest cases for each number 1 to 53
for each. The probability values also appear, along with a histogram.
At this point, I decided to build a DMX query to select my six numbers before
moving on to add in the Microsoft Clustering model to the same structure and
compare the differences.
On the Mining Model Prediction tab, I found a tiny button that led me to the
Singleton Query. Singleton Queries are nice when you want to manually pass in
input for the predictable values without having to pull them from an input table
that stores values.The goal is to see what numbers the model will return for
the May 31, 2006, drawing date.That date also happens to be a Wednesday. Figure
5 shows the graphical Singleton Query. Because N1 through N6 are the only
values I used as Predictable, they’re the only ones I can select; however,
I added a Prediction Function for the probability for N1. Listing
2 shows the resulting DMX query. According to the Decision Trees algorithm,
15, 11, 26, 35, 41, and 48 will be the winning numbers for May 31, 2006.
On the second tab of the project, I clicked Add Related Model, specified Microsoft
Clustering, and named the new model Lottery_Cluster. New models added to the
base mining structure inherit the content and data types of that structure.
Therefore, all I had to do after adding the Microsoft Clustering model was to
reprocess. After doing so,I could analyze the new model results on the Mining
Model Viewer tab, which Figure 6 shows, and
I could use the same type of DMX query to predict the six numbers based on the
date input variables. The results of the Lottery_Cluster model were 2, 13, 22,
30, 34, and 53, with a probability for N1 of 5.71—not the greatest odds,
but the number does suggest that the model found a pattern marginally better
than a random guess.
Time to Play
Now, it was time to go buy some tickets. I purchased tickets for both series
of numbers, and I also added a random drawing that the lottery machine chose
for me.The randomly generated numbers were 11, 16, 20, 24, 26, and 49.
The winning numbers were 14, 24, 30, 34, 43, and 44. Is it luck that my Microsoft
Clustering algorithm hit two of the six numbers (30 and 34) whereas the random
number hit only one (24)? With further analysis and fine-tuning of the models,
I would hope to say that it wasn’t all luck. I grant that the random
nature of the drawing makes it nearly impossible to predict, which is why no
one else hit all six numbers this time. So, the total jackpot is now at $10
million and I have only three days left to retrain the models and try new algorithms.
I’ll let you know when I win.
Author’s Note: Special thanks to Jeff Dyktra, who helped me
build an elaborate normalized lottery database with many possible input values,
such as prime, odd, and even numbers. I chose a simple single table, but his
Integration Services package is worthy of an article in itself.
End of Article
Prev. page
1
[2]
next page -->