Most large companies these days have adopted some form of business intelligence
(BI) platform,incorporating data warehouses,data analytics, and mining to transform
information into profitable decisions. I’ve gained much experience working
with SQL Server 2005 Analysis Services’new BI tools, and I recently came
to the greedy realization that I could use these same BI applications to find
predictable patterns in data for which the profit margin would be off the chart—for
myself. Taking a rich (pun intended) data set such as, oh, state lottery draws
over eight years and analyzing the patterns—even though most rational
people would argue that such data is purely random—seemed like a straightforward
and fun first project, so I decided to tackle it. The results were interesting.
Want to know the steps I took to build my Lottery Prediction Analysis Services
project and train two data-mining models, comparing the different algorithms
and results? Look no further. Or perhaps you think this kind of experimental
exercise is merely frivolous? If so, think again. Although using SQL Server
2005 data-mining techniques to try to win the lottery might seem no more than
a fun diversion, you will be able to directly translate these techniques into
your business projects, as I have in several healthcare scenarios. Often, fun
exercises are the best way to learn otherwise intimidating tools and technologies—and,
as a side benefit, this article might just make you disgustingly wealthy.
Loading Data
The first step of data analysis is to load data into tables. Depending on the
level of data transformation that will be necessary, populating database tables
with source data can either be as painless as using the built-in import features
of SQL Server 2005’s SQL Server Management Studio or as arduous as slogging
through SQL Server Integration Services (SSIS). For my project, I took the painless
route of importing lottery data into a database I named WinBig, in a single
table of seven fields.The table, which I called lottery_load,consisted simply
of a field to store the date of the drawing (DrawDate) and six number fields
(N1 through N6) to hold the individual lottery numbers. I was able to easily
find the entire dataset of all Florida Lotto drawings from May 7, 1998, to May
27, 2006, with a total of 1287 individual drawings. (I’m sure most states
with lotteries have similar lists on their Web sites.)
After some coaxing to get the HTML list of drawings into a delimited dataset
of the seven fields I needed, I stepped through the SQL Server Import and Export
Wizard by right-clicking the database,then selecting Tasks and Import Data.When
I was done, I had my data source to use for the Analysis Services project and
was ready to discover hidden riches buried deep in the digits. (I should note
that the process of extracting, transforming, and loading data from source to
destination, as in a data-warehouse scenario, is typically the most time-consuming
task of an Analysis Services project,especially when you’re training
mining models, because you want your data to be as accurate as possible.)
I would now perform all data analysis in Business Intelligence Management Studio
(BIDS), a Visual Studio 2005 shell application that ships free with SQL Server
2005. I created a new Analysis Services project by selecting File, New, Project.
One of the project types you can select is Analysis Services Project. I named
the project Lottery Prediction. In the Solution Explorer window that Figure
1 shows, you can see all the objects available to create. Before I could
create a Mining Structure, I needed to create two of these objects: Data Sources
and Data Source Views.
Data Sources. Creating the data source is as easy as right-clicking
the Data Sources folder in BIDS and selecting New Data Source. The New Data
Source wizard walks you through the process of choosing a server, authentication
method, and database. In my case, I chose SRS01 as the server and WinBig as
the database that contains the lottery_load table that I pre-populated with
the lottery-drawing history.
Data Source Views. The Data Source Views step is a little more
involved, primarily because simply choosing the table wouldn’t be sufficient
for analysis. What I needed was a custom query that would provide not only the
historical lottery numbers but also values that could be used as input to train
the subsequent mining models. For example, suppose you’re trying to predict
what type of product a customer will purchase based on certain criteria gathered
about that customer. Knowing a customer’s demographics (e.g., income,
marital status, geographic region) is a key component of determining the purchasing
decisions he or she might make.All this associated data is fed into the mining
model as inputs for the predictable items—such as the probability that
he or she will purchase a particular item. This kind of analysis matters little
in the random nature of a lottery, but providing the model with this type of
associative data might reveal hidden patterns. In the case of the lottery prediction,
the draw date is one data element, in addition to the actual numbers, that can
be used as valuable input for predictable patterns. I can create a Data Source
View based on one or more source tables,as in the case of the lottery_load table,
but I can also use a Named Query instead of a table to achieve more flexibility
in my schema: Using a Named Query lets me join underlying tables into one entity,
filtering which columns are used and filtering data based on criteria.
The easiest way to create a Data Source View based on a Named Query is to right-click
the Data Source Views folder in Solution Explorer and select New Data Source
View to initiate the wizard. I’ll select the table from the WinBig data
source that I created in the previous step, and I’ll walk through the
wizard, selecting the lottery_load table and clicking Finish to create the Data
Source View named WinBig.dsv. However,after I create the Data Source View, I
have the freedom to right-click the table header and select Replace Table, With
New Named Query. Named Queries let me more easily define new columns that aren’t
in the base table.
With a Named Query, I added several new fields to be used as input items to
help the mining model algorithms predict lottery numbers.The fields I added
to the base table are all derivatives of the DrawDate field. Knowing which day
of the week,for example, might reveal important trends in the resulting query.
Just as important could be the month and year of the drawing. Knowing the moon
phase, tidal shifts, local weather fore-casts, and barometric pressure might
be just as reasonable in this exercise, but we have to work with the data that’s
available.
Listing 1, shows the Named Query that will
drive the mining models; notice the new columns derived from the DrawDate field.The
WeekPart field might prove to be important while training the models because
the lottery is drawn two days per week—Wednesday and Saturday, the fourth
and seventh day of the week for the WeekPart function.