• subscribe
August 23, 2006 12:00 AM

Win the Lottery on Your Way to Analysis Services 2005 Mastery

Use BI applications to predict your way to enormous wealth
SQL Server Pro
InstantDoc ID #92822
Downloads
92822.zip

Most large companies these days have adopted some form of business intelligence (BI) platform,incorporating data warehouses,data analytics, and mining to transform information into profitable decisions. I’ve gained much experience working with SQL Server 2005 Analysis Services’new BI tools, and I recently came to the greedy realization that I could use these same BI applications to find predictable patterns in data for which the profit margin would be off the chart—for myself. Taking a rich (pun intended) data set such as, oh, state lottery draws over eight years and analyzing the patterns—even though most rational people would argue that such data is purely random—seemed like a straightforward and fun first project, so I decided to tackle it. The results were interesting.

Want to know the steps I took to build my Lottery Prediction Analysis Services project and train two data-mining models, comparing the different algorithms and results? Look no further. Or perhaps you think this kind of experimental exercise is merely frivolous? If so, think again. Although using SQL Server 2005 data-mining techniques to try to win the lottery might seem no more than a fun diversion, you will be able to directly translate these techniques into your business projects, as I have in several healthcare scenarios. Often, fun exercises are the best way to learn otherwise intimidating tools and technologies—and, as a side benefit, this article might just make you disgustingly wealthy.

Loading Data
The first step of data analysis is to load data into tables. Depending on the level of data transformation that will be necessary, populating database tables with source data can either be as painless as using the built-in import features of SQL Server 2005’s SQL Server Management Studio or as arduous as slogging through SQL Server Integration Services (SSIS). For my project, I took the painless route of importing lottery data into a database I named WinBig, in a single table of seven fields.The table, which I called lottery_load,consisted simply of a field to store the date of the drawing (DrawDate) and six number fields (N1 through N6) to hold the individual lottery numbers. I was able to easily find the entire dataset of all Florida Lotto drawings from May 7, 1998, to May 27, 2006, with a total of 1287 individual drawings. (I’m sure most states with lotteries have similar lists on their Web sites.)

After some coaxing to get the HTML list of drawings into a delimited dataset of the seven fields I needed, I stepped through the SQL Server Import and Export Wizard by right-clicking the database,then selecting Tasks and Import Data.When I was done, I had my data source to use for the Analysis Services project and was ready to discover hidden riches buried deep in the digits. (I should note that the process of extracting, transforming, and loading data from source to destination, as in a data-warehouse scenario, is typically the most time-consuming task of an Analysis Services project,especially when you’re training mining models, because you want your data to be as accurate as possible.)

I would now perform all data analysis in Business Intelligence Management Studio (BIDS), a Visual Studio 2005 shell application that ships free with SQL Server 2005. I created a new Analysis Services project by selecting File, New, Project. One of the project types you can select is Analysis Services Project. I named the project Lottery Prediction. In the Solution Explorer window that Figure 1 shows, you can see all the objects available to create. Before I could create a Mining Structure, I needed to create two of these objects: Data Sources and Data Source Views.

Data Sources. Creating the data source is as easy as right-clicking the Data Sources folder in BIDS and selecting New Data Source. The New Data Source wizard walks you through the process of choosing a server, authentication method, and database. In my case, I chose SRS01 as the server and WinBig as the database that contains the lottery_load table that I pre-populated with the lottery-drawing history.

Data Source Views. The Data Source Views step is a little more involved, primarily because simply choosing the table wouldn’t be sufficient for analysis. What I needed was a custom query that would provide not only the historical lottery numbers but also values that could be used as input to train the subsequent mining models. For example, suppose you’re trying to predict what type of product a customer will purchase based on certain criteria gathered about that customer. Knowing a customer’s demographics (e.g., income, marital status, geographic region) is a key component of determining the purchasing decisions he or she might make.All this associated data is fed into the mining model as inputs for the predictable items—such as the probability that he or she will purchase a particular item. This kind of analysis matters little in the random nature of a lottery, but providing the model with this type of associative data might reveal hidden patterns. In the case of the lottery prediction, the draw date is one data element, in addition to the actual numbers, that can be used as valuable input for predictable patterns. I can create a Data Source View based on one or more source tables,as in the case of the lottery_load table, but I can also use a Named Query instead of a table to achieve more flexibility in my schema: Using a Named Query lets me join underlying tables into one entity, filtering which columns are used and filtering data based on criteria.

The easiest way to create a Data Source View based on a Named Query is to right-click the Data Source Views folder in Solution Explorer and select New Data Source View to initiate the wizard. I’ll select the table from the WinBig data source that I created in the previous step, and I’ll walk through the wizard, selecting the lottery_load table and clicking Finish to create the Data Source View named WinBig.dsv. However,after I create the Data Source View, I have the freedom to right-click the table header and select Replace Table, With New Named Query. Named Queries let me more easily define new columns that aren’t in the base table.

With a Named Query, I added several new fields to be used as input items to help the mining model algorithms predict lottery numbers.The fields I added to the base table are all derivatives of the DrawDate field. Knowing which day of the week,for example, might reveal important trends in the resulting query. Just as important could be the month and year of the drawing. Knowing the moon phase, tidal shifts, local weather fore-casts, and barometric pressure might be just as reasonable in this exercise, but we have to work with the data that’s available.

Listing 1, shows the Named Query that will drive the mining models; notice the new columns derived from the DrawDate field.The WeekPart field might prove to be important while training the models because the lottery is drawn two days per week—Wednesday and Saturday, the fourth and seventh day of the week for the WeekPart function.



ARTICLE TOOLS

Comments
  • meganbearly
    4 years ago
    May 28, 2008

    eyejay,
    Thank you for your comments. It has been a couple of years since writing this article. Since then I have wanted to revisit this idea and fine tune the DMX queries. I will possibly do this with the release of SQL Server 2008.
    Still have not hit the lottery, though.
    Rodney

  • Ian
    4 years ago
    May 24, 2008

    Excellent article, interesting cases are a good way to learn but as a bit of a novice, I was able to follow most of the text although a little more explanation of using the DMX query in Cluster mining would have been really helpful

  • gray ghost,jjgonzales
    5 years ago
    Jun 13, 2007

    Hi Rodney, Thanks for the article. I am new to sql but, by doing these types of things, it really helps to broaden my understanding.

    When I attempt to process the mining model, I get the following error message.

    OLE DB error: OLE DB or ODBC error: The text data type cannot be selected as DISTINCT because it is not comparable.; 42000.

    Thanks jjg

  • Diana
    5 years ago
    Mar 09, 2007

    I would take a look at the content types of the mining model fields making sure that the fields N1 to N6 are set to “discrete”. It might also be worthwhile to try different combinations of field types; the draw date will be the Key field and N1 to N6 can be Input Only or Input and Predict. In the DMX query, then, it is possible to use the number fields themselves in the predict join as well as the dates. For example, you might want to write a DMX query (or use the singleton query designer) that predicts the draw numbers for March, 24, 2007 but should return different results if the criteria was March, 24, 2007 AND N3 was 51 AND N2 was 21. Breaking out the DrawDate field to months, days and years, which I believe I did in the article, should also add more level of prediction.

    I hope this helps. I will have to build the solution again in order to test these concepts as it has been quite a long time since I wrote the article. I certainly do appreciate the reader’s comments.

    Thanks.
    Rodney

  • Chris
    5 years ago
    Mar 08, 2007

    Hi Rodney,

    I put together a mining package similar to the one you feature here but I can't get the query to return anything different for different input dates. It always returns the exact same set. I did get a different set for the clustering algorithm than I did for the decision tree. Any ideas as to how to modify the query so that the results reflect the day specified in the query?

You must log on before posting a comment.

Are you a new visitor? Register Here