• subscribe
January 18, 2005 12:00 AM

Data Mining Reloaded

What's new for data mining in SQL Server 2005 Analysis Services
SQL Server Pro
InstantDoc ID #44821

The two main functions of data mining are classification and prediction (or forecasting). Data mining helps you make sense of those countless gigabytes of raw data stored in databases by finding important patterns and rules present in the data or derived from it. Analysts then use this knowledge to make predictions and recommendations about new or future data. The main business applications of data mining are learning who your customers are and what they need, understanding where the sales are coming from and what factors affect them, fashioning marketing strategies, and predicting future business indicators.

With the release of SQL Server 2000, Microsoft rebranded OLAP Services as Analysis Services to reflect the addition of newly developed data-mining capabilities. The data-mining toolset in SQL Server 2000 included only two classical analysis algorithms (Clustering and Decision Trees), a special-purpose data-mining management and query-expression language named DMX, and limited client-side controls, viewers, and development tools.

SQL Server 2005 Analysis Services comes with a greatly expanded set of data-mining methods and an array of completely new client-side analysis and development tools designed to cover most common business intelligence (BI) needs. The Business Intelligence Framework in SQL Server 2005 offers a new data-mining experience for analysts and developers alike.

Let's quickly review the data-mining process. Then we'll explore the seven data-mining algorithms available in the SQL Server 2005 Analysis Services framework and briefly look at the "plug-in" technology that can help you add new and custom algorithms to that framework. Although we couldn't specifically address the data-mining UI design here, the snapshots included in several examples will give you a good first look at the power and usability of the new client-side tools.

Mining the Data
The design and deployment of a data-mining application consists of seven logical steps. First, you prepare the data sources: Identify the databases and connection protocols you want to use. Next, you describe the data-source views—that is, list tables that contain data for analysis. Third, define the mining structure by describing which columns you want to use in the models. The fourth step is to build mining models. SQL Server 2005 gives you seven data-mining algorithms to choose from—you can even apply several methods in parallel to each mining structure, as Figure 1 shows. The fifth step is called processing—that's where you get the mining models to "extract knowledge" from the data arriving from the data sources. Sixth, you evaluate the results. Using client-side viewers and accuracy charts, you can present the patterns and predictions to analysts and decision makers, then make necessary adjustments. Finally, incorporate data mining into your overall data-management routine—having identified the methods that work best, you'll have to reprocess the models periodically in order to track new data patterns. For instance, if your data source is email and your models predict spam, you'll need to retrain the models often to keep up with evolving spammer tactics.

Here's a quick example of a useful mining model. Let's say you're interested in identifying major groups of potential customers based on census data that includes occupational, demographic, and income profiles of the population. A good method for identifying large, characteristic census groups is to use the Clustering algorithm. This algorithm segments the population into clusters so that people in one cluster are similar and people in different clusters are dissimilar in one or more ways. To examine those clusters, you can use a tool called Microsoft Cluster Viewer (a standard Analysis Services 2005 component). Figure 2 shows one of the four views, giving you a side-by-side comparison of all the clusters. For instance, Clusters 6 and 7 correspond to persons not on active military duty. But Cluster 7 represents people who work longer hours for more income; the top row also suggests that people in Cluster 7 are mostly married.

Prediction and Mutual Prediction
Suppose you've selected just one column (e.g., Income) in a huge data table, designated that column as Prediction target, and now you're trying to make some predictions. But you probably won't get far by looking at just one column. You can compute the statistical mean and the variance range, but that's about it.

Instead, select specific values for one or more other columns (e.g., Age, Years of Experience, Education, Workload in census data tables) and focus only on those data rows that correspond to the selected values. You'll likely find within this subset of rows that the values of the target column fall into a relatively narrow range—now you can predict the values in the target column with some degree of certainty. In data-mining terms, we say that those other columns predict the target column.

Figure 3 shows a snapshot the Dependency Network (DepNet) data-mining control. This DepNet is a diagram where arrows show which of the census columns predict which others. Some of the edges (nodes) have arrows pointing both ways; this is called mutual prediction. Mutual prediction between A and B means that setting values of A reduces the uncertainty in column B, but also the other way around—picking a value of B would reduce the uncertainty of A.

All Microsoft data-mining techniques can track prediction, but different algorithms make predictions in different ways. As we examine the other data-mining methods, we point out the prediction specifics of each method.

Decision Trees
Prediction is the main idea behind the Microsoft Decision Trees (DT) algorithm. The knowledge that a DT model contains can be represented graphically in tree form, but it could also appear in the form of "node rules." For example, in a census decision tree for Income, a rule such as (Gender = Male and 1 < YearsWorked < 2) could describe a tree node containing the income statistics for males in their second year on the job. This node corresponds to a well-defined subpopulation of workers, and you should be able to make fairly specific predictions with regards to their income. Indeed, one of the census models gave the following formula under the condition of (Gender = Male and 1 < YearsWorked < 2):

INCOME = 24804.38+425.99*( YRSSRV -1.2)
+392.8*(HOURS-40.2) + 4165.82*(WORKLWK-1.022) 
± 24897

According to this formula, INCOME is defined mostly by YRSSRV and weekly overtime. (Note that this is just an example and not based on representative census data.) To obtain this equation in a visually simple way, you could use the Decision Tree viewer to view the Income tree and zoom in on a node corresponding to the gender and yearsworked values of interest, as the typical snapshot in Figure 4 shows.

The rule and the formula we've discovered identify gender, years of service, years worked, weekly hours, and workload as predictors for income. Because YRSSRV, HOURS, and WORKLWK appear in the above formula for INCOME, they're also called regressors. A decision tree that hosts such predictive formulas is called a regression tree.



ARTICLE TOOLS

Comments
    There are no comments to display. Be the first one!
You must log on before posting a comment.

Are you a new visitor? Register Here