• subscribe
December 20, 2000 12:00 AM

Data Mining in SQL Server 2000

SQL Server Pro
InstantDoc ID #16175
Downloads
16175.zip

Data-Mining Tasks That Analysis Services Supports
You can apply data mining to many different tasks, which fall into three basic categories: outcome models, cluster models, and affinity models. Outcome models (which Microsoft calls classification) can help you predict or classify an outcome based on one or more fields, or variables, in the data set. You use cluster models, sometimes called segmentation, to group similar cases together based on the shared values of many fields in a data set. Affinity models—including association, sequence, and deviation analysis—and dependency modeling typically show the relationship or sequencing between one field and another field. SQL Server 2000 Analysis Services provides two basic data-mining algorithms to support classification and clustering: decision trees and cluster analysis.

Outcome models with decision trees. Outcome modeling uses a set of input variables to predict or classify the value of a target, or response, variable (the outcome). The target variable can be categorical (having discrete values such as reply/did not reply) or continuous (having values such as dollar amount purchased). When the target is categorical, you call it a classification task—a model that shows which combinations of the input variables you can use to reliably classify the target. When the target variable is continuous, the model is typically described as a regression model. Regression is the most common type of analysis that attempts to predict values of a continuous target variable based on the combined values of the input variables. For simplicity, Microsoft uses classification to mean both classification and regression trees. If this is confusing, remember that decision trees can predict or classify both discrete (some values) and continuous (many numeric values) outcomes.

Decision trees are a common and robust technique for carrying out predictive modeling tasks that have an outcome field to train on. Decision trees are easy to work with, produce a highly readable graphic display, and work well with both categorical and continuous data. Table 1 shows how you might arrange data to measure the response to an invitation to an IT conference. Although this data set is small, you would have a difficult time determining through visual inspection alone which of the attributes (columns) in the data set, if any, were predictive of people's likelihood of accepting (replying to) the invitation. Imagine, for example, trying to determine what influences the probability of response in a response database of more than 10,000 records: Is it job title? Gender? Number of employees or size of sales? Seeing two-variable predictive relationships is difficult, but it is impossible to see the combinations of predictive relationships that produce a strong predictive classification of the likelihood of replying.

Figure 1 shows a decision tree that reveals the predictive structure of the data: Company size (as measured by number of employees) seems to be the strongest predictor of attendance. The overall attendance rate (the number who responded to the invitation) is 40 percent. You can see that 75 percent of the invitees from large companies attended the conference, whereas only 27 percent of the invitees from small companies attended. So, employees of large companies are approximately three times as likely to attend the conference as those of small companies. However, this characteristic of small companies reverses when you consider sales income: In the two cases where sales income for small companies was less than $1 million, attendance is 100 percent. This number reveals that, in this sample, all employees from companies with less than $1 million in annual revenue attended the conference. I show these results as an example; you would never base results on such a small number of records unless you had completed substantial testing with other data sets to verify that this pattern repeats reliably in the target consumer population.

As Figure 1 shows, a decision tree works by collecting the overall data set (which is usually presented as the origin or root node of a decision tree at the top of the figure), and finding ways of partitioning the records or cases that occur in the root node to form branches. The branches are in the form of an upside-down tree, and the nodes at the ends of the branches are usually called leaves.

In a real-life targeted marketing task, you'd have many more attributes (columns) for each potential conference attendee and more potential attendees. When the scale of the problem increases, manually assessing predictive characteristics is difficult, so automatic techniques become necessary. However, assessing the predictive power of multiple attributes remains difficult, although to a lesser extent, even with sophisticated OLAP tools. The role of data mining and decision trees is to provide the ability to assess the combined predictive power of multiple attributes. You can get a sense of how this assessment works by examining Figure 1. You can see that, overall, 40 percent of the data set members responded to the offer. However, when you consider only small firms (EmploySize: Small) that have large sales (SalesSize: $1M+), the response rate drops to about 11 percent. This example shows that you can measure the ability to assess the drop in response rate as more attributes are introduced into the analysis, then use that measure to produce an indicator of the model's predictive power. Decision trees scale well with fields that have attributes with many values and many data records. Because of this scalability, decision trees can be extremely useful for a wide range of predictive modeling and classification tasks.

Segmentation (cluster analysis). Segmentation is the process of grouping or clustering cases based on their shared similarities to a set of attributes. Decision trees also find segments but determine the segments based on a particular outcome variable, such as conference attendance. So the values, which are shown as numeric codes or string values, on one branch of the decision tree form a cluster where the cases in that cluster (a leaf on the decision tree) have a shared similarity in terms of the attribute of the branch that forms the decision tree. For example, in Figure 1, you see that the small, high-dollar-volume customers in the analysis form a segment that has the lowest response rate of any other segment in the decision tree. The decision tree forms a branch that identifies the shared similarity of the cases in the leaf (cluster) and shows the choice that was made between the two outcomes in the node—in this example, Reply or No Reply. If no outcome variable exists or if you want to see how observations group together in terms of their shared values in multiple outcome variables, then cluster analysis is the technique to choose.

Cluster analysis forms groups of cases that are as homogeneous as possible on several shared attributes—such as height, weight, and age—yet are as different as possible when compared with any other clusters that are themselves homogeneous. For example, a cluster analysis might identify all tall, heavy, younger cases in one cluster and all short, light, older cases in another cluster. And cases that have similar purchasing or spending patterns form easily identified market segments toward which you can target different products. In terms of personalized interaction, different clusters can provide strong cues to suggest different treatments.

Several techniques have evolved over time to carry out cluster analysis tasks; one of the oldest of these cluster analysis techniques is K-means cluster analysis. In K-means cluster analysis, the user assigns several means that will serve as bins, or clusters, to hold the observations in the data set. Cases are then allocated to each of the bins depending on their shared similarity. Analysis Services employs a randomly assigned K-means, nearest-neighbor cluster analysis approach.

Implementing OLE DB for Data Mining
OLE DB for Data Mining addresses the utilization of the data-mining interface and the management of the user interface (UI). The solution Microsoft provides in SQL Server 2000 allows several data-mining extensions to this interface and supports data-mining wizards, which guide the user through the data-mining activity. The OLE DB for Data Mining extension enables OLAP applications, user applications, and system services such as Commerce Server and a variety of third-party tools and applications to plug in to an interconnectivity capability. Figure 2 shows a general implementation scenario for OLE DB for Data Mining.

The OLE DB for OLAP and OLE DB for Data Mining specifications give wizards and third-party applications access to data-mining services. In addition to the OLE DB for Data Mining specification, key components of the data-mining environment include the DSO data-mining model and the Data Mining Engine, which comprises both decision tree and cluster analysis. Analysis Services and any OLE DB for Data Mining-compliant third-party process can plug in to this environment to define, create, and manipulate data-mining models. If third-party functionality uses the OLE DB for Data Mining interface, that functionality can be published for use in this environment.

At a system level, Microsoft has extended the DSO model to support the addition of the data-mining model object type. Microsoft created server components to provide a built-in capability to use both OLAP and data-mining capabilities. This capability is a core defining feature of the new Analysis Services. On the client side, the implementation provides client OLAP and Data Mining Engine capabilities to exploit Analysis Services running on the server. The client provides complete access to both OLAP and data-mining models through the OLE DB for Data Mining specification.

Finally, by issuing the OLE DB for Data Mining specification, Microsoft provides a facility for third parties on both the server and client sides to use OLE DB for Data Mining COM interfaces to provide data-mining capabilities as plugins. This plug-in capability lets you add data-mining functionality in environments that conform to the OLE DB for Data Mining specification. Currently, several third-party tool and application vendors, notably the members of the Microsoft Data Warehousing Alliance, provide this kind of extensibility. Three data-mining providers are members of the Data Warehousing Alliance: Angoss Software, DBMiner Technology, and Megaputer Intelligence.

Microsoft designed its Data Warehousing Framework to unify BI needs and solution matching in one fast, flexible, and low-cost foundation. The Data Warehousing Alliance provides solutions in Data Extension, Transformation, and Loading (ETL) tools; analytical applications; querying, reporting, and analysis tools; and data-mining providers. For more information about the Data Warehousing Framework and BI, see http://www.microsoft.com/industry/bi.

The Bottom Line
SQL Server 2000 will have a significant impact on promoting BI awareness and capability just as the evolution of data warehousing, decision-support systems, and OLAP has previously. As enterprises move to higher levels of capability in the capture, storage, and manipulation of data, they are more often using data to analyze business directions and for standard reporting functions. OLAP approaches, introduced with SQL Server 7.0, and data-mining approaches, introduced in SQL Server 2000, are complementary and synergistic data-analysis capabilities. These technologies are now available in a common format, through a common interface, and with a common application interface specification to guide you in creating, using, and deploying data-analysis products. This groundbreaking development will lead to a generalized and unified view of dimensional reporting.



ARTICLE TOOLS

Comments
  • cheng
    11 years ago
    May 07, 2001

    I am just thinking about this topic.
    I found this article helpful.

You must log on before posting a comment.

Are you a new visitor? Register Here