Discover who your customers really are
Data mining helps users find hidden patterns and useful business information in large data sets. Corporations can use these patterns and rules to improve their marketing, sales, and customer-support operations by better understanding their customers. These patterns can help solve many business problems, such as figuring out which customers are most likely to leave or discovering what other products certain customers are most likely to be interested in. In these scenarios, the key activities are discovering inherent (but not always obvious) patterns in the data and then, based on the patterns, making predictions.
In today's e-business environment, data mining is beginning to garner more attention. Many organizations are realizing that it has the potential to be an essential component of their IT architecture and business-development strategy. Because data mining is about exploration and analysis, either by automatic or semiautomatic means, large quantities of data can help business analysts uncover meaningful patterns and rules. Corporations have accumulated very large databases from enterprise resource planning (ERP) or customer relationship management (CRM) applications and other operational systems. Data-mining techniques can help you put your data to work as you extract the patterns.
With SQL Server 2000, Microsoft introduced data-mining functionality as part of Analysis Services. Besides adding this functionality to SQL Server 2000, Microsoft joined several leading data-mining providers to initiate the OLE DB for Data Mining API. The API defines a data-mining query language (OLE DB for Data Mining Query Language) that's based on SQL syntax. This language treats data-mining models as a special type of relational table and treats prediction operations as a special kind of join. For explanations of the terms used in this article, see the sidebar "Data-Mining Terminology," page 40. Analysis Services includes the Microsoft data-mining provider, which is based on the OLE DB for Data Mining standard. The new provider includes two data-mining algorithms: Microsoft Decision Trees (MDT) and Microsoft Clustering. Let's look at how you can use each of them to solve typical business problems.
Microsoft Data-Mining Algorithms
Analysis Services ships with the MDT and Microsoft Clustering algorithms, which are the result of many years of research at Microsoft. Let's look briefly at both algorithms. You can find more information about these algorithms at http://citeseer.nj.nec.com/bradley98scaling.html and http://www.acm.org/sigmod/disc/p_scalableclassifsuj.htm.
MDT. The decision tree is probably the most popular technique for predictive modeling. To understand the basic workings of the decision-tree algorithm, let's look at an example. Table 1 shows a set of data that you could use to predict credit risk. In this example, we generated hypothetical information about customers, including their debt level, income level, type of employment, and an evaluation of their credit risk. Figure 1 shows a decision tree that results from this data.
In this example, the decision-tree algorithm might determine that the most significant attribute for predicting credit risk is debt level. The algorithm therefore makes the first split in the decision tree based on debt level. One of the two new nodes (Debt = High) is a leaf node containing three cases with bad credit and no cases with good credit. In this example, a high debt level is a perfect predictor of a bad credit risk. The other node (Debt = Low) is still mixed, having three good credit cases and one bad. The decision-tree algorithm then chooses employment type as the next most significant predictor of credit risk. The split on employment type has two leaf nodes that show that the self-employed people in this example have a higher bad-credit probability.
For the example, we used a small amount of synthetic data to illustrate how the decision tree can use known attributes of credit applicants to predict credit risk. In reality, each credit applicant would typically have far more attributes and the number of applicants would be very large. When the scale of the problem expands, manually extracting the rules that identify good and bad credit risks becomes difficult. But the MDT algorithm can consider hundreds of attributes and millions of records to generate a decision tree that describes rules for credit-risk prediction.
Microsoft Clustering. The Microsoft Clustering algorithm is based on the Expectation and Maximization (EM) algorithm. The EM algorithm iterates between two steps. In the first stepthe "expectation" stepthe algorithm calculates the cluster membership of each case. The cluster membership is the probability that a case belongs to a given cluster. In the second step ("maximization"), the algorithm uses these cluster memberships to reestimate the models' parameters, such as the location and scale parameters of Gaussian distribution.
Figure 2 shows a few iterations of the EM algorithm for a one-dimensional data set. The algorithm assumes that the data is drawn from a mixture of Gaussian distributions, more commonly known as bell curves. In the first row of Figure 2, the algorithm initializes the mixture distribution, which is the mixture of several bell curves here. In the second and third rows, the algorithm modifies the mixture distribution based on the data. The iteration stops when it meets specified stopping criteriafor example, when it reaches a certain likelihood-of-improvement rate between iterations.
Most clustering algorithms load all data points into memory, which causes serious scalability problems when you're processing a large data set. However, the Microsoft Clustering algorithm uses a scalable framework that selectively stores important portions of the database and summarizes other portions. The algorithm basically loads data into memory buffers in chunks and, based on the updated data-mining model, summarizes cases that are close together in a Gaussian distribution, thereby compressing those cases. Therefore, the Microsoft Clustering algorithm needs to scan the raw data only once.