Basket analysis is tricky with an OLAP product
Have you ever needed to answer a question such as, Which products have the most customers? or, How many customers bought these two products together? or, What percentage of our customers buy fewer than 10 items at a time? These questions are examples of distinct count queries, which are tricky to answer with an OLAP product because OLAP products usually store only summarized information.
Distinct count queries differ from normal summations in that queries about items at a dimension-hierarchy level always require the lowest-level hierarchy details to determine the result. For example, if you're counting the number of distinct products purchased in the first quarter, you can't just add the distinct products from January, February, and March. You must know all the products purchased in each of these months to determine the total value for the first quarter because you have to eliminate all the duplicate products. In this article, I explore MDX solutions that either SQL Server 7.0 OLAP Services or SQL Server 2000 Analysis Services can use for problems such as these. I also point out a new Analysis Services feature that can make solving these problems more straightforward.
Let's give credit where credit is due. This article is based in part on a white paper that Amir Netz (Microsoft OLAP Services architect) wrote about basket analysis and OLAP Services. Basket analysis typically refers to retail sales analysis, in which the basket is the group of products purchased in one transaction. If basket analysis is relevant to your work, I highly recommend reading Netz's white paper; he covers some of the same material I do, but in more depth. You can find his white paper "OLAP Services: DISTINCT COUNT and Basket Analysis" on the Microsoft Developer Network (MSDN) Web site.
Market Basket Analysis
I've heard the term basket analysis, or market basket analysis, used in a variety of ways. The most common use refers to a data-mining technique that discovers relationships among products. The classic discovery example is the relationship between diapers and beer. Someone, somewhere, used market basket analysis of the purchasing habits of male customers to show a strong correlation between buying diapers and buying beer. The theory is that men use a run to the store for diapers as an opportunity to replenish their beer supply. Although I'm a father of young twins, I will neither confirm nor deny this theory. Nevertheless, a store manager armed with this information might reorganize shelf space to encourage this behavior and potentially increase beer sales.
The basket analysis technique that I discuss here isn't really data mining because it's not an automated or computer-guided search for correlations. I concentrate on distinct count queries, which are the basis for basket analysis. You must be able to ask distinct count questions to perform manual or automatic searches for correlations.
Distinct count queries are useful in a variety of applications, not just retail sales or marketing applications. In fact, distinct count queries might be useful in any situation in which you have a many-to-many (M:N) relationship among entities. For example, in a call-center analysis application, you could determine how many customer calls were requests for both an address change and a new product purchase. Or in a manufacturing application, you might try to find correlations between types of product failures and types of product components.
Which Products Have the Most Customers?
Now let's examine the following question in detail: Which products have the most customers? A standard cube for sales analysis has Customer, Product, and Time dimensions, and the fact table contains Sales transactions. An individual transaction might say, Customer A bought Product B at Time C for some amount of money. If you create a count measure for this cube, the count might tell you the total number of transactions for any Customer, Product, and Time element combination. This count is useful, but it isn't what I'm looking for. I don't care if the same customer purchased a particular product many times; I want to count unique customers, not unique transactions. To get the number of unique customers for a product, you can use the following MDX expression:
Count( Filter( Customers.Name.Members, Not
IsEmpty( (Product.CurrentMember,
Customers.CurrentMember) ) )
This expression filters all customer names by eliminating those that don't have transactions for the currently selected product. A shorter way to express the same concept is
Count( {Unit Sales} *
Customers.Name.Members, ExcludeEmpty )
This formula cross-joinsdenoted with the asteriskUnit Sales and Customers and counts the non-empty items.
Asking distinct count questions about a subset of customers might also be useful. For example, a variation of the question might be, Which products have the most California customers? You can generalize the count formula to deal with a selected customer subset by replacing the full set of customers with only the descendants of the currently selected customer:
Count( {Unit Sales} * Descendants
( Customers.CurrentMember,
Customers.Name ), ExcludeEmpty )
To create the full query that answers the question, Which products have the most California customers? you might use an MDX expression to create a calculated member. In the MDX query that Listing 1 shows, I limited customers to those in California so that the query wouldn't take too long to execute. On my 600MHz laptop, the query takes about 2.5 minutes to execute.
In Analysis Services, distinct count is now a built-in aggregation type. Thus, you could make a Unique Customers measure when you build the cube. You set the measure type to distinct count and identify the fact-table source column whose contents the measure should count. In this case, you want Unique Customers to count customer names.
Prev. page  
[1]
2
next page