• subscribe
April 26, 2001 12:00 AM

Distinct Count Queries

SQL Server Pro
InstantDoc ID #20169
Downloads
20169.zip

You need to be aware of some limitations that Analysis Services puts on distinct count measures. One such limitation is that the cube with the distinct count measure can't contain a dimension that has custom rollup operators or custom rollup formulas. Another limitation is more significant: A cube can have only one distinct count measure. Fortunately, this limitation applies only to nonvirtual cubes. In fact, Microsoft recommends that you make as small as possible any cube that has a distinct count measure, then combine that cube with another cube to form a virtual cube. The smallest cube possible would contain no measures other than the distinct count measure and only the dimensions relevant for distinct count queries.

If you isolate the distinct count measure in a small cube, Analysis Services will likely use less memory and perform distinct count queries much more quickly. This approach improves performance because Analysis Services can determine that a distinct count requires only the information from the smaller distinct count cube. Therefore, Analysis Services accesses much less information to determine the result. This cube structure optimization is particularly important with distinct count because all distinct count queries access the dimensions' most detailed information. Analysis Services usually accesses only summarized information to execute non-distinct count queries.

Which Products Are Purchased Together?
Classic market basket analysis also deals with questions such as, Which products are purchased together? Before getting into the MDX you need for a basket analysis query, let's discuss the definition of together. Netz's white paper contains a good discussion of this concept. To summarize, Sales cubes don't typically retain invoice information such as sales receipts, so a cube can't use an invoice to decide whether two products were purchased together.

Another point Netz makes is that in decision making, treating two transactions that occur within a certain elapsed time as the same transaction can be important. For example, a department store customer might buy something in the store, then continue shopping, find something else, and make another purchase. This customer's transactions would be on the same day, but perhaps not on the same cash-register receipt. Netz's point is that for decision-making purposes, using a time period—such as a day or a week—to bind a transaction to a customer is adequate and probably preferable. This approach works well with OLAP because multidimensional databases almost always have Time dimensions, and MDX is good at dealing with time periods. In the FoodMart 2000 Sales cube, the Time dimension doesn't break down into days; so for these MDX examples, I use the smallest possible time unit represented in the Sales cube—months.

Now let's look at a specific example of a basket analysis question: What beer product is purchased most frequently with pizza? I can imagine a store manager wanting to know the answer to this question to decide whether to put a special beer display next to the pizza freezer. To solve this problem, I broke the question into pieces to make it easier to understand.

First, I figured out the number of qualifying transactions that a particular customer had. The following MDX statement calculates the qualifying transactions by filtering out all the time periods that don't include both pizza and the current product. The number of remaining time periods equals the number of transactions.

Count(Filter(Time.Month.Members, Not
 IsEmpty((Unit Sales, Pizza)) and Not
 IsEmpty((Unit Sales, Product.CurrentMember))))

Second, I needed to determine the total number of transactions for all customers. Cross-joining the selected customers with the number of transactions per customer and summing the results accomplished this step:

Sum( {Tx per Customer} *
 Descendants(Customers.CurrentMember,
  Customers.Name))

Listing 2 shows how you can combine the previous two formulas to create a full MDX query that answers the question. In a real basket analysis application, limiting queries, as I did here with California, might be useful because the total number of transactions isn't important; finding the relationships among products is. You can find the relationships by sampling only part of the data. So, the example that Listing 2 shows is valid (as long as you believe that customers in California form a representative sample). Figure 1 shows the answer to this query. The results are disappointing because they aren't what you'd expect. If FoodMart contained actual purchase data, I'm sure that a strong correlation between pizza and beer would exist!

You can easily extend the solution for the purchased-together problem to check for occurrences of more than two products together. Just change the formula for Tx per Customer to include more products. Unfortunately, Analysis Services' new distinct count aggregation type doesn't help with this type of query. The distinct count aggregation works only if you want the distinct number of items in a fact-table column. However, the purchased-together example searches for the distinct number of a particular combination of items in a column.

Cube Analysis
The kinds of questions that distinct count queries can answer are invaluable in analyzing retail product sales. The example I used—What beer product is purchased most frequently with pizza?—is a good example of this type of question. However, this type of analysis is also useful in any cube. Look at the human resources cube in the FoodMart 2000 database as an example. The Number of Employees measure is a distinct count measure that you can use to determine how many employees work at a particular store or under a certain manager. And because the underlying fact table contains paycheck transactions, you can't determine these answers with a typical count measure. Note that you can use the techniques I demonstrated in this article to ask distinct count questions in your applications as well. Test your distinct count query skills by trying to answer the "May MDX Puzzle," page 64. "April MDX Puzzle Solution Revealed," page 65, shows the solution to last month's puzzle. Until next time, send your article ideas to me at olapmasters@sqlmag.com.



ARTICLE TOOLS

Comments
  • Johnson Au-Yeung
    11 years ago
    Nov 30, 2001

    Your example covers much of the same ground as Amir's.
    I am very much interested in something you said about using SQL 2000's distinct count aggregation type.
    Is there any way to use this new aggregation in an MDX query to speed up a basket analysis query - like your pizza example?

You must log on before posting a comment.

Are you a new visitor? Register Here