DOWNLOAD THE CODE:
Download the Code 43981.zip

A data warehouse contains data that has been aggregated for data analysis. You aggregate measures (such as quantities and amounts) for different entities (such as employees, customers, and products). You also determine the time units (e.g., day, month, year) for the aggregations according to your analysis needs. To allow fast query response, you might keep multiple summary tables, each containing a different time unit for the aggregations—for example, a table for daily aggregations, another for monthly aggregations, and a third for yearly aggregations. A group of calculations called cumulative aggregations requires you to perform additional manipulation of the summary tables in your data warehouse. They're called cumulative because you need to return a cumulative aggregate of all the rows either from a certain point in time until the current row's point in time or until some condition is met. For example, for each employee and month, you could return the current month's order quantity and the total number of orders from the beginning of the employee's activity until the current month. Let's look at five example cumulative-aggregation problems and explore their solutions.

Cumulative Aggregations
The examples use an employee monthly orders summary scenario. First, run the code in Listing 1 to create the sample EmpOrders table and populate it with sample data—specifically, monthly order quantities placed by each employee. Each row contains the order month and year in the ordermonth column, the employee ID in the empid column, and the total monthly order quantity in the qty column. My design decision to use a datetime data type for the ordermonth column might seem strange. You need to store only a year and a month to represent the unit of time in this case, but a valid datetime value must also contain a day portion. To populate the ordermonth column with valid dates, you can specify the first of each month in the day portion and ignore it when you analyze the data. So why don't I store the year and month in a character-based column in the form YYMM if the day portion isn't required? I prefer to store years and months in a datetime column because doing so allows easy manipulation (such as calculating date differences or adding date parts) through datetime functions. Such manipulation isn't possible with integers or character strings that contain only the year and month portions of a date. I don't use datetime functions in this article, but I wanted to follow my best practices for datetime values in this case. Now let's walk through the five cumulative-aggregation problems.

1. Cumulative totals for each employee by month. Suppose you want to analyze the progress each employee made during a period of time. For each employee and month, you want to return the total and average order quantities from the employee's first order month through the current month. Figure 1 shows the desired result (abbreviated). The solution for this request is fairly simple. You use a join to match each row from one instance of the EmpOrders table (call it O1) with all the rows from another instance of the same table (call it O2) that have the same employee ID and an earlier or identical order month. Because a row in O1 might have more than one match in O2, SQL Server duplicates each row from O1 for every match in O2 that contains the quantities you want to aggregate. You then group the result by all attributes you want to return from O1—in this case, O1.empid, O1.ordermonth, and O1.qty. Finally, you return those attributes together with the total and average quantities from O2, as Listing 2 shows.

2. Cumulative totals up to 500 for each employee. A slight twist to the first problem is to return only the cumulative aggregations up to the month the cumulative total quantity reaches or passes a certain threshold—say, 500. Figure 2 shows the desired result (abbreviated). You need to add a filter to the query, and you must specify the filter in the HAVING clause because SQL Server applies the HAVING clause to the grouped rows. SQL Server applies the other filters that you can specify in a query (ON and WHERE) before grouping the rows. You might be tempted to use the filter SUM(O2.qty) <= 500; however, such a filter won't return the correct result for an employee unless the cumulative total for a month reaches exactly 500. If the month in which the total reaches 500 has a cumulative total greater than 500, you won't get the row that contains that order month in the result. So, you need to devise a sophisticated filter that can accommodate a situation where cumulative totals don't reach exactly 500. Because each row in the grouped result contains both a cumulative quantity (SUM(O2.qty)) and the current month's quantity (O1.qty), you can use the filter

SUM(O2.qty)  SUM(O1.qty) < 500

This logical expression is true for all months in which the total quantity is less than 500 and for the first month in which the total reaches or surpasses 500. Listing 3 shows the complete solution query for problem 2.

3. Cumulative total reaches 500 for each employee. The third problem adds another twist to the original problem. Suppose you want only one result row for each employee, showing where the employee reached or passed the 500-item barrier; you're not interested in the history leading to that point. To achieve the desired result, which Figure 3 shows, you need to revise the HAVING filter from the previous query so that the logical expression returns FALSE for all rows except the last. Think about the problem this way: The last row is unique because it's the only row where the total quantity reaches or passes 500. So all you need to add to the HAVING filter is AND SUM(O2.qty) >= 500, as Listing 4's query shows.

   Prev. page   [1] 2     next page



You must log on before posting a comment.

If you don't have a username & password, please register now.

Reader Comments

In my understanding the primary index includes an unnecessary field - qty. I can not see the reason and also it makes the index wider...

pkpetroff

Article Rating 4 out of 5

Excellent article

Valdemar from brazil

Anonymous User

Very Intresting article!!!

BTW (for pkpetroff ;-). He have added the "unnecessary" field so he will have "covering Index" with this field...

Karmona,moti

Article Rating 5 out of 5