How to conduct high-powered analyses with simple operations in SQL Server 6.5
Although OLAP Services in SQL Server 7.0 is the hot new tool (see Bob Pfeiff, "OLAP: Resistance Is Futile," page 22, for details), you don't need to feel left out if you're still using SQL Server 6.5. You can use standard SQL-92 syntax in SQL Server 6.5 to slice and dice data for decision support. SQL Server 6.5 includes two extensions to GROUP BYCUBE and ROLLUPthat let SQL Server optionally generate aggregate groupings in a single query. You can summarize data by using GROUP BY with one of the aggregate functions, sum, avg, min, max, or count. This action produces a summary value set from which you can answer questions about the distribution of data values.
For example, the Titles table in the SQL Server pubs sample database contains several values for each type of book and publisher. If you write a GROUP BY query with an aggregate function, one query can answer questions such as, "How many books of type 'business' are there?" "Which publisher publishes the most books of type 'modern cooking'?"
These kinds of questions are the hallmark of decision-support systems, OLAP, and data warehousing. The standard SQL-92 language doesn't provide a way to produce a result set that would answer every question about the various data value combinations. In this article, I'll demonstrate the functionality the CUBE and ROLLUP options add to SQL Server queries. These options provide the kinds of summary information that data warehousing applications supply. You can use CUBE and ROLLUP to generate and save this summary information so you won't have to write a separate query for each type of summary information you're interested in.
What Is a Cube?
For the examples in this article, I'll use a software distribution business that sells Microsoft Office products to stores in North America. The table product_sales1 tracks the number of each product sold over some period. This example tracks only sales of Word and Excel. Because the business covers North America, this example tracks the number of each product sold to each of the three countries in North America, in each of three languages. Assume we have the raw data listed in Table 1, the product_sales1 table, on page 30.
To answer questions regarding the number of a product sold in one country, you could formulate a GROUP BY query with an aggregate. For example, you could determine the number of products sold in each language in each country with the following query:
select language, country,
Total_units_sold = sum(units_sold)
from product_sales1
group by language, country
This code would return the results you see in Table 2. From these query results, you could answer: "How many French products were sold in Mexico?" and "In which country were the most Spanish-language products sold?" However, you couldn't answer the question: "Which product sold the best in Canada?" or "How many English-language products were sold?" without doing additional computations. You need to write additional queries to get those answers.
To generate a result set that contains the answer to every question that you could ask regarding the quantities sold per language, per country, or per product, you could generate a cube. A cube represents three columns with changing values (product, language, and country) as its dimensions. A point in the cube's interior represents the quantity sold for specific values of each of the three columns. So from the data in Table 1, the interior point where Word, English, and Canada meet would have the value 49. Figure 1 can help you visualize this point. This figure also shows a cross tab, which is the result of every possible combination of two values.
A cube combines data to produce a result set containing a superset of SQL Server groups. The groups show each value of every grouped column with each possible value of every other grouped column. A cube also contains special values called superaggregates, which represent all values for a particular column. These are the values on the sides and edges of the cube structure in Figure 1. Table 3, page 32, represents the cube for the 18 rows of the product_sales1 table.
In this result set, every existing value for language is combined with every existing value for country, and each combination of language and country is combined with every existing value of product. The rows with <all> are the superaggregates, and they provide additional information. For example, the row with the values <all>, Mexico, French, and a units value of 58 shows that the total of French-language products sold in Mexico for all products is 58 units. The last row with <all>, <all>, Spanish, shows that the total of all products in all countries for the Spanish language was 369 units.
The number of rows SQL Server generates for a cube will depend on the number of attributes you want to group and on the combinations of attributes that exist in the data. Without detailed knowledge of the data, you can't predict the number of rows a cube operation will produce. However, the upper bound will equal the cross product of the number of distinct values + 1 for each attribute. The addition of 1 is for the case of <all>. The product_sales1 example has three attributes: product, country, and language. The 18 rows of the product_sales1 table generate 48 rows in the cube.
Upper bound of number of rows
= (Number of products+1) * (Number
of countries+1) * (Number of languages+1)
= (2+1) * (3+1) * (3+1)
= 48
This upper bound depends only on the number of possible attribute values and is not related to the number of rows in the base table. Even if the table had 10 million rows of sales data, if the data were limited to two products, three countries, and three languages, the cube would never have more than 48 rows.
The CUBE Operator
In the data in Table 3, the placeholder <all> represents the superaggregates. So if you want to know how many French-language products were sold in Mexico, regardless of the specific product, you would find the row where language = 'French', country = 'Mexico', and product = '<all>'.
A simple GROUP BY column WITH CUBE will not show the value <all> in the super-aggregate rows. The <all> value is a nonvalue. Transact-SQL (T-SQL) will return this value as null, but it's a type of null called a grouping null. When used with the CUBE extension, null means "all existing values for this data column." A new function called GROUPING( ) lets you distinguish this special-purpose null from a traditional null, which means "no data available." I'll demonstrate this function in the next section. Table 4 shows a T-SQL query using CUBE and the results the query would return from the data shown above.
Earlier, I showed the formula to predict the upper limit of possible values that a cube can return. SQL Server will return that upper limit only if every possible combination of product, country, and language appears in the data. If the data set is missing a few combinations for a particular time period, the cube will have fewer rows. You can use SQL Server's ISNULL( ) function to get the <all> to appear in the output instead of a null placeholder.
select units = sum(units_sold), product =
isnull(product, '<all>'),
country = isnull(country, '<all>'),
language = isnull(language, '<all>')
from product_sales2
group by product, country, language
with cube
Prev. page  
[1]
2
3
next page