More ways that indexes can make or break your query performance
SQL Server internally processes GROUP BY and DISTINCT clauses in a similar manner. For both types of queries, SQL Server returns one row of output for each distinct value of a column or set of columns in the input. The difference is that with a GROUP BY query, you can optionally include an aggregation such as count(), sum(), or avg() to be performed on all the rows with matching values for the grouping column. For example, using the Northwind database's Orders table, suppose you want to know which customers have placed orders. In other words, you want to see a list of distinct customer IDs. In T-SQL, you can write the query like this:
-- Query 1:
SELECT DISTINCT customerID
FROM Orders
Alternatively, you can get the same output by using GROUP BY (without an aggregate function):
-- Query 2:
SELECT customerID
FROM Orders
GROUP BY customerID
SQL Server 2000 and 7.0 process both queries in exactly the same way, and the results are identical. The queries return one row of output for each unique CustomerID value. The following plan shows that to process this query, SQL Server uses a technique called stream aggregation:
|--Stream Aggregate(GROUP BY:([Orders]
.[CustomerID]))
|--Index Scan(OBJECT:([Northwind]
.[dbo].[Orders].[CustomerID]), ORDERED FORWARD)
I discuss some details of this plan later. But first, let's look at another example with GROUP BY.
The GROUP BY construct lets you perform further processing for each customer ID beyond simply listing the distinct values. By adding an aggregate function like sum() to the SELECT list, you're asking for summary information for each unique customer ID. The following query requests the sum of all the freight charges for each customer:
-- Query 3:
SELECT customerID, sum(freight)
FROM orders
GROUP BY customerID
Again, the query returns only one row for each customer, no matter how many rows in the Orders table have that CustomerID value. And for each customer ID, you get summary information based on all the rows that have that CustomerID value.
However, as Figure 1 shows, the query plan for Query 3 is a bit different from the previous two; it uses a technique called hash aggregation, which doesn't imply any sorting. I'll forgo the discussion of exactly what hash aggregation involves until next time.
Stream aggregation and hash aggregation are the two ways that SQL Server can process both GROUP BY and DISTINCT queries. For stream aggregation, SQL Server sorts all the data first. For a GROUP BY query, the grouped columns are the sort key; for a DISTINCT query, the sort key comprises the columns in the SELECT list. After the data is sorted, isolating the distinct values and returning one row for each is relatively straightforward. When SQL Server uses stream aggregation to process GROUP BY or DISTINCT, the results are returned in sorted order. If you execute Queries 1 and 2 above, SQL Server automatically sorts the result set by CustomerID. When you execute Query 3, which uses hash instead of stream aggregation, the data comes back in seemingly random order.
Before SQL Server 7.0, SQL Server could perform GROUP BY and DISTINCT operations only by using stream aggregation, so the data always came back sorted. Consequently, many SQL Server programmers erroneously assumed that GROUP BY implied sorting and were confused when some grouped results in SQL Server 7.0 didn't come back sorted. But when the optimizer in SQL Server 2000 or 7.0 decides to use hash aggregation, the results aren't ordered. If you need data in a particular sorted order, you must use an ORDER BY clause in your query.