But if you need to compute count(<column>), SQL Server needs to know how many rows for that particular column have a value. So, you might think that the particular column needs to be part of any covering index. In SQL Server 7.0 and earlier, an index has to include the aggregated column to be used as a covering index. However, the SQL Server 2000 query optimizer can figure out that if the column doesn't allow NULLs, then count(<column>) is the same as count(*) because all the rows must contain a non-NULL value for the column.
The script in Listing 2 illustrates this point by creating a copy of the Order Details table, then building a clustered index on OrderID and a nonclustered index on ProductID. The count() aggregation is on the Quantity column, which initially doesn't allow NULLs. The query plan for a query involving count(*) is the same as the plan for a query involving count(Quantity) and includes stream aggregation because the data is already sorted by ProductID. Turning on STATISTICS IO shows the same number of logical reads for both queries. The script then drops the Quantity column and recreates it, allowing NULLs. The same query's plan now uses hash aggregation, and the number of logical reads increases.
Besides making use of a covering nonclustered index, a GROUP BY query could also use a clustered index on the grouping column because, just as in the covering case, the data is already in the proper order for stream aggregation. The query plan for Query 8 shows a clustered index scan and no sorting; the plan doesn't even include any aggregation operators:
-- Query 8:
SELECT OrderID, sum(Quantity)
FROM [Order Details]
GROUP BY OrderID
As I mentioned, in SQL Server 2000 and 7.0, ordering of the output rows isn't guaranteed unless you include an ORDER BY clause (or you're running in 60 or 65 compatibility mode). The following examples show what happens to the query plan when you add an ORDER BY statement to sort by the grouping column.
The plan for Query 9 uses stream aggregation, so the data already comes back sorted by the grouping column:
-- Query 9:
-- Stream aggregation, with or without ORDER BY
SELECT ShipName, max(Freight)
FROM Orders
GROUP BY ShipName
--ORDER BY ShipName
If you run the query with and without including the ORDER BY, the plans are the same. If the original query plan for a GROUP BY query uses hash aggregation, adding an ORDER BY clause forces a sort operation. However, the sort can occur either before or after the aggregation. I've found that because SQL Server can quickly sort small amounts of data, in most cases involving very small tables (less than 1000 rows or so), SQL Server performs the sort before the aggregation. The plan for Query 10 in Listing 3 shows hash aggregation; when Query 11 in Listing 3 adds ORDER BY, the plan changes to include the SORT operation before the aggregation. Because the data is sorted first, the aggregation is a stream aggregate.
Query 12 in Listing 4 also has a plan with a hash aggregation, but that query is based on the larger Order Details table. Query 13 in Listing 4 shows that the sort takes place after the hash aggregation.
In many cases, SQL Server can efficiently process queries that involve GROUP BY or DISTINCT by using either a stream aggregation on sorted data or a hash aggregation on unsorted data. You can tune these queries by either making sure that a covering nonclustered index exists or by creating a table's clustered index on the columns in the GROUP BY clause or on the columns in the SELECT list after the DISTINCT keyword.