DOWNLOAD THE CODE:
Download the Code 22075.zip

Index statistics are the optimizer's little helper

One of the most important aspects of query tuning is making sure that appropriate indexes are available. In "Are You in Tune?" July 2001, I defined the physical differences between the two kinds of indexes: clustered and nonclustered. In "Time for a Tune-Up," August 2001, and "The Big Cover-Up," September 2001, I showed how you can use indexes to make your queries perform more efficiently. To fully understand indexes and query tuning, you need to consider one more important aspect of their relationship: How does SQL Server decide which index, if any, to use for optimum query performance?

For example, let's look again at Query 1, which I've used in previous articles:

-- Query 1:
USE Northwind
SELECT * FROM orders
WHERE customerID = 'CENTC'

Because only a few rows satisfy this query's search argument, the optimizer chooses a nonclustered index for a seek on the customerID column. If you change the customerID value you're looking for to ERNSH, the optimizer chooses to scan the whole table because many more rows have that value. The query plan says that SQL Server is performing a clustered index scan, but because the leaf level of a clustered index is the data itself, scanning the clustered index is equivalent to scanning the whole table.

Without running the queries, how did the optimizer know that only a few rows would satisfy the request for customerID = 'CENTC' and many rows would satisfy the request for customerID = 'ERNSH'? The optimizer uses index statistics to get the information it needs to determine whether an index is useful. You can think of statistics as a histogram consisting of samplings of values for the index key (or the first column of the key, for a composite index) based on the current data. SQL Server stores these histograms in the sysindexes table's statblob field, which is an image data type. I discussed the organization and use of statistics for SQL Server 7.0 in "SQL Server Statistics: A Useful Query Optimizer Tool," August 1999. Microsoft changed the internal organization of statistics for SQL Server 2000, but the principles are the same, so I won't go into detail in this article. Just be aware that the index statistics tell the optimizer approximately how many times a particular value in a search argument appears in the data. In general, if the number of occurrences is very small, a nonclustered index might be useful. If the optimizer estimates that the number of occurrences of a value is so large that seeking through the nonclustered index and following the pointers for every qualifying data row will be more expensive, the optimizer chooses to have SQL Server scan the whole table.

By examining the query plan, you can see the optimizer's estimate of the number of qualifying rows. When you use the graphical plan in Query Analyzer, you can put your mouse pointer over the icon for the index SQL Server will use, and a yellow pop-up information box shows an estimated row count value. If you're using SET SHOWPLAN_ALL ON to look at the query plan, EstimatedRows is one of the result columns.

In the query plan for Query 1 with either 'CENTC' or 'ERNSH' as the search value, the optimizer's estimate is exactly correct. The optimizer's estimate is usually this accurate only when the table is very small or the number of distinct values is very small. The Orders table contains only 89 distinct customerID values, and because statistics contain detailed information about up to 200 specific data values, the statistics can exactly represent every distinct value. For indexes on larger tables with thousands or millions of rows or distinct values, the estimates aren't always exact, but they're usually close.

As of SQL Server 7.0, you can also create statistics on columns even without creating an index on those columns. In fact, by default, SQL Server 2000 and 7.0 build statistics on any column you use in a search argument, if that column doesn't already have an index. These automatically created statistics have names that start with _WA_sys. In SQL Server 7.0, using the system stored procedure sp_helpindex to ask for a report on all the indexes returns the names of these statistics along with the regular indexes. You can't drop these "indexes" by using the DROP INDEX command; instead, you have to use the DROP STATISTICS command. However, these statistics can be quite useful and they take up very little space, so you probably have no reason to drop them. In SQL Server 2000, these statistics still exist, but the rewritten sp_helpindex procedure doesn't return nonindex statistics. To see a list of all the statistics on a SQL Server 2000 table, you can use the stored procedure sp_helpstats.

So, what can you use statistics for? Suppose you had statistics on the Orders table's customerID column but no index on that column. Creating statistics without an associated index might seem like wasted effort: Even if the statistics told the optimizer that only one row satisfied a query, what good would it do? Without an index to help find that row, SQL Server still has to scan the entire table. However, even without a corresponding index, you can benefit from having statistics because they help the optimizer figure out the best way to process join operations. (Join processing is a huge topic, so I'll save that discussion for another month.)

   Prev. page   [1] 2     next page



You must log on before posting a comment.

If you don't have a username & password, please register now.

Reader Comments

It explained an issue, which has puzzled me for some time (the concrete use of auto-created statistics).

Jesper Jensen