Learn to trust the optimizer's decisions

If you've upgraded your applications to SQL Server 7.0, you've probably already noticed that many formerly troublesome queries now run in a fraction of the time previously required. Much of this improvement results from a dramatically improved query optimizer, which now has many more options to consider when deciding the best way to execute your query. To make good choices, the optimizer must have meaningful statistical information regarding the distribution of values in the data being queried. In this article, I look at some of the ways that the SQL Server optimizer uses its statistical data. Also, I show how you can look at and interpret SQL Server statistical data, and how SQL Server statistics are maintained.

Index Selection
To determine the most appropriate processing strategy for a particular query, the optimizer gathers and analyzes data on both indexed and nonindexed columns. Whenever you create an index on a table containing data, the optimizer collects and stores statistics for that index. By default, SQL Server automatically updates the statistics whenever it determines that they are out of date. (I discuss the mechanism that SQL Server uses to determine whether the statistics are out of date in the "Updating the Statistics" section.) In addition, the optimizer generates and updates statistics about columns in WHERE clauses.

During query optimization, the optimizer determines whether an index exists for each condition in the WHERE clause. If an index exists, the optimizer first assesses the index's usefulness by determining the selectivity of the clause (i.e., how many rows the query would return), and then estimates the cost of finding the qualifying rows. An index is potentially useful if its first column occurs as a condition in the WHERE clause to establish a lower bound, an upper bound, or both to limit the search. For example, in the following WHERE clause, the condition involving the column name sets a lower and an upper bound for the values in the name column, thereby limiting the number of rows in the result set:

WHERE name LIKE 'M%'

Index Statistics
The optimizer generates two types of statistics: a histogram and density data. Here's a look at how the optimizer gathers and analyzes index data (and data on nonindexed columns) to provide these statistics.

Histogram. A histogram is a graphical representation of a frequency distribution. The histogram is created when an index is created on existing data; running the UPDATE STATISTICS command refreshes the values. If the index is created before data exists in the table, no statistics appear. The histogram would be misleading if statistics were generated when the dispersion of data values significantly differed from the dispersion of the current data in the table.

A histogram consists of a sampling of values for the index key (or the first column of the key for a composite index) based on the current data. In SQL Server 7.0, the histogram can contain as many as 300 values for an index key (or a nonindexed column). When issuing the UPDATE STATISTICS command, you can specify whether you want to gather all occurring values (the FULLSCAN option) or a sample of them. If a table is less than 8MB in size, SQL Server always performs a FULLSCAN. If you don't specify a percentage, SQL Server uses a computed percentage that is a logarithmic function of the table size. When sampling, SQL Server randomly selects pages from the table by following the Index Allocation Map (IAM) chain. (See Inside SQL Server, "The New Space Management," April 1999, for information on IAMs.) After SQL Server selects a page, it uses all the values from that page in the sample. Because disk I/O is the main cost of maintaining statistics, using all values from every page read minimizes maintenance costs.

SQL Server sorts through the values, dividing them into as many as 299 approximately equal intervals. The endpoints of these 299 intervals become the 300 histogram values, which are sometimes referred to as histogram steps. The optimizer's computations assume that the data has an equal number of values between any two adjacent steps, so even if the table or sample has more than 300 rows, the histogram might have fewer than 300 steps to achieve uniformity of step size.

SQL Server stores the histogram's statistics in the statblob field of the Sysindexes table. To fully estimate the usefulness of an index, the optimizer needs to know the number of pages in the table or index; this information is in the dpages column of Sysindexes.

The statblob field in Sysindexes is data type image. (SQL Server stores image data in structures separate from the data row; the data row contains only a pointer to the image data. For simplicity, I refer to the histogram as being stored in the statblob image field.) In addition to the histogram, the statblob field contains the time of the last statistics collection, the number of rows used to produce the histogram and density information, the average row length, and densities for other combinations of columns.

Density data. The statistics information also includes details about the uniqueness, or density, of the data values. Density measures how selective an index is. The more selective an index is, the more useful it is, because higher selectivity means that a query can eliminate more rows from consideration. A unique index is the most selective. By definition, each index entry in a unique index can point to only one row.

The statblob field contains two types of density information: density values and all_density values. The optimizer uses the density and all_density values to determine whether it is better to conduct a table scan or to use an index to access the data. SQL Server computes these values based on the number of rows in the table, the cardinality of the indexed columns (i.e., the number of distinct values), the number of nonfrequent (NF) values (i.e., values that appear no more than once in the histogram steps), and the cardinality of NF values. SQL Server defines these two values as:

density = (NF count/distinct NF count)/(number of rows)
all_density = 1/cardinality of index keys

Density values range from 0 to 1. Highly selective indexes have density values of 0.1 or less. A unique index has a density value of 1/number of rows in the table. For example, a unique index on a table with 8345 rows has a density of 0.00012 (1/8345), which is much less than 0.1. Thus, this unique index is useful because it is selective enough to be more efficient than a table scan.

If the same table has a nonunique nonclustered index with a density of 0.2165, you can expect each index key to point to about 1807 rows (0.2165 * 8345). Driving the query from a nonclustered index means retrieving the pages in index order, one at a time. In other words, SQL Server must individually access an estimated 1807 data pages. Thus, this nonunique nonclustered index is probably not useful because it is not selective enough to be more efficient than a table scan.

In a nonclustered index, the leaf level contains all the index key values, and it is from these key values that SQL Server builds the statistics information. Let's assume that for the nonunique nonclustered index just described, the index leaf pages point directly to the data pages. Also assume that about 40 rows fit on a page, giving you 209 (8345/40) total pages. The chance of two adjacent index entries pointing to the same page is only 0.5 percent (1/209). The number of logical I/Os for just the data pages, not even counting the index I/O, is likely to be close to 1807. In contrast, the entire table can be scanned with just 209 logical I/Os—the number of pages in the table. In this case, the optimizer looks for better indexes or decides that a table scan is the best it can do.

   Prev. page   [1] 2     next page



You must log on before posting a comment.

If you don't have a username & password, please register now.