DOWNLOAD THE CODE:
Download the Code 97809.zip

For example, suppose you have a table with 249 or more columns and no indexes. (SQL Server allows up to 1,024 columns in a table.) If you execute the sp_createstats procedure in a database, SQL Server will create statistics on every unindexed column in the entire database, and your table will get 249 statistics, with indid values ranging from 2 to 250. If the table has more than 249 columns, the sp_createstats command will actually report an error, telling you that it can’t generate some of the requested statistics. If you then try to create any nonclustered indexes, you’ll get another error because no more indid values are available.

In SQL Server 2005, the index ID values are stored as ints. The sysindexes compatibility view has rows for all indexes and all column statistics. The sys.stats catalog view contains rows for all statistics, both index statistics and column statistics. Each row in sys.stats has a stats_id value that’s unique for the corresponding object. Every index will have a corresponding statistics row in sys.stats with the same name (that is, the name of the index is the same as the statistics name), and the index_id column in sys .indexes is the same as the stats_id value in sys.stats. However, not every statistics row has a corresponding row in sys.indexes. There’s no overlap between the stats_id values for column statistics and the stats_id values for index statistics. This means that in SQL Server 2005, you’re no longer limited in the number of statistics you can have on a table, and having many column-level statistics won’t affect the number of nonclustered indexes you can have.

When Are Column Statistics Useful?

In general, statistics are used to give the optimizer information about how many rows will satisfy a given predicate in a WHERE clause or an ON clause. However, you might wonder what good it will do the optimizer to know how many rows will satisfy a given condition if there’s no index to use to access those rows. For example, you should be aware that a nonclustered index is typically a good choice only when just a few rows need to be accessed. The SQL Server optimizer uses the index statistics to determine whether there are only a few qualifying rows; if so, the optimizer will choose to use the nonclustered index. Column statistics might tell the optimizer that there are only a few rows, but without an index SQL Server can’t find those few rows efficiently.

For a single-table query with no indexes on the table at all, the only purpose that column statistics might serve is to tell you what columns your queries are referencing, to give you a clue about what columns you might want to build indexes on. However, for queries involving multiple tables, or tables that already have indexes, column statistics can serve a very definite purpose. I’ll show you two examples where the existence of column statistics can help SQL Server choose a better plan.

Choosing a Join Order

If you ran the script in Listing 1, you have a Contact table with 19,972 rows and an Employee table with 290 rows. Neither table has an index. Let’s build an index on the join column in the Employee table:

CREATE INDEX Emp_ContactID
ON Employee(ContactID);

The most efficient join plans usually occur when one table has an index on the join column and the other table is quite small or is returning only a few rows. But you have an index on the join column in the small table, and the other table is quite large. If you run the script in Listing 2 and look at the plan for joining these two tables, you’ll see that the optimizer chooses to do a hash join, which is rarely the most efficient type of join operation.

However, if you add a very restrictive WHERE clause and run the code in Listing 3, you should now see a nested loops join in the optimizer’s plan. Because of the automatic statistics on the FirstName column, the optimizer knows that only a few rows from the Contacts table will be joined to the Employees table, so the filtered Contacts table will be accessed first, followed by the index seek on the Employees table.

If auto_create_statistics wasn’t set to ON, however, the optimizer would have no way of knowing how many rows might satisfy the WHERE condition and wouldn’t risk performing the nested loops join with possibly thousands of rows. To see this situation, turn off the SHOWPLAN option and change the auto_create_statistics setting to OFF, by using the following code:

SET SHOWPLAN_TEXT OFF;
GO ALTER DATABASE AdventureWorks
SET auto_create_statistics OFF;

Now rerun the code in Listing 1 to recreate the two tables, then rerun the code in Listing 3. You should see that the optimizer will no longer choose to use the efficient nested loops join and will use the hash join instead. Column statistics on the FirstName column, even without an index, are crucial to helping the optimizer to choose a good join order and join algorithm.

Secondary Nonclustered Index Columns

Another situation in which column statistics can help is when values in the secondary columns in a nonclustered index are distributed very unevenly. You should be aware that for any composite statistics, SQL Server keeps a histogram for the first column only. To see how this works, set SHOWPLAN OFF and turn on auto_create_statistics, as follows:

SET SHOWPLAN_TEXT OFF;
GO
ALTER DATABASE AdventureWorks
SET auto_create_statistics ON;

Then rerun Listing 1. If you build the following index on the Contacts table, SQL Server will create a histogram of values for only the LastName column:

CREATE INDEX NameIndex ON Contact(LastName, FirstName);

Run the code in Listing 4 to view the plans for two similar queries. One looks for first names that start with K, and there are 1,255 of them. The other looks for first names that start with Y, and there are only 37 of them. Although the NameIndex doesn’t have distribution statistics for FirstName, SQL Server has built column statistics. The optimizer realizes that for the names starting with Y, there are few enough that it’s more efficient to scan the leaf level of the nonclustered index and perform a lookup into the base table 37 times. For the 1,255 rows with a first name starting with K, it’s more efficient for the optimizer to do a table scan, as there are only 568 pages in the table. Reading 568 pages in a scan is better than scanning all 101 pages of the nonclustered index, then following 1,255 pointers to the table, which would require another 1,255 page reads. But when looking for names starting with Y, reading 101 pages in the nonclustered leaf, plus doing 37 table lookups, is better than a table scan.

To see the effect of not having column statistics turned on in this case, you could again turn off auto_ create_statistics, as follows:

SET SHOWPLAN_TEXT OFF; GO ALTER DATABASE AdventureWorks SET auto_create_statistics OFF;

Run Listing 1 again to recreate the tables, and rerun the CREATE INDEX command to build the NameIndex index. Then run Listing 4 again, and you’ll see that now the optimizer chooses to do a table scan on both tables. Without column statistics on the FirstName column, the optimizer can’t know that an index scan plus table lookup will be more efficient than just scanning the entire table.

Keeping the Optimizer Informed

As you’ve seen, by allowing SQL Server to both automatically create column statistics and keep all statistics updated as your data changes, you can ensure that the optimizer has additional useful data for determining query plans. Enabling column statistics can go a long way toward helping the optimizer come up with the most favorable plans for your queries.

End of Article

Prev. page     1 [2]     next page -->



You must log on before posting a comment.

If you don't have a username & password, please register now.

Reader Comments

DOWNLOAD THE CODE and Listing links do not work :-(

seequell

Article Rating 3 out of 5

Thanks, interesting artical

shaunt_khaldtiance@hotmail.com

Article Rating 5 out of 5